 Stochastic contextfree grammar

A stochastic contextfree grammar (SCFG; also probabilistic contextfree grammar, PCFG) is a contextfree grammar in which each production is augmented with a probability. The probability of a derivation (parse) is then the product of the probabilities of the productions used in that derivation; thus some derivations are more consistent with the stochastic grammar than others. SCFGs extend contextfree grammars in the same way that hidden Markov models extend regular grammars. SCFGs have application in areas as diverse as Natural language processing to the study of RNA molecules. SCFGs are a specialized form of weighted contextfree grammars.
Contents
Techniques
A variant of the CYK algorithm finds the Viterbi parse of a sequence for a given SCFG. The Viterbi parse is the most likely derivation (parse) of the sequence by the given SCFG.
The InsideOutside algorithm is an analogue of the ForwardBackward algorithm, and can be used to compute the total probability of all derivations that are consistent with a given sequence, based on some SCFG. This is equivalent to the probability of the SCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar.
The Inside/Outside algorithms can also be used to compute the probabilities that a given production will be used in a random derivation of a sequence. This is used as part of an Expectationmaximization algorithm to learn the maximum likelihood probabilities for an SCFG based on a set of training sequences that the SCFG should model. ^{[1]} The algorithm is analogous to that used by hidden Markov models.
Applications
Natural language processing
Contextfree grammars were originally conceived in an attempt to model natural languages, i.e. those normally spoken by humans.^{[2]} Such grammars were originally conceived as sets of rules (productions), typically specified in a syntax such as BackusNaur Form, where the rules are absolute; such grammars are still appropriate for numerous situations (notably, the design of programming languages). Probabilistic CFGs are an attempt to deal with difficulties encountered when attempting to extend CFGs to deal with the complexities of natural languages. Specifically, one often finds that more than one production rule may apply to a sequence of words, thus resulting in a conflict: an ambiguous parse. This happens for several reasons, such as homographs  identically spelled words with multiple meanings  so that the same word sequence can have more than one interpretation. A pun is an example of an ambiguous parse used deliberately to humorous effect, such as in the newspaper headline "Iraqi Head Seeks Arms".
One strategy of dealing with ambiguous parses (originating with grammarians as early as Pāṇini) is to add yet more rules, or prioritize them so that one rule takes precedence over others. This, however, has the drawback of proliferating the rules, often to the point where they become hard to manage. Another difficulty is overgeneration, where unlicensed structures are also generated. Probabilistic grammars circumvent these problems by using the frequency of various productions to order them, resulting in a "most likely" (winnertakeall) interpretation. As usage patterns are altered in diachronic shifts, these probabilistic rules can be relearned, thus upgrading the grammar.
One may construct a probabilistic grammar from a traditional formal syntax by assigning each nonterminal a probability taken from some distribution, to be eventually estimated from usage data. On most samples of broad language, probabilistic grammars that tune these probabilities from data typically outperform handcrafted grammars (although some rulebased grammars are now approaching the accuracies of PCFG).
Here is a tiny example of a 2rule PCFG grammar. Each rule is preceded by a probability that reflects the relative frequency with which it occurs.
 0.7 VP → V NP
 0.3 VP → V NP NP
Given this grammar, we can now say that the number of NPs expected while deriving VPs is 0.7 x 1 + 0.3 x 2 = 1.3.
The probabilities are typically computed by machine learning programs that operate (learn or "train") on large volumes of annotated text ("corpora"), such as the Penn TreeBank, which contains a large collection of articles from the Wall Street Journal. As such, a probabilistic grammar's validity is constrained by context of the text used to train it; the definition of "most likely" is likely to vary with the context of discourse.
An example of a probabilistic CFG parser that has been trained using the Penn Treebank is the Stanford Statistical Parser of Klein and Manning^{[3]}, downloadable at http://nlp.stanford.edu/software/lexparser.shtml
In particular, some speech recognition systems use SCFGs to improve their probability estimate and thereby their performance.^{[4]}
RNA
Contextfree grammars are adept at modeling the secondary structure of RNA^{[5]}^{[6]} Secondary structure involves nucleotides within a singlestranded RNA molecule that are complementary to each other, and therefore base pair. This base pairing is biologically important to the proper function of the RNA molecule. Much of this base pairing can be represented in a contextfree grammar (the major exception being pseudoknots).
For example, consider the following grammar, where a,c,g,u represents nucleotides and S is the start symbol (and only nonterminal):
 S → aSu  cSg  gSc  uSa
This simple CFG represents an RNA molecule consisting entirely of two wholly complementary regions, in which only canonical complementary pairs are allowed (i.e. AU and CG).
By attaching probabilities to more sophisticated CFGs, it is possible to model bases or base pairings that are more or less consistent with an expected RNA molecule pattern. SCFGs are used to model the patterns in RNA gene families in the Rfam Database, and search genome sequences for likely additional members of these families. SCFGs have also been used to find RNA genes using comparative genomics. In this work, homologs of a potential RNA gene in two related organisms were inspected using SCFG techniques to see if their secondary structure is conserved. If it is, the sequence is likely to be an RNA gene, and the secondary structure is presumed to be conserved because of the functional needs of that RNA gene. It has been shown that SCFGs could predict the secondary structure of an RNA molecule similarly to existing techniques: such SCFGs are used, for example, by the Stemloc program.
Linguistics and Psycholinguistics
With the publication of Gold's Theorem^{[7]} 1967 it was claimed that grammars for natural languages governed by deterministic rules could not be learned based on positive instances alone. This was part of the argument from the poverty of stimulus, presented in 1980^{[8]} and implicit since the early works by Chomsky of the 1950s. Among other arguments, this led to the nativist (but still unproven) view, that a form of grammar, including a complete conceptual lexicon in certain versions, were hardwired from birth. However, Gold's learnability result can easily be circumvented, if we either assume the learner learns a nearperfect approximation to the correct language or that the learner learns from typical input rather than arbitrarily devious input. In fact, it has been proved that simply receiving input from a speaker who produces positive instances randomly rather than according to a prespecified devious plan leads to identifiability in the limit with probability 1.^{[9]}^{[10]}
Recently, probabilistic grammars appear to have gained some cognitive plausibility. It is well known that there are degrees of difficulty in accessing different syntactic structures (e.g. the Accessibility Hierarchy for relative clauses). Probabilistic versions of Minimalist Grammars have been used to compute informationtheoretic entropy values which appear to correlate well with psycholinguistic data on understandability and production difficulty.^{[11]}
See also
References
 ^ Aycinena, M.A., (2005,) (PhD thesis). Probabilistic geometric grammars for object recognition, (Thesis).
 ^ Noam Chomsky: Syntactic Structures. Mouton & Co. Publishers, Den Haag, Netherlands, 1957
 ^ Klein, Daniel; Manning, Christopher (2003). "Accurate Unlexicalized Parsing". Proceedings of the 41st Meeting of the Association for Computational Linguistics: 423–430.. http://nlp.stanford.edu/~manning/papers/unlexicalizedparsing.pdf.
 ^ Beutler, R; Kaufman, T; Pfister, B. (2005). "Integrating a nonprobabilistic grammar into large vocabulary continuous speech recognition". IEEE Workshop on Automated Speech recognition and Understanding: 104–109. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1566496.
 ^ R. Durbin, S. Eddy, A. Krogh, G. Mitchinson, ed (1998). Biological sequence analysis : probabilistic models of proteins and nucleic acids. Cambridge University Press. ISBN 9780521629713. http://books.google.com/books?id=R5P2GlJvigQC. This bioinformatics textbook includes an accessible introduction to the use of SCFGs for modelling RNA, as well as the history of this application to 1998.
 ^ Eddy SR, Durbin R (June 1994). "RNA sequence analysis using covariance models". Nucleic Acids Res. 22 (11): 2079–88. doi:10.1093/nar/22.11.2079. PMC 308124. PMID 8029015. http://nar.oxfordjournals.org/cgi/pmidlookup?view=long&pmid=8029015.
 ^ Gold, E. (1967). "Language identification in the limit". Information and Control 10 (5): 447–474. doi:10.1016/S00199958(67)911655.
 ^ Chomsky, N. (1980). Rules and representations. Oxford: Basil Blackwell.
 ^ Clark, A. (2001). "Unsupervised Language Acquisition: Theory and Practice" (PhD thesis). arXiv:cs.CL/0212024 [cs.CL].
 ^ Horning, J.J. (1969). A study of grammatical inference (Ph.D. thesis). Computer Science Department, Stanford University.
 ^ John Hale (2006). "Uncertainty About the Rest of the Sentence". Cognitive Science 30 (4): 643–672. doi:10.1207/s15516709cog0000_64.
Further reading
 Rivas E, Eddy SR (2001). "Noncoding RNA gene detection using comparative sequence analysis". BMC Bioinformatics 2: 8. doi:10.1186/1471210528. PMC 64605. PMID 11801179. http://www.biomedcentral.com/14712105/2/8.
External links
Categories: Bioinformatics
 Formal languages
 Natural language parsing
 Statistical natural language processing
 Probabilistic models
Wikimedia Foundation. 2010.
Look at other dictionaries:
Contextfree grammar — In formal language theory, a context free grammar (CFG) is a formal grammar in which every production rule is of the form V → w where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). The… … Wikipedia
Weighted contextfree grammar — A weighted context free grammar (WCFG) is a context free grammar where each production has a numeric weight associated with it. The weight of a parse tree in a WCFG is the weight of the rule used to produce the top node, plus the weights of its… … Wikipedia
Context free — may refer to: Concepts context free grammar deterministic context free grammar stochastic context free grammar weighted context free grammar context free language Software Context Free, a computer language for context free grammars … Wikipedia
Stochastic grammar — A stochastic grammar (statistical grammar) is a grammar framework with a probabilistic notion of grammaticality: *Stochastic context free grammar *Statistical parsing *Data oriented parsing *Hidden Markov model *Estimation theoryStatistical… … Wikipedia
Generative grammar — In theoretical linguistics, generative grammar refers to a particular approach to the study of syntax. A generative grammar of a language attempts to give a set of rules that will correctly predict which combinations of words will form… … Wikipedia
List of mathematics articles (S) — NOTOC S S duality S matrix S plane S transform S unit S.O.S. Mathematics SA subgroup Saccheri quadrilateral Sacks spiral Sacred geometry Saddle node bifurcation Saddle point Saddle surface Sadleirian Professor of Pure Mathematics Safe prime Safe… … Wikipedia
Algorithmic composition — is the technique of using algorithms to create music.Algorithms (or, at the very least, formal sets of rules) have been used to compose music for centuries; the procedures used to plot voice leading in Western counterpoint, for example, can often … Wikipedia
Syntax — Syntactic redirects here. For another meaning of the adjective, see Syntaxis. For other uses, see Syntax (disambiguation). Linguistics … Wikipedia
ImproVisor — Infobox Software name = Impro Visor developer = Bob Keller and students at Harvey Mudd College released = 2006 latest release version = 3.39 latest release date = June 2008 operating system = Microsoft Windows, Mac OS X, Linux license = GPL… … Wikipedia
Statistical parsing — is a group of parsing methods within natural language processing. The methods have in common that they associate grammar rules with a probability. Grammar rules are traditionally viewed in computational linguistics as defining the valid sentences … Wikipedia