Position-specific scoring matrix

A position weight matrix (PWM), also called position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences (Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I. Identification of Transcription Factor Binding Sites with Variable-order Bayesian Networks. Bioinformatics 21(11):2657–2666, 2005. doi:10.1093/bioinformatics/bti410. http://bioinformatics.oxfordjournals.org/cgi/reprint/bti410?ijkey=KkxNhRdTSfvtvXY&keytype=ref).

A PWM is a matrix of score values that gives a weighted match to any given substring of fixed length. It has one row for each symbol of the alphabet and one column for each position in the pattern. The score assigned by a PWM to a substring $s = (s_j)_{j=1}^{N}$ is defined as $\sum_{j=1}^{N} m_{s_j,j}$, where $j$ is the position in the substring, $s_j$ is the symbol at position $j$ in the substring, and $m_{\alpha,j}$ is the score in row $\alpha$, column $j$ of the matrix. In other words, a PWM score is the sum of position-specific scores for each symbol in the substring.
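The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the matrix values, the dictionary layout (one list of column scores per symbol), and the function names `pwm_score` and `scan` are assumptions made for the example, not part of any particular tool.

```python
# Hypothetical toy PWM over the DNA alphabet:
# one row (list) per symbol, one entry per position in the pattern.
PWM = {
    "A": [1.0, -2.0, 0.5],
    "C": [-1.0, 1.5, -0.5],
    "G": [-1.0, -0.5, 1.0],
    "T": [0.5, 0.0, -2.0],
}


def pwm_score(pwm, s):
    """Return the sum over j of m[s_j][j] for a substring s of PWM width."""
    width = len(next(iter(pwm.values())))
    if len(s) != width:
        raise ValueError("substring length must equal the number of PWM columns")
    return sum(pwm[symbol][j] for j, symbol in enumerate(s))


def scan(pwm, sequence):
    """Score every window of the PWM's width along a longer sequence."""
    width = len(next(iter(pwm.values())))
    return [
        pwm_score(pwm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]
```

With this toy matrix, `pwm_score(PWM, "ACG")` adds the entries for A at position 1, C at position 2, and G at position 3, while `scan` applies the same rule to each window of a longer sequence.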

Basic PWM with log-likelihoods

A PWM assumes independence between positions in the pattern, as it calculates the score at each position independently of the symbols at other positions. The score of a substring aligned with a PWM can be interpreted as the log-likelihood of the substring under a product-multinomial distribution. Each column holds the log-likelihoods of the different symbols, and the likelihoods in a column sum to one, so each column corresponds to a multinomial distribution. The PWM score is a sum of log-likelihoods across columns, which corresponds to a product of likelihoods, so the score is the log-likelihood of the substring under the product of these per-column multinomials. PWM scores can also be interpreted in a physical framework as the sum of binding energies for all nucleotides (symbols of the substring) aligned with the PWM.
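As a minimal sketch of this interpretation, the following Python code estimates a log-likelihood PWM from a set of aligned sites. The function name `log_likelihood_pwm`, the example sites, the base-2 logarithm, and the pseudocount (added so that unobserved symbols do not produce log(0)) are all assumptions of the example; the text above does not prescribe an estimation procedure.

```python
import math


def log_likelihood_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Estimate per-column symbol probabilities and store their base-2 logs.

    sites: equal-length strings, assumed to be aligned motif instances.
    pseudocount: illustrative smoothing constant (an assumption here).
    """
    width = len(sites[0])
    pwm = {a: [] for a in alphabet}
    for j in range(width):
        column = [site[j] for site in sites]
        total = len(column) + pseudocount * len(alphabet)
        for a in alphabet:
            p = (column.count(a) + pseudocount) / total
            pwm[a].append(math.log2(p))
    return pwm


# Hypothetical aligned sites, for illustration only.
sites = ["ACG", "ACG", "ATG", "ACG"]
pwm = log_likelihood_pwm(sites)

# Summing log-likelihoods over positions is the log of the product of
# the per-position likelihoods.
score = sum(pwm[symbol][j] for j, symbol in enumerate("ACG"))
likelihood = 2 ** score
```

Because each column is normalised, exponentiating the entries of any column gives probabilities that sum to one, which is what makes the per-column multinomial reading valid.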

Incorporating background distribution

Instead of using log-likelihood values in the PWM, as described in the previous section, several methods use log-odds scores. An element of the PWM is then calculated as $m_{i,j} = \log(p_{i,j} / b_i)$, where $p_{i,j}$ is the probability of observing symbol $i$ at position $j$ of the motif, and $b_i$ is the probability of observing symbol $i$ in a background model. The PWM score then corresponds to the log-odds of the substring being generated by the motif versus being generated by the background, in a generative model of the sequence.
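The log-odds construction can be sketched as follows. The probability matrix, the uniform background, the base-2 logarithm, and the name `log_odds_pwm` are assumptions chosen for illustration; the sketch also assumes every probability is strictly positive, since a zero entry would make the logarithm undefined.

```python
import math


def log_odds_pwm(probs, background):
    """Return m[i][j] = log2(p[i][j] / b[i]) for every symbol i and position j.

    Assumes all probabilities are strictly positive.
    """
    return {
        i: [math.log2(p / background[i]) for p in row]
        for i, row in probs.items()
    }


# Hypothetical two-position motif; each column sums to one.
probs = {
    "A": [0.5, 0.25],
    "C": [0.25, 0.25],
    "G": [0.125, 0.25],
    "T": [0.125, 0.25],
}
# Hypothetical uniform background model.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

m = log_odds_pwm(probs, background)
```

A positive entry means the symbol is more likely under the motif than under the background at that position, a negative entry means it is less likely, and zero means the two models agree.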

Information content of a PWM

The information content (IC) of a PWM is sometimes of interest, as it indicates how much the PWM differs from a uniform distribution.

The self-information of observing a particular symbol at a particular position of the motif is:

$-\log(p_{i,j})$

The expected (average) self-information of a particular element in the PWM is then:

$-p_{i,j} \cdot \log(p_{i,j})$

Finally, the IC of the PWM is the sum of the expected self-information of every element:

$-\sum_{i,j} p_{i,j} \cdot \log(p_{i,j})$
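A minimal sketch of this sum in Python is shown below. The function name `summed_self_information`, the base-2 logarithm (which gives the result in bits), and the convention that a zero probability contributes nothing are assumptions of the example.

```python
import math


def summed_self_information(probs):
    """Return -sum over i, j of p[i][j] * log2(p[i][j]), in bits."""
    total = 0.0
    for row in probs.values():
        for p in row:
            if p > 0:  # a zero probability is treated as contributing 0
                total -= p * math.log2(p)
    return total


# Hypothetical motif: a uniform first column and a fully conserved second column.
probs = {
    "A": [0.25, 1.0],
    "C": [0.25, 0.0],
    "G": [0.25, 0.0],
    "T": [0.25, 0.0],
}
value = summed_self_information(probs)
```

In this example the uniform column contributes 2 bits and the fully conserved column contributes 0, so the whole sum comes from the uniform column.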

Often, it is more useful to calculate the information content using the background letter frequencies of the sequences under study rather than assuming equal probabilities for each letter. For example, the GC-content of the DNA of thermophilic bacteria ranges from 65.3% to 70.8% (Aleksandrushkina NI, Egorova LA. Nucleotide makeup of the DNA of thermophilic bacteria of the genus Thermus. Mikrobiologiia 47(2):250–2, 1978. PMID 661633), so a motif of ATAT would contain much more information than a motif of CCGG. The equation for information content thus becomes:

$\sum_{i,j} p_{i,j} \cdot \log(p_{i,j} / b_i)$

where $b_i$ is the background frequency of symbol $i$.
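The ATAT versus CCGG comparison can be checked with a short Python sketch. The background frequencies below (70% GC, split evenly between the paired letters) are illustrative values only, and the names `relative_entropy` and `consensus_only`, along with the base-2 logarithm, are assumptions of the example.

```python
import math


def relative_entropy(probs, background):
    """Return the sum over i, j of p[i][j] * log2(p[i][j] / b[i]), in bits."""
    total = 0.0
    for i, row in probs.items():
        for p in row:
            if p > 0:  # a zero probability is treated as contributing 0
                total += p * math.log2(p / background[i])
    return total


def consensus_only(motif, alphabet="ACGT"):
    """Probability matrix that puts all mass on the given consensus string."""
    return {a: [1.0 if a == s else 0.0 for s in motif] for a in alphabet}


# Hypothetical GC-rich background (70% GC), for illustration only.
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}

at_motif_ic = relative_entropy(consensus_only("ATAT"), gc_rich)
gc_motif_ic = relative_entropy(consensus_only("CCGG"), gc_rich)
```

Under this background, `at_motif_ic` is larger than `gc_motif_ic`, because A and T are rare in the background and observing them at every position is more surprising than observing C and G.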

External links

JASPAR: http://jaspar.genereg.net/


Wikimedia Foundation. 2010.
