Dot plot (bioinformatics)

A dot plot (also known as a contact plot or residue contact map) is a graphical method for comparing two biological sequences and identifying regions of close similarity between them. It is a kind of recurrence plot.

Introduction

One way to visualize the similarity between two protein or nucleic acid sequences is to use a similarity matrix, known as a dot plot. Dot plots were introduced by Gibbs and McIntyre in 1970 and are two-dimensional matrices with the sequences being compared along the vertical and horizontal axes. For a simple visual representation of the similarity between two sequences, individual cells in the matrix are shaded black where the residues are identical, so that matching sequence segments appear as diagonal runs across the matrix.
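The shading rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `dot_matrix` and `render` and the two example sequences are assumptions made here for demonstration only.

```python
def dot_matrix(seq_rows, seq_cols):
    """Return a matrix of booleans; cell [i][j] is True when the residues match."""
    return [[a == b for b in seq_cols] for a in seq_rows]


def render(matrix, seq_rows, seq_cols):
    """Draw the matrix as text: '*' for a shaded cell, '.' for an empty one."""
    lines = ["  " + seq_cols]  # header row: the sequence on the horizontal axis
    for residue, row in zip(seq_rows, matrix):
        lines.append(residue + " " + "".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example sequences, chosen only for illustration.
    a = "HEAGAWGHEE"
    b = "PAWHEAE"
    print(render(dot_matrix(a, b), a, b))
```

Matching segments such as "HEA" and "AW" show up as short diagonal runs of `*` in the printed grid, alongside isolated chance matches.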

Some idea of the similarity of the two sequences can be gleaned from the number and length of matching segments shown in the matrix. Identical proteins produce an unbroken line along the main diagonal of the matrix. Insertions and deletions between sequences give rise to disruptions in this diagonal. Regions of local similarity or repetitive sequences give rise to further diagonal matches in addition to the central diagonal. Because the protein alphabet is limited to 20 amino acids, many matching sequence segments may simply have arisen by chance. One way of reducing this noise is to shade only runs, or 'tuples', of matching residues; a tuple of 3, for example, corresponds to three matching residues in a row. This is effective because the probability of matching three residues in a row by chance is much lower than that of a single-residue match. It can be seen from Figures 3.3b,c that the number of diagonal runs in the matrix is considerably reduced by looking for 2-tuples or 3-tuples.
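As a rough guide to why tuples suppress noise: under the simplifying assumption of 20 equally likely amino acids, a single cell matches by chance with probability 1/20 = 0.05, whereas three consecutive cells match with probability (1/20)^3 = 1/8000. The sketch below applies the tuple rule; the names `tuple_dot_matrix` and `count_shaded` and the parameter `k` are assumptions introduced for illustration, and real tools may define the filter differently (for example with scored windows).

```python
def tuple_dot_matrix(seq_rows, seq_cols, k=3):
    """Shade only cells that belong to a run of k identical residues on a diagonal."""
    n, m = len(seq_rows), len(seq_cols)
    shaded = [[False] * m for _ in range(n)]
    for i in range(n - k + 1):
        for j in range(m - k + 1):
            # Compare the k-tuple starting at row i with the one starting at column j.
            if seq_rows[i:i + k] == seq_cols[j:j + k]:
                for step in range(k):
                    shaded[i + step][j + step] = True
    return shaded


def count_shaded(matrix):
    """Count the shaded cells, a crude measure of how busy the plot is."""
    return sum(cell for row in matrix for cell in row)


if __name__ == "__main__":
    seq = "ABCAB"  # hypothetical toy sequence compared with itself
    print(count_shaded(tuple_dot_matrix(seq, seq, k=1)))  # every identical pair
    print(count_shaded(tuple_dot_matrix(seq, seq, k=3)))  # only 3-tuples
```

With `k=1` the function reduces to plain identity shading, so the same code can be used to compare the unfiltered and filtered plots.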

Dot plots are one of the oldest ways of comparing two sequences. They compare two sequences by placing one along the x-axis and the other along the y-axis of a plot. Wherever the residue at a position in one sequence matches the residue at a position in the other, a dot is drawn at the corresponding coordinates. The sequences can be written forwards or backwards, but both axes must use the same direction, and the direction chosen determines the direction of the lines on the dot plot. Once the dots have been plotted, runs of adjacent dots combine to form lines. The more similar the two sequences are, the more closely the plot approximates a single straight diagonal line, like the graph of a direct one-to-one relationship.

This pattern is affected by sequence features such as frame shifts, direct repeats, and inverted repeats. Frame shifts, caused by insertions and deletions, displace the line onto a parallel diagonal, while point mutations leave gaps in it. One or more of these features will produce multiple lines in a variety of configurations, depending on which features are present in the sequences. Low-complexity regions give a markedly different result. These are regions composed of only a few distinct amino acids, which makes the sequence within them highly redundant. They are typically found around the diagonal, and may or may not appear as a square in the middle of the dot plot.
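The effect of an insertion or a direct repeat on the lines can be made concrete by listing each diagonal run together with its offset from the main diagonal. The following is a sketch under stated assumptions: the function name `diagonal_runs`, the `min_length` cut-off, and the toy sequences are all hypothetical and chosen only to illustrate the idea.

```python
def diagonal_runs(seq_rows, seq_cols, min_length=3):
    """List (offset, row_start, length) for each run of matches along a diagonal.

    offset = column index - row index, so the main diagonal has offset 0.
    """
    n, m = len(seq_rows), len(seq_cols)
    runs = []
    for offset in range(-(n - 1), m):
        i = max(0, -offset)  # first row on this diagonal
        j = i + offset       # matching column
        length = 0
        while i < n and j < m:
            if seq_rows[i] == seq_cols[j]:
                length += 1
            else:
                if length >= min_length:
                    runs.append((offset, i - length, length))
                length = 0
            i += 1
            j += 1
        if length >= min_length:  # run that reaches the edge of the matrix
            runs.append((offset, i - length, length))
    return runs


if __name__ == "__main__":
    original = "ACDEFGHIKL"
    with_insertion = "ACDEFWWGHIKL"  # two residues inserted after position 5
    print(diagonal_runs(original, with_insertion))
    repeat = "ABCDXXABCD"            # direct repeat of "ABCD"
    print(diagonal_runs(repeat, repeat))
```

In the first comparison the line jumps from offset 0 to offset 2 at the insertion point; in the self-comparison the direct repeat appears as extra runs at offsets +6 and -6 on either side of the main diagonal.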

Example

A DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383), showing regional self-similarity. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence.

Example of a dot plot for comparing two simple protein sequences:

1. All cells associated with identical residue pairs between the sequences are shaded black;
2. Only those cells associated with identical tuples of two residues are shaded black; and,
3. Only cells associated with tuples of three are shaded and the optimal path through the matrix has been drawn.

The optimal path is constrained to lie within the window bounded by the two black lines parallel to the central diagonal. An alternative high-scoring path is also shown.
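The idea of restricting the path to a window around the central diagonal can be illustrated with a small dynamic-programming sketch. It is a deliberate simplification and not the scoring scheme used for the figure: it merely counts identical residue pairs along a monotone path, applies no gap penalties, and the name `banded_match_score` and the parameter `band` are assumptions introduced here.

```python
def banded_match_score(seq_rows, seq_cols, band=2):
    """Most identical residue pairs on a monotone path with |i - j| <= band."""
    n, m = len(seq_rows), len(seq_cols)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the far corner of the matrix")
    neg_inf = float("-inf")
    # best[i][j]: best score using the first i row residues and first j column residues.
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if (i == 0 and j == 0) or abs(i - j) > band:
                continue  # cells outside the window stay unreachable
            candidates = []
            if i > 0:
                candidates.append(best[i - 1][j])  # skip a row residue
            if j > 0:
                candidates.append(best[i][j - 1])  # skip a column residue
            if i > 0 and j > 0:
                match = 1 if seq_rows[i - 1] == seq_cols[j - 1] else 0
                candidates.append(best[i - 1][j - 1] + match)
            best[i][j] = max(candidates)
    return best[n][m]


if __name__ == "__main__":
    print(banded_match_score("ACDEFGHIKL", "ACDEFWWGHIKL", band=2))
    print(banded_match_score("ABCDEF", "DEFABC", band=2))
    print(banded_match_score("ABCDEF", "DEFABC", band=3))
```

A narrow window is cheaper to search but can miss matches that lie far from the central diagonal, as the last two calls show: the shared segments sit at offset 3, so they are only found once the band is at least that wide.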

References

Wikimedia Foundation. 2010.
