Suffix tree

Suffix tree

In computer science, a suffix tree (also called suffix trie, PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.

The suffix tree for a string S is a tree whose edges are labeled with strings, and such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree (more specifically, a Patricia trie) for the suffixes of S.

Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.

History

The concept was first introduced as a "position tree" by Weiner in 1973cite conference
author=P. Weiner
title=Linear pattern matching algorithm
booktitle=14th Annual IEEE Symposium on Switching and Automata Theory
year=1973
pages=1-11
] in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976cite journal
author=Edward M. McCreight
title=A Space-Economical Suffix Tree Construction Algorithm
journal=Journal of the ACM
volume=23
issue=2
year=1976
pages=262--272
url=http://doi.acm.org/10.1145/321941.321946 | doi=10.1145/321941.321946
] , and also by Ukkonen in 1995cite journal
author=E. Ukkonen
title=On-line construction of suffix trees
journal=Algorithmica
volume=14
issue=3
year=1995
pages=249--260
url=http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf | doi=10.1007/BF01206331
] cite journal
author=R. Giegerich and S. Kurtz
title=From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction
journal=Algorithmica
volume=19
issue=3
year=1997
pages=331--353
url=http://www.zbh.uni-hamburg.de/staff/kurtz/papers/GieKur1997.pdf | doi=10.1007/PL00009177
] . Ukkonen provided the first linear-time online-construction of suffix trees, now known as Ukkonen's algorithm.

Definition

The suffix tree for the string S of length n is defined as a tree such that (cite book
last = Gusfield
first = Dan
origyear = 1997
year = 1999
title = Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
publisher = Cambridge University Press
location = USA
id = ISBN 0-521-58519-8
] page 90):
* the paths from the root to the leaves have a one-to-one relationship with the suffixes of S,
* edges spell non-empty strings,
* and all internal nodes (except perhaps the root) have at least two children.

Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the n suffixes of S. Since all internal non-root nodes are branching, there can be at most n-1 such nodes, and n+(n-1)+1=2n nodes in total.

Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string chialpha, where chi is a single character and alpha is a string (possibly empty), it has a suffix link to the internal node representing alpha. See for example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in some algorithms running on the tree.

Functionality

A suffix tree for a string S of length n can be built in Theta(n) time, if the alphabet is constant or integer cite conference
author=Martin Farach
title=Optimal suffix tree construction with large alphabets
booktitle=Foundations of Computer Science, 38th Annual Symposium on
year=1997
pages=137--143
url=ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1996/96-48.ps.gz
] . Otherwise, the construction time depends on the implementation. The costs below are given under the assumption that the alphabet is constant. If it is not, the cost depends on the implementation (see below).

Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D={S_1,S_2,...,S_K} of total length n=|n_1|+|n_2|+...+|n_K|.You can:

* Search for strings:
** Check if a string P of length m is a substring in O(m) time ( page 92).
** Find the first occurrence of the patterns P_1,...,P_q of total length m as substrings in O(m) time, when the suffix tree is built using Ukkonen's algorithm.
** Find all z occurrences of the patterns P_1,...,P_q of total length m as substrings in O(m + z) time ( page 123).
** Search for a regular expression "P" in time expected sublinear in n (cite journal
author=Ricardo A. Baeza-Yates and Gaston H. Gonnet
title=Fast text searching for regular expressions or automaton searching on tries
journal=Journal of the ACM
volume=43
number=6
year=1996
issn=0004-5411
pages=915--936
doi=10.1145/235809.235810
publisher=ACM Press
address=New York, NY, USA
] ).
** Find for each suffix of a pattern P, the length of the longest match between a prefix of P [i...m] and a substring in D in Theta(m) time ( page 132). This is termed the matching statistics for P.
* Find properties of the strings:
** Find the longest common substrings of the string S_i and S_j in Theta(n_i + n_j) time ( page 125).
** Find all maximal pairs, maximal repeats or supermaximal repeats in Theta(n + z) time ( page 144).
** Find the Lempel-Ziv decomposition in Theta(n) time ( page 166).
** Find the longest repeated substrings in Theta(n) time.
** Find the most frequently occurring substrings of a minimum length in Theta(n) time.
** Find the shortest strings from Sigma that do not occur in D, in O(n + z) time, if there are z such strings.
** Find the shortest substrings occurring only once in Theta(n) time.
** Find, for each i, the shortest substrings of S_i not occurring elsewhere in D in Theta(n) time.

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Theta(n) time ( chapter 8). You can then also:
* Find the longest common prefix between the suffixes S_i [p..n_i] and S_j [q..n_j] in Theta(1) ( page 196).
* Search for a pattern P of length m with at most k mismatches in O(k n + z) time, where z is the number of hits ( page 200).
* Find all z maximal palindromes in Theta(n)( page 198), or Theta(g n) time if gaps of length g are allowed, or Theta(k n) if k mismatches are allowed ( page 201).
* Find all z tandem repeats in O(n log n + z), and k-mismatch tandem repeats in O(k n log (n/k) + z) ( page 204).
* Find the longest substrings common to at least k strings in D for k=2..K in Theta(n) time ( page 205).

Uses

Suffix trees are often used in bioinformatics applications, where they are used for searching for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. It is also used in data compression, where on the one hand it is used to find repeated data and on the other hand it can be used for the sorting stage of the Burrows-Wheeler transform. Variants of the LZW compression schemes use it (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines (first introduced in cite conference
author=Oren Zamir and Oren Etzioni
title=Web document clustering: a feasibility demonstration
booktitle=SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
year=1998
pages=46--54
publisher=ACM
address=New York, NY, USA
] ).

Implementation

If each node and edge can be represented in Theta(1) space, the entire tree can be represented in Theta(n) space. The total length of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of "S", giving a total space usage of Theta(n) computer words. The worst-case space usage of a suffix tree is seen with a fibonacci string, giving the full 2n nodes.

An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has pointer to its first child, and to the next node in the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running time properties. We are interested in:
* The cost of finding the child on a given character.
* The cost of inserting a child.
* The cost of enlisting all children of a node (divided by the number of children in the table below).

Let sigma be the size of the alphabet. Then you have the following costs:

Note that the insertion cost is amortised, and that the costs for hashing are given perfect hashing.

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

ee also

* Suffix array

References

External links

* [http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm Suffix Trees] by Dr. Sartaj Sahni (CISE Department Chair at University of Florida)
* [http://www.allisons.org/ll/AlgDS/Tree/Suffix/ Suffix Trees] by Lloyd Allison
* [http://datacompression.info/SuffixTrees.shtml Suffix Trees] links collection by Mark Nelson
* [http://www.nist.gov/dads/HTML/suffixtree.html NIST's Dictionary of Algorithms and Data Structures: Suffix Tree]
* [http://www.cl.cam.ac.uk/~cpk25/libstree/ libstree] , a generic suffix tree library written in C
* [http://search.cpan.org/dist/Tree-Suffix/ Tree::Suffix] , a Perl binding to libstree
* [http://www.cs.ucdavis.edu/~gusfield/strmat.html Strmat] a faster generic suffix tree library written in C (uses arrays instead of linked lists)
* [http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/ SuffixTree] a Python binding to Strmat
* [http://www.balkenhol.net/papers/t1043.pdf.gz Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice] , application of suffix trees in the BWT
* [http://www.cs.helsinki.fi/group/suds/ Theory and Practice of Succinct Data Structures] , C++ implementation of a compressed suffix tree]
* [http://video.google.com/videoplay?docid=3849716474680225865 Google video on constructing a suffix tree]


Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

  • suffix tree — noun (of a string) a radix tree containing all suffixes of the string …   Wiktionary

  • Generalised suffix tree — A generalised suffix tree is a suffix tree for a set of strings. Given the set of strings D=S 1,S 2,dots,S d of total length n, it is a Patricia trie containing all n suffixes of the strings. It is mostly used in bioinformaticsref|BRCR.… …   Wikipedia

  • Prediction Suffix Tree — The concept of the Markov chain of order L, which we essentially owe to the Russianmathematician Andrej Andreevic Markov (1907), has two drawbacks. First, the number ofparameters of the model grows exponentially with the order L of the chain.… …   Wikipedia

  • Suffix array — In computer science, a suffix array is an array giving the suffixes of a string in lexicographical order.DetailsConsider the string abracadabra , of length 11. It has eleven suffixes: abracadabra , bracadabra , racadabra , and so on down to a .… …   Wikipedia

  • Tree (data structure) — A simple unordered tree; in this diagram, the node labeled 7 has two children, labeled 2 and 6, and one parent, labeled 2. The root node, at the top, has no parent. In computer science, a tree is a widely used data structure that emulates a… …   Wikipedia

  • tree — 1. noun /tɹiː,tʃɹiː/ a) A large plant, not exactly defined, but typically over four meters in height, a single trunk which grows in girth with age and branches (which also grow in circumference with age). is the tallest living tree in the world.… …   Wiktionary

  • suffix — 1. noun /ˈsʌfɪks,ˈsʌfɪks,səˈfɪks/ one or more letters or sounds added at the end of a word to modify the words meaning, such as able, which changes sing into singable, for example 2. verb /ˈsʌfɪks,ˈsʌfɪks,səˈfɪks/ to append (something) to the end …   Wiktionary

  • Compressed suffix array — The Compressed Suffix Array[1][2] is a compressed data structure for pattern matching. Given a text T of n characters from an alphabet Σ, the compressed suffix array support searching for arbitrary patterns in T. For an input pattern P of m… …   Wikipedia

  • Binary tree — Not to be confused with B tree. A simple binary tree of size 9 and height 3, with a root node whose value is 2. The above tree is unbalanced and not sorted. In computer science, a binary tree is a tree data structure in which each node has at… …   Wikipedia

  • Radix tree — In computer science, a radix tree (also patricia trie or radix trie) is a space optimized trie data structure where each node with only one child is merged with its child. The result is that every internal node has at least two children. Unlike… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”