Generalised suffix tree

Generalised suffix tree

A generalised suffix tree is a suffix tree for a set of strings. Given the set of strings D=S_1,S_2,dots,S_d of total length n, it is a Patricia trie containing all n suffixes of the strings. It is mostly used in bioinformaticsref|BRCR.

Functionality

It can be built in Theta(n) time and space, and you can use it to find all z occurrences of a string P of length m in O(m + z) time, which is asymptotically optimal (assuming the size of the alphabet is constant, see ref|Gus97 page 119).

When constructing such a tree, each string should be padded with a unique out-of-alphabet marker symbol (or string) to ensure no suffix is a substring of another, guaranteeing each suffix is represented by a unique leaf node.

Algorithms for constructing a GST include Ukkonen's algorithm and McCreight's algorithm.

Example

A suffix tree for the strings ABAB and BABA is shown in a figure above. They are padded with the unique terminator strings $0 and $1. The numbers in the leaf nodes are string number and starting position. Notice how a left to right traversal of the leaf nodes corresponds to the sorted order of the suffixes. The terminators might be strings or unique single symbols. Edges on $ from the root are left out in this example.

Alternatives

An alternative to building a generalised suffix tree is to concatenate the strings, and build a regular suffix tree or suffix array for the resulting string. When you evaluate the hits after a search, you map the global positions into documents and local positions with some algorithm and/or data structure, such as a binary search in the starting/ending positions of the documents.

References

* cite conference
author=Paul Bieganski, John Riedl, John Carlis, and Ernest F. Retzel
title=Generalized Suffix Trees for Biological Sequence Data
booktitle=Biotechnology Computing, Proceedings of the Twenty-Seventh Hawaii International Conference on.
year=1994
pages=35-44
url=http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=323593

* cite book
last = Gusfield
first = Dan
origyear = 1997
year = 1999
title = Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
publisher = Cambridge University Press
location = USA
id = ISBN 0-521-58519-8

External links

* [http://www.cosmion.net/jeroen/software/gst/ Online GST demo] : a web demo for generating a generalised suffix tree.


Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

  • Suffix tree — In computer science, a suffix tree (also called suffix trie, PAT tree or, in an earlier form, position tree) is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many… …   Wikipedia

  • Longest common substring problem — The longest common substring problem is to find the longest string (or strings) that is a substring (or are substrings) of two or more strings. It should not be confused with the longest common subsequence problem. (For an explanation of the… …   Wikipedia

  • GST — may stand for:* Goods and Services Tax, a type of value added tax * Generation skipping transfer tax, imposed by the United States on certain transfers by gift, inheritance, or bequest * General strain theory, states that social structures within …   Wikipedia

  • Wagiman language — language name=Wagiman familycolor=Australian region=Pine Creek, Northern Territory, Australia speakers=10 (2000) fam1=Gunwinyguan iso2=aus iso3=waqWagiman (also spelled Wageman , Wakiman , Wogeman ) is a near extinct indigenous Australian… …   Wikipedia

  • German language — German Deutsch Pronunciation [ˈdɔʏtʃ] Spoken in Primarily in German speaking Europe, as a minority language and amongst the German diaspora worldwide …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”