Burrows-Wheeler transform

The Burrows-Wheeler transform (BWT, also called block-sorting compression) is an algorithm used in data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while they were working at the DEC Systems Research Center in Palo Alto, California [Burrows M and Wheeler D, "A block sorting lossless data compression algorithm", Technical Report 124, Digital Equipment Corporation, 1994, http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html]. It is based on a previously unpublished transformation discovered by Wheeler in 1983.

When a character string is transformed by the BWT, none of its characters change value. The transformation only permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row. This is useful for compression, since a string that has runs of repeated characters tends to be easy to compress with techniques such as the move-to-front transform and run-length encoding.

The output of the transform is easier to compress because it has many repeated characters.
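As a rough illustration of why runs help, the following sketch run-length encodes a string. The function name `run_length_encode` and the list-of-pairs output format are assumptions made for this example, not part of any particular compressor.

```python
from itertools import groupby


def run_length_encode(s):
    """Collapse each run of equal characters into a (character, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


# A string with long runs shrinks to a few pairs; a string without runs does not.
print(run_length_encode("aaaabbbcc"))
print(run_length_encode("abcabcabc"))
```

The first call yields three pairs, while the second yields one pair per character, which is why a transform that gathers equal characters into runs makes this kind of coder more effective.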

Example

The transform is done by sorting all rotations of the text, then taking the last column. For example, the text "^BANANA@" is transformed into "BNN^AA@A" through these steps (the @ character indicates the 'EOF' pointer):

Transformation

Input      All Rotations   Sort the Rows   Output
^BANANA@   ^BANANA@        ANANA@^B        BNN^AA@A
           @^BANANA        ANA@^BAN
           A@^BANAN        A@^BANAN
           NA@^BANA        BANANA@^
           ANA@^BAN        NANA@^BA
           NANA@^BA        NA@^BANA
           ANANA@^B        ^BANANA@
           BANANA@^        @^BANANA

The following pseudocode gives a simple, but inefficient, way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character, occurs nowhere else in the text, and is ignored during sorting.

function BWT (string s)
    create a table, rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)

function inverseBWT (string s)
    create empty table
    repeat length(s) times
        insert s as a column of table before first column of the table   // first insert creates first column
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
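The pseudocode above can be sketched in Python roughly as follows. This is a minimal, deliberately inefficient version; the names `bwt`, `inverse_bwt` and `_sort_key`, and the choice of "@" as the 'EOF' character ranked after every other character (matching the example table), are assumptions of this sketch.

```python
EOF = "@"  # assumed 'EOF' marker: last character of the input, occurs nowhere else


def _sort_key(row):
    """Order rows alphabetically, ranking EOF after every other character."""
    return [(c == EOF, c) for c in row]


def bwt(s):
    """Naive forward transform: sort all rotations, return the last column."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort(key=_sort_key)
    return "".join(row[-1] for row in rotations)


def inverse_bwt(r):
    """Naive inverse: repeatedly prepend r as a column and re-sort the rows."""
    table = [""] * len(r)
    for _ in range(len(r)):
        table = sorted((c + row for c, row in zip(r, table)), key=_sort_key)
    # The original text is the row that ends with the EOF character.
    return next(row for row in table if row.endswith(EOF))


print(bwt("^BANANA@"))          # BNN^AA@A
print(inverse_bwt("BNN^AA@A"))  # ^BANANA@
```

Both functions build the whole table, so they need time and memory that grow roughly with the square of the input length.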

To understand why this creates more easily compressible data, consider transforming a long English text that frequently contains the word "the". Sorting the rotations of this text will often group rotations starting with "he " together, and the last character of each such rotation (which is also the character before the "he ") will usually be "t". The result of the transform would therefore contain a run of "t" characters with the less common exceptions (for example, if the text contains "Brahe ") mixed in. The success of this transform thus depends on one value having a high probability of occurring before a given sequence, so in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
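The grouping effect can be observed on a small text. In the sketch below, the helper `chars_before` and the sample sentence are made up for illustration, and the null character is assumed as the end marker, sorted by ordinary character order.

```python
def chars_before(text, context):
    """Last characters of the sorted rotations that begin with `context`."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return [row[-1] for row in rotations if row.startswith(context)]


# Hypothetical sample: three occurrences of "the " and one of "Brahe ".
sample = "the cat saw the dog near the gate of Brahe \0"
print(chars_before(sample, "he "))  # ['a', 't', 't', 't']
```

The rotations starting with "he " end up adjacent after sorting, so their preceding characters appear next to each other in the output: three "t" characters and the single "a" from "Brahe ".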

The remarkable thing about the BWT is not that it generates a more easily encoded output (an ordinary sort would do that) but that it is reversible, allowing the original document to be regenerated from the last column data.

The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column. Then, the first and last columns together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the 'EOF' character at the end is the original text. Reversing the example above is done like this:

Inverse Transformation

Input: BNN^AA@A

Add 1   Sort 1   Add 2   Sort 2
B       A        BA      AN
N       A        NA      AN
N       A        NA      A@
^       B        ^B      BA
A       N        AN      NA
A       N        AN      NA
@       ^        @^      ^B
A       @        A@      @^

Add 3   Sort 3   Add 4   Sort 4
BAN     ANA      BANA    ANAN
NAN     ANA      NANA    ANA@
NA@     A@^      NA@^    A@^B
^BA     BAN      ^BAN    BANA
ANA     NAN      ANAN    NANA
ANA     NA@      ANA@    NA@^
@^B     ^BA      @^BA    ^BAN
A@^     @^B      A@^B    @^BA

Add 5   Sort 5   Add 6    Sort 6
BANAN   ANANA    BANANA   ANANA@
NANA@   ANA@^    NANA@^   ANA@^B
NA@^B   A@^BA    NA@^BA   A@^BAN
^BANA   BANAN    ^BANAN   BANANA
ANANA   NANA@    ANANA@   NANA@^
ANA@^   NA@^B    ANA@^B   NA@^BA
@^BAN   ^BANA    @^BANA   ^BANAN
A@^BA   @^BAN    A@^BAN   @^BANA

Add 7     Sort 7    Add 8      Sort 8
BANANA@   ANANA@^   BANANA@^   ANANA@^B
NANA@^B   ANA@^BA   NANA@^BA   ANA@^BAN
NA@^BAN   A@^BANA   NA@^BANA   A@^BANAN
^BANANA   BANANA@   ^BANANA@   BANANA@^
ANANA@^   NANA@^B   ANANA@^B   NANA@^BA
ANA@^BA   NA@^BAN   ANA@^BAN   NA@^BANA
@^BANAN   ^BANANA   @^BANANA   ^BANANA@
A@^BANA   @^BANAN   A@^BANAN   @^BANANA

Output: ^BANANA@
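The first two steps of this reconstruction (sorting the last column to obtain the first column, then sorting the cyclic pairs) can be checked with a few lines of Python. The helper `rank` is an assumption of this sketch; it ranks the "@" marker after all other characters, as in the tables.

```python
EOF = "@"  # assumed 'EOF' marker, ranked after every other character


def rank(s):
    """Sort key that orders strings alphabetically with EOF last."""
    return [(c == EOF, c) for c in s]


last = "BNN^AA@A"

# Sorting the characters of the last column yields the first column.
first = sorted(last, key=rank)

# Each last-column character is followed (cyclically) by the first-column
# character of the same row; sorting these pairs yields the first two columns.
pairs = sorted((l + f for l, f in zip(last, first)), key=rank)

print("".join(first))  # AAABNN^@
print(pairs)           # ['AN', 'AN', 'A@', 'BA', 'NA', 'NA', '^B', '@^']
```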

A number of optimizations can make these algorithms run more efficiently without changing the output. In BWT, there is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the string, and the sort performed using the indices. Some care must be taken to ensure that the sort does not exhibit bad worst-case behavior: standard library sort functions are unlikely to be appropriate. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. The example code below demonstrates efficient decoding. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
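One way to decode without building or sorting a table is to count characters and follow a mapping from each position in the last column to the matching position in the first column. The sketch below is one possible implementation of that idea, not necessarily the one the original authors describe. The name `inverse_bwt_fast` is hypothetical, and it assumes the null character as the 'EOF' marker, ordered by ordinary character comparison, so it fits a forward transform that sorts the same way.

```python
from collections import Counter


def inverse_bwt_fast(last, eof="\0"):
    """Decode a BWT last column right to left without building the table."""
    # first_row[c] = index of the first sorted row that starts with c.
    counts = Counter(last)
    first_row = {}
    total = 0
    for c in sorted(counts):  # sorts only the alphabet, not the rows
        first_row[c] = total
        total += counts[c]

    # lf[i] = row whose first character is the same occurrence as last[i].
    seen = Counter()
    lf = []
    for c in last:
        lf.append(first_row[c] + seen[c])
        seen[c] += 1

    # The row ending in EOF is the original text; walk backwards from it.
    row = last.index(eof)
    out = []
    for _ in range(len(last)):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))  # the result still ends with the EOF marker


print(repr(inverse_bwt_fast("ANNB\0AA")))  # 'BANANA\x00'
```

Each character is produced with a constant number of lookups, so the work after the counting pass is proportional to the string length.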

There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
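A minimal sketch of the pointer-based variant follows. The names `bwt_with_pointer` and `inverse_bwt_with_pointer` are hypothetical, the pointer is taken to be the index of the original string among the sorted rotations, and the naive table method is used for clarity rather than speed.

```python
def bwt_with_pointer(s):
    """Return (last column, index of the original string among sorted rotations)."""
    if not s:
        return "", 0
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations), rotations.index(s)


def inverse_bwt_with_pointer(last, pointer):
    """Rebuild the sorted table column by column and pick the row at `pointer`."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return table[pointer] if table else ""


encoded, pointer = bwt_with_pointer("BANANA")
print(encoded, pointer)                            # NNBAAA 3
print(inverse_bwt_with_pointer(encoded, pointer))  # BANANA
```

The transformed string has the same length as the input, and the only extra output is the integer pointer.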

There is a bijective version of the transform that causes no expansion and needs no index, so the length is preserved. Unlike the standard versions, the bijective version allows any file to have an inverse. For more information plus C code see the External Links section.

A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.

Sample implementations

Python language

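Using the null character as the end of file sigil, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows:

def bwt(s):
    # Append the null character as the end of file sigil.
    s = s + '\0'
    # Sort all rotations of s and take the last character of each row.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(row[-1] for row in rotations)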


Wikimedia Foundation. 2010.
