LZ77 and LZ78

LZ77 and LZ78

LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively [http://www.patentstorm.us/patents/5532693-description.html] . These two algorithms form the basis for most of the LZ variations including LZW, LZSS, LZMA and others.

They are both dictionary coders, unlike minimum redundancy coders. LZ77 is the "sliding window" compression algorithm, which was later shown to be equivalent to the explicit dictionary technique first given in LZ78.

LZ77

LZ77 algorithms achieve compression by replacing portions of the data with references to matching data that have already passed through both encoder and decoder. A match is encoded by a pair of numbers called a "length-distance pair", which is equivalent to the statement "each of the next "length" characters is equal to the character exactly "distance" characters behind it in the uncompressed stream." (The "distance" is sometimes called the "offset" instead.)

The encoder and decoder must both keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a "sliding window", which is why LZ77 is sometimes called sliding window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. This is why the encoder can use a smaller size sliding window than the decoder, but not vice-versa.

Many documents which talk about LZ77 algorithms describe a length-distance pair as a command to "copy" data from the sliding window: "Go back "distance" characters in the buffer and copy "length" characters, starting from that point." While those used to imperative programming may find this model intuitive, it may also make it hard to understand a feature of LZ77 encoding: namely, that it is not only acceptable but frequently useful to have a length-distance pair where the length actually exceeds the distance. As a copy command, this is puzzling: "Go back "one" character in the buffer and copy "seven" characters, starting from that point." How can seven characters be copied from the buffer when only one of the specified characters is actually "in" the buffer? Looking at a length-distance pair as a statement of identity, however, clarifies the confusion: each of the next seven characters is identical to the character that comes one before it. This means that each character can be determined by looking back in the buffer – even if the character looked back to was not "in" the buffer when the decoding of the current pair began. Since by definition a pair like this will be repeating a sequence of "distance" characters multiple times, it means that LZ77 incorporates a flexible and easy form of run-length encoding.

Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they output their encoded data – what values are possible for lengths and distances, for example, and how length-distance pairs are distinguished from "literals" (single characters encoded as themselves, rather than as part of a length-distance pair.) A few examples:
* The algorithm illustrated in Lempel and Ziv's original 1977 paper outputs all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal which followed that match. If two successive characters in the input stream could only be encoded as literals, the length would be 0.
* In the PalmDoc format, a length-distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
* As of 2008, the most popular LZ77 based compression method is called DEFLATE; it combines LZ77 with Huffman coding. Literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances can be safely placed into a separate alphabet; since a distance only occurs just after a length, it cannot be mistaken for another kind of symbol or vice-versa.

LZ78

While the LZ77 algorithm works on past data, the LZ78 algorithm attempts to work on future data. It does this by forward scanning the input buffer and matching it against a dictionary it maintains. It will scan into the buffer until it cannot find a match in the dictionary. At this point it will output the location of the word in the dictionary, if one is available, the match length and the character that caused a match failure. The resulting word is then added to the dictionary.

Though initially popular, the popularity of LZ78 later dampened, possibly because for the first few decades after it was introduced, parts of LZ78 were patent protected in the United States. The most popular form of LZ78 compression was the LZW algorithm, a modification of the LZ78 algorithm made by Terry Welch.

References

* Jacob Ziv and Abraham Lempel; [http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1977_universal_algorithm.pdf "A Universal Algorithm for Sequential Data Compression"] , IEEE Transactions on Information Theory, 23(3), pp.337-343, May 1977.


Wikimedia Foundation. 2010.

Игры ⚽ Нужно сделать НИР?

Look at other dictionaries:

  • Dictionary coder — A dictionary coder, also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure… …   Wikipedia

  • DEFLATE — redirects here. For other uses, see Deflation (disambiguation). Deflate is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP… …   Wikipedia

  • Lempel-Ziv-Welch — (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The… …   Wikipedia

  • Graphics Interchange Format — Infobox file format name = Graphics Interchange Format caption = A rotating globe in GIF format. The gradient blue areas of this image transition choppily, a common artifact produced when dithering is not employed. extension = .gif mime =… …   Wikipedia

  • List of algorithms — The following is a list of the algorithms described in Wikipedia. See also the list of data structures, list of algorithm general topics and list of terms relating to algorithms and data structures.If you intend to describe a new algorithm,… …   Wikipedia

  • Data deduplication — In computing, data deduplication is a specialized data compression technique for eliminating coarse grained redundant data. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the… …   Wikipedia

  • Universal coding — can refer to one of two concepts in data compression: * Universal code (data compression), a fixed prefix code that, for any probability mass function, has a data compression ratio within a constant of the optimal prefix code * Universal source… …   Wikipedia

  • LZC — Der LZW oder auch Lempel Ziv Welch Algorithmus ist ein häufig bei Grafikformaten zur Datenkompression, also zur Reduzierung der Datenmenge, eingesetzter Algorithmus. Ein Großteil der Funktionsweise dieses Algorithmus wurde 1978 von Abraham Lempel …   Deutsch Wikipedia

  • LZW — Der LZW oder auch Lempel Ziv Welch Algorithmus ist ein häufig bei Grafikformaten zur Datenkompression, also zur Reduzierung der Datenmenge, eingesetzter Algorithmus. Ein Großteil der Funktionsweise dieses Algorithmus wurde 1978 von Abraham Lempel …   Deutsch Wikipedia

  • LZW-Algorithmus — Der LZW oder auch Lempel Ziv Welch Algorithmus ist ein häufig bei Grafikformaten zur Datenkompression, also zur Reduzierung der Datenmenge, eingesetzter Algorithmus. Ein Großteil der Funktionsweise dieses Algorithmus wurde 1978 von Abraham Lempel …   Deutsch Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”