- LZ77 and LZ78
LZ77 and LZ78 are the names for the two
lossless data compression algorithm s published in papers byAbraham Lempel andJacob Ziv in1977 and1978 . They are also known as LZ1 and LZ2 respectively [http://www.patentstorm.us/patents/5532693-description.html] . These two algorithms form the basis for most of the LZ variations including LZW, LZSS, LZMA and others.They are both
dictionary coder s, unlike minimum redundancy coders. LZ77 is the "sliding window" compression algorithm, which was later shown to be equivalent to the explicit dictionary technique first given in LZ78.LZ77
LZ77 algorithms achieve compression by replacing portions of the data with references to matching data that have already passed through both encoder and decoder. A match is encoded by a pair of numbers called a "length-distance pair", which is equivalent to the statement "each of the next "length" characters is equal to the character exactly "distance" characters behind it in the uncompressed stream." (The "distance" is sometimes called the "offset" instead.)
The encoder and decoder must both keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a "sliding window", which is why LZ77 is sometimes called sliding window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. This is why the encoder can use a smaller size sliding window than the decoder, but not vice-versa.
Many documents which talk about LZ77 algorithms describe a length-distance pair as a command to "copy" data from the sliding window: "Go back "distance" characters in the buffer and copy "length" characters, starting from that point." While those used to
imperative programming may find this model intuitive, it may also make it hard to understand a feature of LZ77 encoding: namely, that it is not only acceptable but frequently useful to have a length-distance pair where the length actually exceeds the distance. As a copy command, this is puzzling: "Go back "one" character in the buffer and copy "seven" characters, starting from that point." How can seven characters be copied from the buffer when only one of the specified characters is actually "in" the buffer? Looking at a length-distance pair as a statement of identity, however, clarifies the confusion: each of the next seven characters is identical to the character that comes one before it. This means that each character can be determined by looking back in the buffer – even if the character looked back to was not "in" the buffer when the decoding of the current pair began. Since by definition a pair like this will be repeating a sequence of "distance" characters multiple times, it means that LZ77 incorporates a flexible and easy form ofrun-length encoding .Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they output their encoded data – what values are possible for lengths and distances, for example, and how length-distance pairs are distinguished from "literals" (single characters encoded as themselves, rather than as part of a length-distance pair.) A few examples:
* The algorithm illustrated in Lempel and Ziv's original 1977 paper outputs all its data three values at a time: the length and distance of the longest match found in the buffer, and the literal which followed that match. If two successive characters in the input stream could only be encoded as literals, the length would be 0.
* In thePalmDoc format, a length-distance pair is always encoded by a two-byte sequence. Of the 16 bits that make up these two bytes, 11 bits go to encoding the distance, 3 go to encoding the length, and the remaining two are used to make sure the decoder can identify the first byte as the beginning of such a two-byte sequence.
*As of 2008 , the most popular LZ77 based compression method is calledDEFLATE ; it combines LZ77 withHuffman coding . Literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances can be safely placed into a separate alphabet; since a distance only occurs just after a length, it cannot be mistaken for another kind of symbol or vice-versa.LZ78
While the LZ77 algorithm works on past data, the LZ78 algorithm attempts to work on future data. It does this by forward scanning the input buffer and matching it against a dictionary it maintains. It will scan into the buffer until it cannot find a match in the dictionary. At this point it will output the location of the word in the dictionary, if one is available, the match length and the character that caused a match failure. The resulting word is then added to the dictionary.
Though initially popular, the popularity of LZ78 later dampened, possibly because for the first few decades after it was introduced, parts of LZ78 were
patent protected in theUnited States . The most popular form of LZ78 compression was the LZW algorithm, a modification of the LZ78 algorithm made byTerry Welch .References
* Jacob Ziv and Abraham Lempel; [http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1977_universal_algorithm.pdf "A Universal Algorithm for Sequential Data Compression"] , IEEE Transactions on Information Theory, 23(3), pp.337-343, May 1977.
Wikimedia Foundation. 2010.