Data Compression
Ian Lance Taylor
AIRS
1/31/89

Many computer systems lose needed disk space to programs or data files which are only used occasionally. An example might be tax tables used by a payroll program which is run only once a month. It's too much trouble to load the file from tape every month, but the result is permanently lost disk space. The best solution, of course, is to buy more or bigger disks. However, there is a cheaper, if somewhat less convenient, alternative: compressing the files. They can be stored in a compressed format, requiring less disk space, and expanded only while they are being used. Since compressing and expanding the files takes time, this is only worthwhile if the files aren't used too often. The disk space savings, however, can be impressive.

Many data compression algorithms are known. Some are simple but not very effective, while others are difficult to implement but can compress data enormously. I will describe one of the simple data compression methods here, and mention some others. I have written, and AIRS has donated to AMUS, a data compression program which uses one of the more sophisticated known algorithms.

Data compression generally involves taking a sequence of bytes and translating it into a new sequence of bytes; ideally the output will be shorter than the input. Generally a compression program reads an input file and produces an output file. Compression is also used by many newer modems to compress the data being sent over the phone line; of course, this only works if the modem on the other end knows how to expand the bytes it receives back to their original form. I will discuss data compression in terms of files, although similar considerations apply to modems.

The simplest form of data compression is called "run-length encoding." A "run" is a sequence of identical bytes; for example, "AAAABAAAAC" contains two runs of four A's.
All runs can be replaced by one occurrence of the character and a byte indicating the number of times the character occurs. The above sequence would become "A4B1A4C1", which saves 2 bytes. Of course, if there aren't very many long runs in the file, the 1's will waste too much space. The compression program can't just leave them out, however. The file might actually contain a sequence like "B222", which could be translated to "B123" (i.e. one B followed by three 2's), but could not be translated to "B23" (does that mean two B's or three 2's?).

Most run-length encoding programs use a special character to indicate that a compressed run follows. If this special character were X, the first sequence above would become "XA4BXA4C". To expand this, just copy every byte unless it's an X; for an X, read the next two bytes to see which byte to repeat and how often to repeat it. If there is an X in the input file, it has to be handled carefully; it can be encoded as a run even if only one or two appear (i.e. "X" would become "XX1"; note that this is not compressed at all).

SuperVue 2.1 used a limited form of run-length encoding: it encoded runs of spaces using characters which it knew could not appear in a SuperVue file (presumably more recent versions of SuperVue do something of this sort as well). Many implementations of the Kermit file transfer protocol also do run-length encoding. On some files, particularly files containing bitmaps of graphics or files with a lot of zeroes or spaces, run-length encoding can be very effective. Text or program (.lit) files generally cannot be compressed much, if at all, with this technique.

More sophisticated compression techniques typically use a different number of bits to represent each byte of the original file. Values which appear more frequently in the input file are translated using fewer bits in the output file. In our earlier sample sequence, "AAAABAAAAC", A might be translated as "0", B as "10", and C as "11".
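To make the run-length scheme concrete, here is a sketch of the escape-character method in a modern language (Python). The choice of "X" as the escape byte, a raw byte for the count, and the rule of only escaping runs of four or more are illustrative assumptions on my part, not a description of any particular program:

```python
ESCAPE = ord("X")  # the escape byte; "X" follows the example in the text

def rle_encode(data: bytes) -> bytes:
    """Copy literal bytes; replace long runs (and any escape byte) with ESCAPE, byte, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1  # extend the run, up to the 255 a single count byte can hold
        run = j - i
        if run >= 4 or data[i] == ESCAPE:
            # the three-byte escaped form only pays off for runs of four or more,
            # but a literal "X" must always be escaped, even alone ("X" -> "XX1")
            out += bytes((ESCAPE, data[i], run))
        else:
            out += data[i:j]
        i = j
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Copy every byte, except that ESCAPE introduces a (byte, count) pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            out += bytes((data[i + 1],)) * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

On the sample "AAAABAAAAC" this produces eight bytes, the same size as "XA4BXA4C" above, and a lone "X" still expands to the three-byte "XX1" form.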
The resulting bit sequence would be "000010000011" (remember that there are eight bits in each byte, so this actually requires only two bytes). Expanding this sequence is easy: reading it as a sequence of bits, every time a 0 is seen, output an A; every time a 1 is seen, read the next bit and, if it is 0, output a B, otherwise output a C.

This example assumes there are only three possible characters. To see why, consider translating D to a bit sequence. If the bit sequence began with 0 it would be confused with A, if it began with 10 it would be confused with B, and if it began with 11 it would be confused with C; but there isn't anything else to begin a bit sequence with! This illustrates the "prefix property:" no character can be translated to a bit sequence which is a prefix of the translation used for another character.

A "static" compression method translates a particular character as the same bit sequence throughout the entire file. For example, G might always become the bit sequence 1101. The best compression possible using a static method has been shown to be achieved by Huffman encoding, which was invented in 1952. The algorithm is too complex to describe here, but it is nevertheless relatively easy to understand; see the references section below to find more information.

There are some disadvantages to static compression methods. The input file must be read twice, once to get a frequency count and once to perform the actual encoding. The frequency count is used to decide which bytes get shorter translations. Also, the compression does not take advantage of any "locality" which might be present in the file. It is possible that a certain character appears many times at the beginning of the file, and then only occasionally afterward. It would often be better to give such a character a short translation at the beginning and a longer one later in the file (allowing a different character to use the shorter bit sequence).
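The three-character code walked through above is small enough to sketch directly. This Python fragment is only an illustration of the idea, with the translation table and sample string taken from the example:

```python
# The example code from the text: the commoner byte gets the shorter bit sequence.
CODE = {"A": "0", "B": "10", "C": "11"}

def encode(text):
    """Translate each character to its bit sequence and concatenate."""
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    """Read bits one at a time; the prefix property makes each match unambiguous."""
    inverse = {b: ch for ch, b in CODE.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])  # a complete bit sequence was seen
            current = ""
    return "".join(out)
```

Here encode("AAAABAAAAC") yields the twelve bits "000010000011", and decode turns them back into the original ten characters; the prefix property is what lets the decoder commit to a character the moment its bit sequence is complete.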
A compression method which uses different translations for the same character is called "adaptive." There are several adaptive compression methods; it is not known whether any of them achieves the greatest possible compression. Most adaptive methods read the input file only once, adjusting the output according to the frequency of the bytes seen so far. Some adaptive compression methods are Lempel-Ziv encoding (used in the compress program found on many UNIX (TM) systems), algorithm BSTW, and adaptive Huffman encoding.

I have written, and AIRS has distributed to AMUS, a file compression program based on adaptive Huffman encoding. Adaptive Huffman encoding uses a Huffman code to translate each byte of the input file, but rearranges the code according to the updated frequency count at each byte. This permits it to adapt to changing patterns in the file, without requiring it to read the file twice. The actual algorithm implemented was invented by Jeffrey Scott Vitter.

The program is called CMPACT and is quite simple to use. Typing "CMPACT filenm.ext" at the AMOS prompt will compress the file, creating "filenm.CMP". "CMPACT filenm.CMP", where the file was created by CMPACT, will recreate the original file. The output file can be specified as a second argument on the command line. Three switches can also be specified at the end of the command line: "/C" will force compression even if the file extension is ".CMP" (this can be used to show why compressing a file more than once does not make it smaller still; if it did, a file could be compressed again and again until it disappeared), "/E" will force expansion even if the file extension is not ".CMP", and "/D" will delete the output file if it already exists (CMPACT will not otherwise delete an existing file).

References

A good introduction to file compression, including the static Huffman code, can be found in Robert Sedgewick's "Algorithms" (Addison-Wesley, 1984).
A more complete discussion of compression in general is in the article "Data Compression" by Debra A. Lelewer and Daniel S. Hirschberg (ACM Computing Surveys, Vol. 19, No. 3, September 1987).

The static Huffman code was first described by D. A. Huffman in "A Method for the Construction of Minimum-Redundancy Codes" (Proceedings of the IRE, Vol. 40, 1952).

The adaptive Huffman algorithm used in the CMPACT program appears in "Design and Analysis of Dynamic Huffman Codes" by Jeffrey Scott Vitter (Journal of the Association for Computing Machinery, Vol. 34, No. 4, October 1987).

Finally, the run-length encoding used by the Kermit protocol is discussed in "Kermit: A File Transfer Protocol" by Frank da Cruz (Digital Press, 1987).