Data Compression
Ian Lance Taylor
AIRS
1/31/89

Many computer systems lose needed disk space to programs or data files which are only used occasionally. An example might be tax tables used by a payroll program which is run only once a month. It's too much trouble to load the file from tape every month, but the result is permanently lost disk space. The best solution, of course, is to buy more or bigger disks. However, there is a cheaper, if somewhat less convenient, alternative: compressing the files. They can be stored in a compressed format, requiring less disk space, and expanded only while they are being used. Since compressing and expanding the files takes time, this is only worthwhile if the files aren't used too often. The disk space savings, however, can be impressive.

Many data compression algorithms are known. Some are simple but not very effective, while others are difficult to implement but can compress data enormously. I will describe one of the simple data compression methods here, and mention some others. I have written, and AIRS has donated to AMUS, a data compression program which uses one of the more sophisticated known algorithms.

Data compression generally involves taking a sequence of bytes and translating it into a new sequence of bytes; ideally the output will be shorter than the input. Generally a compression program reads an input file and produces an output file. Compression is also used by many newer modems to compress the data being sent over the phone line; of course, this only works if the modem on the other end knows how to expand the bytes it receives back to their original form. I will discuss data compression in terms of files, although similar considerations apply to modems.

The simplest form of data compression is called "run-length encoding." A "run" is a sequence of identical bytes; for example, "AAAABAAAAC" contains two runs of four A's.
All runs can be replaced by one occurrence of the character and a byte indicating the number of times the character occurs. The above sequence would become "A4B1A4C1", which saves 2 bytes. Of course, if there aren't very many long runs in the file, the 1's will waste too much space. The compression program can't just leave them out, however. The file might actually contain a sequence like "B222", which could be translated to "B123" (i.e. one B followed by three 2's), but could not be translated to "B23" (does that mean two B's or three 2's?).

Most run-length encoding programs use a special character to indicate that a compressed run follows. If this special character were X, the first sequence above would become "XA4BXA4C". To expand this, just copy every byte unless it's an X; for an X, read the next two bytes to see which byte to repeat and how often to repeat it. If there is an X in the input file, it has to be handled carefully; it can be encoded as a run even if only one or two appear (i.e. "X" would become "XX1"; note that this is not compressed at all).

SuperVue 2.1 used a limited form of run-length encoding: it encoded runs of spaces using characters which it knew could not appear in a SuperVue file (presumably more recent versions of SuperVue do something of this sort as well). Many implementations of the Kermit file transfer protocol also do run-length encoding. On some files, particularly files containing bitmaps of graphics or files with a lot of zeroes or spaces, run-length encoding can be very effective. Text or program (.lit) files generally cannot be compressed much, if at all, with this technique.

More sophisticated compression techniques typically use a different number of bits to represent each byte of the original file. Values which appear more frequently in the input file are translated using fewer bits in the output file. In our earlier sample sequence, "AAAABAAAAC", A might be translated as "0", B as "10", and C as "11".
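To make the run-length scheme concrete, here is a sketch of the escape-character method in a modern language (Python). The choice of "X" as the escape byte, a raw byte for the count, and the rule of only escaping runs of four or more are illustrative assumptions on my part, not a description of any particular program:

```python
ESCAPE = ord("X")  # the escape byte; "X" follows the example in the text

def rle_encode(data: bytes) -> bytes:
    """Copy literal bytes; replace long runs (and any escape byte) with ESCAPE, byte, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1  # extend the run, up to the 255 a single count byte can hold
        run = j - i
        if run >= 4 or data[i] == ESCAPE:
            # the three-byte escaped form only pays off for runs of four or more,
            # but a literal "X" must always be escaped, even alone ("X" -> "XX1")
            out += bytes((ESCAPE, data[i], run))
        else:
            out += data[i:j]
        i = j
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Copy every byte, except that ESCAPE introduces a (byte, count) pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            out += bytes((data[i + 1],)) * data[i + 2]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

On the sample "AAAABAAAAC" this produces eight bytes, the same size as "XA4BXA4C" above, and a lone "X" still expands to the three-byte "XX1" form.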
The resulting bit sequence would be "000010000011" (remember that there are eight bits in each byte, so this actually requires only two bytes). Expanding this sequence is easy: reading it as a sequence of bits, every time a 0 is seen, output an A; every time a 1 is seen, read the next bit and, if it is 0, output a B, otherwise output a C.

This example assumes there are only three possible characters. To see why, consider translating D to a bit sequence. If the bit sequence began with 0 it would be confused with A, if it began with 10 it would be confused with B, and if it began with 11 it would be confused with C; but there isn't anything else to begin a bit sequence with! This illustrates the "prefix property:" no character can be translated to a bit sequence which is a prefix of the translation used for another character.

A "static" compression method translates a particular character as the same bit sequence throughout the entire file. For example, G might always become the bit sequence 1101. The best compression possible using a static method has been shown to be achieved by Huffman encoding, which was invented in 1952. The algorithm is too complex to describe here, but it is nevertheless relatively easy to understand; see the references section below to find more information.

There are some disadvantages to static compression methods. The input file must be read twice, once to get a frequency count and once to perform the actual encoding. The frequency count is used to decide which bytes get shorter translations. Also, the compression does not take advantage of any "locality" which might be present in the file. It is possible that a certain character appears many times at the beginning of the file, and then only occasionally afterward. It would often be better to give such a character a short translation at the beginning and a longer one later in the file (allowing a different character to use the shorter bit sequence).
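The three-character code walked through above is small enough to sketch directly. This Python fragment is only an illustration of the idea, with the translation table and sample string taken from the example:

```python
# The example code from the text: the commoner byte gets the shorter bit sequence.
CODE = {"A": "0", "B": "10", "C": "11"}

def encode(text):
    """Translate each character to its bit sequence and concatenate."""
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    """Read bits one at a time; the prefix property makes each match unambiguous."""
    inverse = {b: ch for ch, b in CODE.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])  # a complete bit sequence was seen
            current = ""
    return "".join(out)
```

Here encode("AAAABAAAAC") yields the twelve bits "000010000011", and decode turns them back into the original ten characters; the prefix property is what lets the decoder commit to a character the moment its bit sequence is complete.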
A compression method which uses different translations for the same character is called "adaptive." There are several adaptive compression methods; it is not known whether any of them achieves the greatest possible compression. Most adaptive methods read the input file only once, adjusting the output according to the frequency of the bytes seen so far. Some adaptive compression methods are Lempel-Ziv encoding (used in the compress program found on many UNIX (TM) systems), algorithm BSTW, and adaptive Huffman encoding.

I have written, and AIRS has distributed to AMUS, a file compression program based on adaptive Huffman encoding. Adaptive Huffman encoding uses a Huffman code to translate each byte of the input file, but rearranges the code according to the updated frequency count at each byte. This permits it to adapt to changing patterns in the file, without requiring it to read the file twice. The actual algorithm implemented was invented by Jeffrey Scott Vitter.

The program is called CMPACT and is quite simple to use. Typing "CMPACT filenm.ext" at the AMOS prompt will compress the file, creating "filenm.CMP". "CMPACT filenm.CMP", where the file was created by CMPACT, will recreate the original file. The output file can be specified as a second argument on the command line. Three switches can also be specified at the end of the command line: "/C" will force compression even if the file extension is ".CMP" (this can be used to show why compressing a file more than once does not make it smaller still; if it did, a file could be compressed again and again until it disappeared), "/E" will force expansion even if the file extension is not ".CMP", and "/D" will delete the output file if it already exists (CMPACT will not otherwise delete an existing file).

References

A good introduction to file compression, including the static Huffman code, can be found in Robert Sedgewick's "Algorithms" (Addison-Wesley, 1984).
A more complete discussion of compression in general is in the article "Data Compression" by Debra A. Lelewer and Daniel S. Hirschberg (ACM Computing Surveys, Vol. 19, No. 3, September 1987).

The static Huffman code was first described by D. A. Huffman in "A Method for the Construction of Minimum-Redundancy Codes" (Proceedings of the IRE, Vol. 40, 1952).

The adaptive Huffman algorithm used in the CMPACT program appears in "Design and Analysis of Dynamic Huffman Codes" by Jeffrey Scott Vitter (Journal of the Association for Computing Machinery, Vol. 34, No. 4, October 1987).

Finally, the run-length encoding used by the Kermit protocol is discussed in "Kermit: A File Transfer Protocol" by Frank da Cruz (Digital Press, 1987).