Introduction to Computer Science II

Final Project for 2007-Dec-21

Data compression


This description has been critically changed on 2007-Dec-13, at about 11:20 pm. Here is a list of the changes:

  1. Instead of the previous description of the standard BitSet class, there is now a description of my custom-made subclass of BitSet, known as BitSequence.

  2. There are two more sample test files listed -- they are very small, and thus good for debugging.

  3. There is a description of my pre-made HuffmanTree and HuffmanNode classes, including instructions on how to obtain them. You should use these to write and debug your compressor and decompressor. Having done that, you should remove my Huffman code and write your own in its place. Thus, it is just a building block to encourage incremental development.


Compressing English text

This project is easily described: Write two Java programs---one that compresses English text, and another that decompresses it. There are some details about this project, however, that are provided here:

  1. The compression program should take two command-line arguments. The first is the name of the source (uncompressed) file, and the second is the name of the destination (compressed) file. Thus, it should be invoked like so:

    java Compressor some-file.txt some-file.txt.cmp

    The contents of some-file.txt should be compressed and written to the file some-file.txt.cmp. While you may write as many additional, supporting classes as you see fit, the main() method for this program should be a member of the Compressor class.

  2. The decompression program should also take two command-line arguments. The first is the filename of the compressed source, the second is the filename of the uncompressed destination. It should be invoked like this:

    java Decompressor some-file.txt.cmp uncompressed-copy-of-some-file.txt

    The compressed data in some-file.txt.cmp should be decompressed and written to the file uncompressed-copy-of-some-file.txt. The destination file should have the identical contents to the original file used to create the compressed representation; that is, the compression and decompression must be lossless.

    Again, you may write as many classes as needed, but the main() method of the decompression program must be named Decompressor.

  3. All code must be carefully formatted and well commented. It should be clear how the compressor and decompressor work by reading the code and its comments.

  4. Both programs must run correctly! A program that crashes or mangles the data it compresses and decompresses will get little credit.

To test your work, I recommend creating a few of your own, short files. Once your code is working, you can copy the following four files from my public directory to test the efficacy of your compression techniques; you should also copy the BitSequence class, which will be described later, and will be mightily handy:

~sfkaplan/public/cs12/project-final/tiny.txt
~sfkaplan/public/cs12/project-final/small.txt
~sfkaplan/public/cs12/project-final/alice-in-wonderland.txt
~sfkaplan/public/cs12/project-final/as-you-like-it.txt
~sfkaplan/public/cs12/project-final/BitSequence.java

A basic compressor

You should begin by writing a basic version of the compressor/decompressor pair. This version should use a static 0th-order Markov model of the data to be compressed. That model should be used to drive a Huffman coding of the input. Such a basic compressor can, at best, earn an A-.

Communicating the model

After the compressor samples the input to build its model, that model must be recorded in the beginning of output file. To do that, we shall use object serialization. Specifically, we will take advantage of built-in Java object types that are capable of recording themselves into, and restoring themselves from, a file.

First, you must store the model within an object that can be serialized. For this purpose, the Vector class is well suited for storing a 0th-order Markov model. A Vector behaves quite a bit like an array, but the syntax of how you would use it is different. For example, to make a 10-element Vector of integers and initialize the entries as having values twice that of their indices, and then print its contents:

  Vector v = new Vector();
  v.setSize(10);
  for (int i = 0; i < v.size(); i++) {
    v.set(i, 2*i);
  }
  for (int i = 0; i < v.size(); i++) {
    System.out.println("v[" + i + "] = " + v.get(i));
  }
    

This, being able to create, assign values into, and obtain values from a Vector, you can make one contain the complete table of frequencies that represent your model. Then, to record such an object's contents into a file, you can use its capability for serialization by following this example, where v is a pointer that can point to a Vector, and where filename is a String:

  ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(filename));
  oos.writeObject(v);
  oos.close();
    

Similarly, reconstructing the object based on the file's contents is done using the following steps.

  ObjectInputStream ois = new ObjectInputStream(new FileInputStream(filename));
  v = (Vector)ois.readObject();
  ois.close()
    

Communicating the compressed message

Once your Huffman tree is built, based on your Markov model, then you must use it to encode each symbol. Thus, for each symbol, you must move down the tree, starting at the root, adding a 0 each time you follow a left child, and a 1 each time you follow a right child. How do you store these bits? And then how do you record them in the file?

A BitSequence object is perfectly suited to this task. A BitSequence has arbitrary length and automatically resizes itself. Thus, you can set (assign the binary value 1) or clear (assign the binary value 0) to one bit after another within a BitSequence. The object remembers where it is in the sequence (much like a file does), and it also remembers how far into the sequence you have assigned bit values. Here's an example of using a BitSequence:

  BitSequence bs = new BitSeqence();
  bs.setNext();
  bs.clearNext();
  bs.clearNext();
  bs.setNext();
  bs.setIndex(0);
  int i = 0;
  while (bs.hasNext()) {
    System.out.println("bs[" + i + "] = " + bs.getNext());
    i++;
  }
    

Best of all, once you have assigned the complete sequence of bits to represent the file being compressed, you can record the BitSequence object's contents using serialization, just as you did with a Vector. That is, you need only call writeObject() on an ObjectOutputStream to write out the object, and then later call readObject() on an ObjectInputStream.

Critically, you may write (or read) multiple objects, in sequence, to a file using serialization. For example, if v points to a Vector and bs points to a BitSequence, then the following will serialize both to a file:

  ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(filename));
  oos.writeObject(v);
  oos.writeObject(bs);
  oos.close();
    

Using some pre-made Huffman code

First, obtain a copy of my pre-compiled class files for HuffmanTree and HuffmanNode, where are here:

~sfkaplan/public/cs12/project-final/HuffmanTree.class
~sfkaplan/public/cs12/project-final/HuffmanNode.class

Note that HuffmanNodes are used by HuffmanTrees -- you do not need to use HuffmanNodes directly in your code. Instead, your code should create and call on HuffmanTree objects. Here is a listing of the public methods of a HuffmanTree:

Recall that the purpose of this code is for you to use it as a stepping stone. Write a complete compressor and a complete decompressor using it. Debug that code, making sure that everything you have written works properly. Once you have done so, you can discard my Huffman classes and write your own. You are welcome to use the BitSequence class -- you need not replace that one with your own code.


A more advanced compressor

If you care to remove any limits to your grade you must improve upon the basic compressor. That is, you must do one of two things:

  1. Improve the modeling: Instead of using a static 0th-order Markov model, use something else that is more effective. The model could be dynamic, could be of higher (or mixed) order, could use a dictionary model, or employ any other method of capturing the statistical regularities of the input.

  2. Improve the encoding: Remove the limitation of Huffman coding by employing arithmetic coding. Note that to do so, you almost certainly will need to use BigDecimal objects to represent, calculate, and store (through serialization) the values involved.


Submitting your work

Use the cs12-submit program to submit only your source code, which should have been written as a number of .java files (one per class). Submit your work as project-final, something like this:

cs12-submit project-final *.java

Note that, this assignment being a final project, I expect beautiful code: well commented, carefully and consistently structured, employing clear logic and design.


This assigment is due on Friday, December 21st at 5:00 pm!

Scott F. H. Kaplan
Last modified: Mon Nov 3 14:12:24 EST 2003