CS 12 -- Final Lab

Compressing English text

This project is easily described: Write two Java programs---one that compresses English text, and another that decompresses it. More specifically, though:

The compression program should take two command-line arguments. The first is the name of the source (uncompressed) file, and the second is the name of the destination (compressed) file. Thus, it should be invoked like so:

java Compressor foo.txt foo.compressed

The contents of the original file (foo.txt) should be compressed and written to a new file (foo.compressed). While you may write as many classes to perform this task as you see fit, the main() method for this program should be a member of the Compressor class.
The decompression program should also take two command-line arguments. The first is the filename of the compressed, source file; the second is the filename of the uncompressed, destination file. It should be invoked like this:

java Decompressor foo.compressed foo-copy.txt

The compressed data (in the file foo.compressed) should be decompressed and written to a new file (foo-copy.txt). The destination file should have the identical contents to the original file used to create the compressed representation; that is, the compression and decompression must be lossless.

Again, you may write as many classes as needed, but the main() method of the decompression program must be named Decompressor.
You must write a file, named README, that describes your compression algorithm at an abstract level. You must make it clear not only that your compressor works, but that you understand how it works. You will get little credit for a compression strategy that you cannot explain.

A good default is to use a static, level-0 Markov model for the file being compressed, and to then encode your data using Huffman codes. You can get as much as 95 points for a very well written implementation of this strategy. If you want to do better than that, you must employ some additional or other strategies, such as dynamic modeling, arithmetic coding, dictionary or higher-order modeling, etc.
All code must be carefully formatted and well commented. It should be clear how the compressor and decompressor work by reading the code and its comments.
Both programs must run correctly! A program that crashes or mangles the data it compresses and decompresses will get little credit.
Both programs must run in a reasonable amount of time. The files that we will use for testing (below) are not large, and should take no more than a few minutes (that is, not more than 10 minutes) of CPU time on romulus both to compress and to decompress and one of those files.

To test your work, I recommend creating a few of your own, short files. Once your code is working, you can copy the following two files from my public directory to test the efficacy of your compression techniques. A level-0 Markov modeling w/ Huffman encoding should compress these files to roughly 50% of their original size:

~sfkaplan/public/cs12/alice-in-wonderland.txt ~sfkaplan/public/cs12/as-you-like-it.txt

Submitting your work

Use the cs12-submit program to submit only your source code, which should have been written as a number of .java files (one per class). Submit your work as lab-final.

This assigment is due on Friday, May 14^th at 5:00 pm! (NOTE THE TIME OF DAY)

Scott F. Kaplan

Last modified: Mon Apr 26 15:37:18 EDT 2004