Project 3

For this assignment, you will implement a file compression/decompression program.


Compressing a file

A file is a sequence of characters, where each character is represented using 8 bits (even though in Java, a char is a 16-bit value). However, some characters appear more often than others. If we could represent frequently occurring characters using fewer than 8 bits (and infrequently occuring characters with more than 8), then perhaps the file could be made smaller than it normally is.

We are going to use a strategy known as recency modeling to compress files. Here is the basic idea: As we read through a file, one character at a time, we keep track of how recently each character has appeared. We assume that recently observed characters are likely to appear again soon. If that assumption is correct, then for each compressed character, we can emit not the character itself, but its recency rank (e.g. 5 to represent that it is the 5th most recently observed character. Since small ranks will appear more often, we can represent them using fewer bits. Thus, the recency rank helps us determine which characters to represent with fewer bits.


Tracking recency rank

First, we need to be sure that the concept of recency rank is clear. As an example, consider reading a file one character at a time. Specifically, consider being in the middle of reading a file, where the contents observed so far are:

I am the very model of a modern major gene

The recency ranks of the characters are, at this point:

  1  2  3    4    5  6  7  8  9  10 11 12 13 14 15 16 17 18
  e  n  g [space] r  o  j  a  m  d  o  f  l  y  v  h  t  I

If we read the next character and it is an r, then the ranks must be updated:

  1  2  3  4    5    6  7  8  9  10 11 12 13 14 15 16 17 18
  r  e  n  g [space] o  j  a  m  d  o  f  l  y  v  h  t  I

Notice that to make r the most recently observed character, we moved it to the front of the ranking, demoting by one position all those characters that had previously been of higher rank.

Tracking the recency ranking can easily be performed using a linked list. Specifically, we want to keep a linked list that stores the characters in their recency rank order. When a particular character is observed (that is, read in from the file), then we perform the following steps:

  1. Search the list for the desired character. If we want to know the rank of the character, then we can count links as we traverse the list.
  2. If the character is found, then remove the character from that position in the list.
  3. Insert the character at the front of the list.

Your program, part A

You must write a program that can perform both compression and decompression. Specifically, a user should be able to start the program by issuing a command using the following form:

java CompressorA compress myfile.txt myfile.compressed
java CompressorA decompress myfile.compressed myfile.decompressed

Specifically, the user can request either compress or decompress as the operation to be performed, followed by the name of input file and then the output file.

During compression, as the program reads the input file (that is, the original, uncompressed file), it must perform two tasks for each character read:

  1. Emit to the output file (that is, the new, compressed file), either:
    1. The recency rank of the current input character if the character has been observed before (that is, if it exists in the recency rankings), or
    2. A zero (0) followed by the character itself if this is the first observation of this character.
  2. Update the recency ranking of the characters, thus making the current input character the most recently observed one.

Thus, if the input file contains...

abbcbbac

...then the output file should be:

	0
	a
	0
	b
	1
	0
	c
	2
	1
	3
	3

During decompression, the program must reverse the process. That is, it should read the compressed file, using the information found within to (a) emit the correct, uncompressed characters by (b) keeping the recency rankings up-to-date, just as the compression code did.

Warning: Be careful in how you handle the newline character ('\n'). Because each line of output in the above format ends with a newline character, the presence of this character as a legitimate part of the compressed data may require special code for it to be handled properly.


Variable-length bit representations

If you look carefully, you will notice that the output representation from part A can only produce a file that is larger than the original. That is, for each original character, we are emitting at least one and sometimes more "compressed" characters.

To create a truly compressed file -- one that is smaller than the original -- then we must represent some of the characters from the original file with less than 8 bits. How do we do that?

Consider that characters with low recency rank are likely to occur more often than ones with high recency rank. Thus, if we can represent low recency rank values using smaller numbers of bits, while representing high recency rank values using more bits, then we can reduce the total size of many files.

We will divide the representation of recency ranks into two categories:

  1. Ranks 1 to 16: We will represent these values using only 5 bits -- a 1 to indicate the use of a "short" value, and then a 4 bit sequence to represent one of the 16 possible short rank values.

  2. All other ranks: These values shall be represented using 9 bits -- a 0 to indicate the use of a "long" value, and then the standard 8 bit sequence used to represent that position number.

In other words, here are the values that should be used:

Original recency rank Bit representation
0 000000000
1 10000
2 10001
3 10010
4 10011
5 10100
6 10101
7 10110
8 10111
9 11000
10 11001
11 11010
12 11011
13 11100
14 11101
15 11110
16 11111
17 000010001
18 000010010
19 000010011
20 000010100
... ...

Your program, part B

Based on your CompressorA code, write CompressorB. The difference between the two versions will be the bit representations used for the values. Specifically, there will be two critical changes:

  1. The recency ranks will be represented using a variable number of bits, as described above.
  2. There will be no newline characters to separate the elements of the output.

From my directory, you must copy three Java files:

~sfkaplan/public/cs11/BitStream.java
~sfkaplan/public/cs11/BitReader.java
~sfkaplan/public/cs11/BitWriter.java

The BitStream class is used by the other two classes to provide the ability to read or write either 8-bit characters or individual 1-bit values. You should use BitWriter and BitReader objects much as you would use FileWriter and FileReader objects to write and read the necessary values.

BitWriter

BitReader


What you must submit

Use the cs11-submit command to turn both your CompressorA and CompressorB code. Submit this work as project-3. Note that I will test both programs by using them to compress and then decompress arbitrary files, and then comparing the decompressed result to the original file.

Note that if you complete only part A of the project, you can receive no better than an A- for this project. By doing part B, you can receive full credit. Note that if you do part B, you can lose no more than 2/3 of a letter grade for your work on it (thus ensuring that an attempt at part B will be positive towards your grade.)


The project is due Friday, 17-Dec-2004, at 11:59 pm

Scott F. Kaplan
Last modified: Sun Dec 12 13:21:36 EST 2004