For this assignment, you will implement a file compression/decompression program.
A file is a sequence of characters, where each character is represented using 8 bits (even though in Java, a char is a 16-bit value). However, some characters appear more often than others. If we could represent frequently occurring characters using fewer than 8 bits (and infrequently occuring characters with more than 8), then perhaps the file could be made smaller than it normally is.
We are going to use a strategy known as recency modeling to compress files. Here is the basic idea: As we read through a file, one character at a time, we keep track of how recently each character has appeared. We assume that recently observed characters are likely to appear again soon. If that assumption is correct, then for each compressed character, we can emit not the character itself, but its recency rank (e.g. 5 to represent that it is the 5th most recently observed character. Since small ranks will appear more often, we can represent them using fewer bits. Thus, the recency rank helps us determine which characters to represent with fewer bits.
First, we need to be sure that the concept of recency rank is clear. As an example, consider reading a file one character at a time. Specifically, consider being in the middle of reading a file, where the contents observed so far are:
I am the very model of a modern major gene
The recency ranks of the characters are, at this point:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 e n g [space] r o j a m d o f l y v h t I
If we read the next character and it is an r, then the ranks must be updated:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 r e n g [space] o j a m d o f l y v h t I
Notice that to make r the most recently observed character, we moved it to the front of the ranking, demoting by one position all those characters that had previously been of higher rank.
Tracking the recency ranking can easily be performed using a linked list. Specifically, we want to keep a linked list that stores the characters in their recency rank order. When a particular character is observed (that is, read in from the file), then we perform the following steps:
You must write a program that can perform both compression and decompression. Specifically, a user should be able to start the program by issuing a command using the following form:
java CompressorA compress myfile.txt myfile.compressed
java CompressorA decompress myfile.compressed myfile.decompressed
Specifically, the user can request either compress or decompress as the operation to be performed, followed by the name of input file and then the output file.
During compression, as the program reads the input file (that is, the original, uncompressed file), it must perform two tasks for each character read:
Thus, if the input file contains...
abbcbbac
...then the output file should be:
0 a 0 b 1 0 c 2 1 3 3
During decompression, the program must reverse the process. That is, it should read the compressed file, using the information found within to (a) emit the correct, uncompressed characters by (b) keeping the recency rankings up-to-date, just as the compression code did.
Warning: Be careful in how you handle the newline character ('\n'). Because each line of output in the above format ends with a newline character, the presence of this character as a legitimate part of the compressed data may require special code for it to be handled properly.
If you look carefully, you will notice that the output representation from part A can only produce a file that is larger than the original. That is, for each original character, we are emitting at least one and sometimes more "compressed" characters.
To create a truly compressed file -- one that is smaller than the original -- then we must represent some of the characters from the original file with less than 8 bits. How do we do that?
Consider that characters with low recency rank are likely to occur more often than ones with high recency rank. Thus, if we can represent low recency rank values using smaller numbers of bits, while representing high recency rank values using more bits, then we can reduce the total size of many files.
We will divide the representation of recency ranks into two categories:
Ranks 1 to 16: We will represent these values using only 5 bits -- a 1 to indicate the use of a "short" value, and then a 4 bit sequence to represent one of the 16 possible short rank values.
All other ranks: These values shall be represented using 9 bits -- a 0 to indicate the use of a "long" value, and then the standard 8 bit sequence used to represent that position number.
In other words, here are the values that should be used:
Original recency rank | Bit representation |
0 | 000000000 |
1 | 10000 |
2 | 10001 |
3 | 10010 |
4 | 10011 |
5 | 10100 |
6 | 10101 |
7 | 10110 |
8 | 10111 |
9 | 11000 |
10 | 11001 |
11 | 11010 |
12 | 11011 |
13 | 11100 |
14 | 11101 |
15 | 11110 |
16 | 11111 |
17 | 000010001 |
18 | 000010010 |
19 | 000010011 |
20 | 000010100 |
... | ... |
Based on your CompressorA code, write CompressorB. The difference between the two versions will be the bit representations used for the values. Specifically, there will be two critical changes:
From my directory, you must copy three Java files:
~sfkaplan/public/cs11/BitStream.java
~sfkaplan/public/cs11/BitReader.java
~sfkaplan/public/cs11/BitWriter.java
The BitStream class is used by the other two classes to provide the ability to read or write either 8-bit characters or individual 1-bit values. You should use BitWriter and BitReader objects much as you would use FileWriter and FileReader objects to write and read the necessary values.
public BitWriter (String filename) throws
IOException
Construct a BitWriter object that can write bits to
the file named by filename.
public void close () throws IOException
Close the output stream, flushing the bits to the file.
public void writeBit (boolean value)
Write 1 bit to the output file. If value is
true, then a 1 is written, and if
value is false, then a 0 is
written.
public void writeChar (char value)
Write 8 bits to the output file. Take the lower 8 bits of
value, which, for characters read from a file, will
cover all possible character values.
public BitReader (String filename) throws
IOException
Construct a BitReader object that can read bits
from the file named by filename.
public void close () throws IOException
Close the input stream.
public boolean atEOF ()
If the last bit of the input has been read, return
true; otherwise, return false.
public boolean readBit ()
Read 1 bit from the input file. If the bit is a 1
then true is returned; if the bit is a 0
then false is returned.
public char readChar ()
Read 8 bits from the input file. Return those 8 bits as a
character.
Use the cs11-submit command to turn both your CompressorA and CompressorB code. Submit this work as project-3. Note that I will test both programs by using them to compress and then decompress arbitrary files, and then comparing the decompressed result to the original file.
Note that if you complete only part A of the project, you can receive no better than an A- for this project. By doing part B, you can receive full credit. Note that if you do part B, you can lose no more than 2/3 of a letter grade for your work on it (thus ensuring that an attempt at part B will be positive towards your grade.)