CS 29 -- Final Exam Solutions

Before we get to the questions and answers, here are some general thoughts about the test: It was a difficult one. Nearly every question turned out to be challenging -- particularly the ones near the end. I was, by and large, impressed with the overall thinking that went into the answers, and with some of the good and creative answers to the difficult problems. Many people suffered because of sloppiness in their thinking and presentation -- it was a common ailment. Those who presented their solutions, or attempted solutions, with less mess received more credit. Overall, a difficult test that revealed some good thinking.


You will find that questions are in green, answers are in red, and discussion is in blue. Note that the answers given here are meant to be thorough in their presentation -- more thorough than I would necessarily expect from someone actually taking the exam.

  1. Describe two different methods of multiplexing that can be used for sharing a single line. Why are these methods appropriate for use in the telephone system, but not appropriate for computing networks?

    Both Frequency Division Multiplexing (FDM), which divides line capacity by allocating small portions of the available frequency range to a connection, and Time Division Multiplexing (TDM), which divides line capacity by allocating small portions of time to a connection, allow for the sharing of a single line among multiple connections.

    These methods of sharing a high-capacity connection are static, in that the allocation to each connection being carried is fixed. For a telephone network, it is common for voice interchange, which requires a modest amount of data transmission, to occur nearly continuously during the life of the connection, thus making the static allocation over the shared line yield high utilization. Computer networks, however, tend to send large amounts of data in infrequent bursts. Thus, a small, static allocation of the shared line for a connection would have low throughput during the moments of transmission, and low utilization during the moments of idleness.
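
    To make the trade-off concrete, here is a small back-of-the-envelope calculation in Python. All of the figures (line capacity, connection count, traffic rates) are invented for illustration; the point is only the contrast between a steady load and a bursty one under a static allocation:

      # Hypothetical numbers: a 1 Mb/s line statically divided among
      # 100 connections, as FDM or TDM would do.
      line_capacity = 1_000_000            # bits/sec for the shared line
      connections = 100
      share = line_capacity / connections  # 10,000 b/s per connection

      # Voice-like load: roughly 8,000 b/s, sent nearly continuously.
      voice_demand = 8_000
      print(f"voice utilization of a share: {voice_demand / share:.0%}")

      # Computer-like load: idle 99% of the time, then a burst that wants
      # far more than 10,000 b/s but is capped at the static share.
      active_fraction = 0.01
      print(f"bursty utilization of a share: {active_fraction:.0%}")
      print(f"burst throughput capped at {share:,.0f} b/s")

    The steady stream uses its share well; the bursty host is simultaneously throttled during its bursts and wasteful of its share the rest of the time.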

    Most people at least observed that FDM and TDM were two methods of sharing a single line, and provided descriptions of the two methods that were, at least, roughly correct. In describing the applicability to phone and computer networks, though, people commonly divided their arguments, providing some justifications for FDM, and different ones for TDM. It is important to recognize that these two methods have exactly the same consequences for utilization and bandwidth, and that arguments for one are applicable to the other.

    It was also common for people to observe (correctly) that, if using FDM or TDM for a computing network, utilization would be low during those moments when most hosts were idle. Few people noticed, however, that during the bursts of transmission activity, bandwidth would be severely limited under FDM and TDM, thus providing another reason that these sharing mechanisms are poor ones for computing networks.

  2. Consider the use of Hamming codes to send 11-bit messages. Specifically, consider the message 10011011011. First, calculate the Hamming codeword for this message. Second, invert one message bit in that Hamming codeword (to represent a 1-bit error) and show how the recipient can use the check bits to correct the inverted bit. (That is, perform the error correction.)

    Place the message bits at the non-power-of-two positions in the final codeword, observing that we count positions starting at 1.

      Codeword |       1     0  0  1     1  0  1  1  0  1  1
      ---------|----------------------------------------------
      Position | 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
    	    

    To calculate the check bit at position 1, take the parity of those positions whose additive factors include 2^0 = 1: Specifically, bits 3, 5, 7, 9, 11, 13, and 15. The bits at those positions sum to 1 + 0 + 1 + 1 + 1 + 0 + 1 = 5, and 5 mod 2 = 1, so the parity bit at position 1 is 1.

    The other check bits are calculated in a similar fashion:

      Bit 2 = (Bit 3 + Bit 6 + Bit 7 + Bit 10 + Bit 11 + Bit 14 + Bit 15) mod 2
      Bit 4 = (Bit 5 + Bit 6 + Bit 7 + Bit 12 + Bit 13 + Bit 14 + Bit 15) mod 2
      Bit 8 = (Bit 9 + Bit 10 + Bit 11 + Bit 12 + Bit 13 + Bit 14 + Bit 15) mod 2
    	    

    Therefore, the final Hamming codeword is:

      Codeword | 1  1  1  0  0  0  1  1  1  0  1  1  0  1  1
      ---------|----------------------------------------------
      Position | 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
    	    

    Now consider an inversion of bit 13, representing a 1-bit transmission error. Begin with a counter c = 0, which will store the sum of the positions of check bits whose parity was incorrect. Test each check bit in turn to determine if its parity is correct. For example, for check bit 1, if the following is 0 then it is correct, and if it is 1 then it is incorrect:

      (Bit 1 + Bit 3 + Bit 5 + Bit 7 + Bit 9 + Bit 11 + Bit 13 + Bit 15) mod 2
    	    

    For each check bit whose parity is incorrect, add that check bit's index to c. So, if bit 13 has been inverted, check bits 1, 4, and 8 will be incorrect, and their position numbers will be added to c. Note that after all check bits have been evaluated, c = 13, indicating exactly the bit that has been inverted, and thus allowing it to be corrected.
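
    The procedure above is mechanical enough to capture in a few lines of code. Here is a Python sketch (the function names are mine, not part of the exam) that builds the codeword with even parity and check bits at the power-of-two positions, and then locates a single inverted bit:

      def is_power_of_two(k):
          return k & (k - 1) == 0

      def hamming_encode(message):
          # Place message bits at the non-power-of-two positions (1-indexed).
          code, pos, placed, bits = {}, 1, 0, iter(message)
          while placed < len(message):
              if not is_power_of_two(pos):
                  code[pos] = next(bits)
                  placed += 1
              pos += 1
          length = pos - 1
          # Check bit p covers the positions whose binary expansions include
          # p; even parity over those positions determines its value.
          p = 1
          while p <= length:
              code[p] = sum(code[i] for i in range(1, length + 1)
                            if i & p and i != p) % 2
              p <<= 1
          return [code[i] for i in range(1, length + 1)]

      def hamming_correct(code):
          # Sum the indices of the check bits whose parity test fails; the
          # sum c names the position of the inverted bit (0 if no error).
          length, c, p = len(code), 0, 1
          while p <= length:
              if sum(code[i - 1] for i in range(1, length + 1) if i & p) % 2:
                  c += p
              p <<= 1
          if c:
              code[c - 1] ^= 1
          return code, c

      codeword = hamming_encode([1,0,0,1,1,0,1,1,0,1,1])  # 111000111011011
      codeword[12] ^= 1                       # invert bit 13 (1-indexed)
      corrected, c = hamming_correct(codeword)
      print(c)                                # 13, as in the walkthrough above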

    The most common mistake here was to believe that the check bits could simply be appended, rather than placed at power-of-two positions. While this approach does work for 1-bit errors on message bits, it will fail if the 1-bit error occurs on one of the check bits. Another common error was to remember where the check bits went, but to forget how to compute their values.

    A small number of people changed endianness from the example given above -- that is, they counted bits from right to left. This change is of no consequence to the correctness of the Hamming code; it merely implies that the answer in this form is different from the one shown above. Some people also used odd-parity rather than the even-parity used above. This change simply implies that check bit values are all inverted, and does not change the efficacy of the Hamming code.

  3. Given the generator function G(x) = x^4 + x + 1 and the message function M(x) = x^7 + x^6 + x^4 + x^2 + x:

    (a) Calculate the transmission function T(x).
    (b) Consider that the transmission is damaged, such that the receiver receives R(x) = x^11 + x^9 + x^8 + x^7 + x^3 + x^2 + x + 1. Will this error be detected?

    (a) G(x) corresponds to the bits 10011, and M(x) corresponds to the bits 11010110. To calculate T(x), we perform the following sequence of steps:

    1. Increase the degree of every term in M(x) by r, the degree of G(x), to get x^r M(x). This step corresponds to appending r 0's to the message bits: 110101100000.
    2. Obtain R = x^r M(x) mod G(x). To obtain this value with the corresponding bits, we perform long division of G(x) into x^r M(x) and take the remainder:

      
                    11000010
              --------------
        10011 | 110101100000
               -10011
               ------
                 10011
                -10011
                ------
                  00001
                 -00000
                 ------
                   00010
                  -00000
                  ------
                    00100
                   -00000
                   ------
                     01000
                    -00000
                    ------
                      10000
                     -10011
                     ------
                       00110
                      -00000
                      ------
                       00110 = R
      		
    3. Form T(x) = x^r M(x) - R. By subtracting R, we guarantee that T(x) / G(x) leaves no remainder. The calculation with the corresponding bits is:

      
          110101100000
        -        00110
        --------------
          110101100110
      		

      Therefore, T(x) = x^11 + x^10 + x^8 + x^6 + x^5 + x^2 + x.

    (b) The bits corresponding to R(x) are 101110001111. If we take the XOR of the bits for T(x) and the bits for R(x), we can sum the number of 1's in the result to see how many bits differ:

    
          110101100110
      XOR 101110001111
      ----------------
          011011101001
    	    

    There are 7 bits that differ. A CRC detects all errors that invert an odd number of bits whenever (x + 1) is a factor of G(x); that is not the case for this G(x), but dividing R(x) by G(x) directly leaves a nonzero remainder. Therefore, this error will be detected.
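
    For those who want to check such calculations, here is a short Python sketch of the bitwise long division used above (the function name is mine); it reproduces both the checksum and the nonzero remainder that flags the damaged transmission:

      def crc_remainder(bits, gen):
          # Long division over GF(2): XOR the generator in wherever a
          # leading 1 remains; the final len(gen)-1 bits are the remainder.
          bits, gen = list(map(int, bits)), list(map(int, gen))
          for i in range(len(bits) - len(gen) + 1):
              if bits[i]:
                  for j, g in enumerate(gen):
                      bits[i + j] ^= g
          return ''.join(map(str, bits[len(bits) - len(gen) + 1:]))

      G = '10011'                          # G(x) = x^4 + x + 1
      M = '11010110'                       # M(x) = x^7 + x^6 + x^4 + x^2 + x
      padded = M + '0' * (len(G) - 1)      # x^r M(x): append r = 4 zeros
      R = crc_remainder(padded, G)         # '0110'
      T = M + R                            # '110101100110', as above
      print(crc_remainder(T, G))           # '0000': T is divisible by G
      print(crc_remainder('101110001111', G))  # nonzero: error detected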

    Most people remembered the basic form of this calculation. Many people made minor errors, or forgot steps. A few translated G(x) into bits incorrectly, or believed that the degree of G(x) was 5. Some divided incorrectly. Many forgot to multiply M(x) by x^r before dividing, and then simply appended an incorrect remainder.

    People answered the second part of the question in two different ways. Some divided R(x) by G(x) to find that there was a remainder. Others simply observed that an odd number of bits were flipped during the transmission, and that CRC's always detect such errors. For those who incorrectly calculated T(x), I graded this part of the question as though it had been correct.

  4. Consider a sender S and a receiver R that are connected on a network where the one-way latency is 250 ms (milliseconds), that the data rate is 32,000 b/s (bits per second), and that each frame is 1,000 b (bits). Assume that S and R are using sliding windows with selective repeat. How large must the window for S be in order to maximize utilization? Based on that window size, how many bits must be used in each frame as a frame identifier (that is, a frame number)?

    A 250 ms one-way latency implies a 500 ms round-trip propagation delay. It is this delay that we wish to ``mask'' by using a window large enough to ensure that the sender is continually sending, rather than waiting for acknowledgements. So, we seek first to show how many frames a sender could send within the round-trip delay.

    
              32,000 bits       1 sec    32 bits
      Step A: ----------- x ---------- = -------
                     sec    1,000 msec      msec
    
              500 msec         32 bits   16,000 bits
      Step B: -------------- x ------- = -----------------
                  round-trip      msec          round-trip
    
              16,000 bits/round-trip
      Step C: ---------------------- = 16 frames/round-trip
               1,000 bits/frame
    	    

    If a sender can transmit 16 frames during the minimum amount of time it would take for an acknowledgement to arrive, then the sender's window should be at least 16 frames. Because selective repeat requires that the range of frame numbers be at least twice as large as the window size, there must be at least 32 frame identifiers. Therefore, at least 5 bits of each frame must be devoted to the frame ID.
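
    The same arithmetic, expressed as a quick Python check (the variable names are mine):

      from math import ceil, log2

      latency_ms = 250                # one-way; the round trip is 500 ms
      rate_bps = 32_000
      frame_bits = 1_000

      rtt_ms = 2 * latency_ms
      bits_per_rtt = rate_bps * rtt_ms // 1_000    # 16,000 bits
      window = ceil(bits_per_rtt / frame_bits)     # 16 frames

      # Selective repeat: the frame-number range must be at least twice
      # the window, so 32 identifiers, needing ceil(lg 32) bits.
      id_bits = ceil(log2(2 * window))
      print(window, id_bits)                       # 16 5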

    It is important to notice that the calculations above are conservatively simple for the given information. They do not account for the 1/32 sec required for R to receive the entire 1,000-bit frame, which must be part of the round-trip delay. They similarly assume that the time required for either host to compute results is negligible, which is a reasonable assumption. Any answer that correctly computed some lower bound on the window size that masks the round-trip latency was given most or all of the credit.

    The most common mistake was to forget that selective repeat requires that the frame number range be at least twice the window size. Another common error was to misunderstand the role of latency in this situation, sometimes multiplying the latency by the frame transmission time.

  5. Consider the use of the distributed clock based on the happens before operator ->. Given two events a and b, is the following proposition true?

    If C(a) < C(b) then a->b

    No. Consider that events a and b occur in two different threads, T1 and T2, shown here:

    
        ^           ^
        |           |
      3 . y       b . 3
        |           |     ^
      2 . a       w . 2   |
        |           |    time
        |         z . 1
      1 . x         |
        |           |
       ---------------
        T1          T2
    	    

    The definition of -> allows for only three means by which to establish a happens-before relationship between events. We will consider each one and see that it does not apply here.

    1. Both a and b are events in the same thread, and a precedes b in the total ordering of events in that thread. Since a and b are not events in the same thread, this case does not apply.
    2. Two events a and b occur in different threads, where a is the sending of a message, and b is the receipt of that message. Neither a nor b is a communications event, so this case does not apply.
    3. Consider three events a, c, and b. If a->c and c->b, then a->b. There is no third event c in the diagram above that would allow for this application of transitivity.

    Therefore, in this diagram, we know that it is not true that a->b. However, it is possible that C(a) < C(b). Recall that each thread maintains a local clock C_i, and that the global clock is constructed such that any event e that occurs in thread j has global time C(e) = C_j(e). In the diagram above, each event is labeled with its local time. Therefore, C(a) = C_1(a) = 2 and C(b) = C_2(b) = 3, so it is indeed true that C(a) < C(b).

    Therefore, for this example, C(a) < C(b) is true, but a->b is false, thus proving that the proposed implication is false.
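
    The counterexample can even be simulated. This small Python sketch (my own, not course code) runs the two threads of the diagram with purely local clocks; since no messages pass between the threads, a and b are concurrent, and yet their clock values still compare as C(a) < C(b):

      def run_thread(events):
          # Each thread keeps a local clock that ticks once per event.
          clock, times = 0, {}
          for e in events:
              clock += 1
              times[e] = clock
          return times

      C1 = run_thread(['x', 'a', 'y'])    # T1: a is its second event
      C2 = run_thread(['z', 'w', 'b'])    # T2: b is its third event
      print(C1['a'], C2['b'])             # 2 3, so C(a) < C(b)
      # No same-thread or message edge connects a to b, so a->b is false.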

    People often provided the correct answer, but followed it with exceedingly weak justifications that did not establish that C(a) < C(b) could hold while a->b did not. It was particularly common to show an example where a->b was false; however, such an example was rarely followed with a demonstration of the corresponding clock times.

    Deeper types of errors involved a misunderstanding of how the clock operated. Some believed that the clock was maintained by a specific thread. Others assumed (incorrectly) that a total ordering must have been imposed with the => operator, which was neither necessary nor supported by the question. The clock is distributed and based only on the partial ordering. It was sometimes clear, from people's answers, that the notion of a partial ordering itself was not well understood, as people claimed things such as a ``partial ordering within a thread.'' The events within a thread are, by definition, totally ordered. The partialness of the global ordering is due to the lack of precedence between concurrent operations in different threads.

    An equally substantial and, in some ways, more general problem occurred with those who tried to support the proposition. In these cases, it was common to argue that a and b could either be in the same thread, or that a and b could be communications events in different threads, and under such circumstances the proposition would be true. It is important to see why this kind of argument is malformed: Contriving situations in which the proposition is true does not prove that the proposition is always true. One must instead try to disprove the proposition, showing either that it can be disproved in at least one case, or that no possible set of circumstances for a and b violates the proposition, thus proving its truth. It is incorrect to state that the proposition is ``true and false, depending on the circumstances''. Either it is or it isn't. If it isn't in any one case, then the proposition as a whole is false.

  6. For the three-phase parallel scanning algorithm, inputs were divided into n^(1/2) chunks of n^(1/2) elements each, thus requiring n^(1/2) processors, one per chunk, in the local-work phases. Is it essential that the input be divided in exactly this manner to yield a work-efficient algorithm? Or can some other division of the input for local work also be work efficient?

    There are other divisions that can be work efficient, but none with a depth smaller than O(n^(1/2)).

    Firstly, we will show another division that is work efficient. Recall that a work-efficient algorithm must perform work in O(n). The work for an input of size n is W(n) = p * D(n), where p is the number of processors used and D(n) is the depth of the parallel computation. Consider a division of the input into n^(1/4) chunks of O(n^(3/4)) elements each. One processor is allocated per chunk, so p = n^(1/4). The depth of the computation will be dominated by the first and third phases, which perform local work on the chunks, requiring depth in O(n^(3/4)). Therefore, for this division, W(n) = n^(1/4) * O(n^(3/4)) = O(n), confirming that the division does yield work efficiency. Notice, however, that the depth is greater with this division, implying a longer runtime for the parallel computation.

    Secondly, we note that any division that uses more than n^(1/2) chunks of fewer than n^(1/2) elements per chunk will not be work efficient. If we consider dividing the input into n^(3/4) chunks of n^(1/4) elements each, we not only raise the number of processors to p = n^(3/4), but we also increase the runtime to be in O(n^(3/4)). The increase in depth occurs because, for divisions that require more than n^(1/2) processors, the second phase of the algorithm dominates the depth, rendering the reduction in local work inconsequential.
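
    As a concrete illustration, here is a sequential Python sketch of the three-phase plus-scan (my own rendering; the middle phase is written as a serial scan of the chunk sums, whose depth grows with the number of chunks -- the effect that drives the analysis above). It is parameterized by the number of chunks so that the divisions discussed above can be compared:

      def three_phase_scan(xs, chunks):
          size = (len(xs) + chunks - 1) // chunks
          parts = [xs[i * size:(i + 1) * size] for i in range(chunks)]

          # Phase 1 (one processor per chunk): sum each chunk locally;
          # depth proportional to the chunk size.
          sums = [sum(p) for p in parts]

          # Phase 2: exclusive scan of the chunk sums; depth proportional
          # to the number of chunks, which dominates when chunks are small.
          offsets, acc = [], 0
          for s in sums:
              offsets.append(acc)
              acc += s

          # Phase 3 (local work again): exclusive scan within each chunk,
          # shifted by that chunk's offset.
          out = []
          for p, off in zip(parts, offsets):
              for x in p:
                  out.append(off)
                  off += x
          return out

      xs = list(range(16))
      assert three_phase_scan(xs, 4) == three_phase_scan(xs, 8)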

    Those who answered that different divisions cannot be work efficient received most of the credit if they demonstrated the second part of the above argument clearly. Those who observed the fully correct answer needed only to provide the first argument; the second was provided here because few people observed that ``over-application'' of the local-work trick was still work efficient, albeit slower than necessary.

    The answers to this question often established a solid grounding, only to fall apart within the last few steps (and sometimes right at the final, concluding statement). Some claimed that n^(1/2) is uniquely special, but then showed analysis to the contrary, failing either to believe or acknowledge their own analysis in drawing their conclusion. Others claimed that n^(1/2) is not unique, and performed analysis to show it, but then incorrectly claimed (in spite of correct analysis) that any division would yield work efficiency. Only a few correctly saw that larger chunks maintained work efficiency while increasing runtime (and tending towards serialization), and that smaller chunks violated work efficiency.

  7. Recall that the distributed shared memory (DSM) mechanism that we outlined required that a thread about to write to some virtual memory page x first invalidate that page for all other threads.

    1. How was this invalidation accomplished? What consistency model was implemented by this DSM?
    2. What changes would need to be made to the DSM so that it implemented a weak consistency model where write/write conflicts were not prevented, and such concurrent writes were broken by a tiebreaker? (You may assume the use of thread ID as a possible tiebreaker.) Be sure to specify what steps the DSM must perform for read and write operations, including the communications that must occur between threads.
    1. When a thread sought to write to a shared page, it would need exclusive permission to write to that page (that is, a write lock). Obtaining the lock required that the thread send a message to all other threads, requesting that each such thread remove read and write permission from any local copy of the given page that it might have. The other threads would have to send a message confirming that read/write permissions had been removed, thus allowing the writing thread to know that it is the only thread with a valid copy of the page. At that moment, the thread has obtained the write lock, and can apply read and write permissions to its copy of the page, thus allowing it to be modified. Any other thread that wishes to use the page will be prevented, by the lack of read/write permission, from using the page without first communicating with the writing thread to obtain a valid copy.

      Because invalidation and obtaining of the write lock are performed prior to the actual writing operation, and because only one thread may modify the page at a time, all read/write conflicts are resolved by imposing a total ordering. Read/read conflicts may still be allowed, and so this DSM implemented sequential consistency.



    2. If write/write conflicts are allowed, then the DSM may grant write permission to any thread for any page without first communicating with any other thread. When a write operation occurs, the DSM needs only to broadcast the new value of the page, and the receiving threads immediately adopt that new copy of the page upon receipt.

      If a thread holds a copy of a page that was last modified by thread i, but then receives a copy of the same page that was modified concurrently by thread j, where j > i, then the thread should adopt the page received from j. Note that this notion of concurrent page modifications implies some event ordering similar to the one used for a global distributed clock. Such timestamps can be maintained by including local clock information with the broadcast of the page update.
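
      A toy Python sketch of this weaker protocol (the class and message shapes are invented for illustration): writes broadcast immediately, with no invalidation round, and a (clock, thread ID) tag resolves concurrent versions in favor of the higher thread ID:

        class Node:
            def __init__(self, tid):
                self.tid, self.peers, self.clock = tid, [], 0
                self.page, self.stamp = None, (0, -1)   # (clock, writer id)

            def write(self, data):
                # No invalidation round: tag the page and broadcast it.
                self.clock += 1
                self.page, self.stamp = data, (self.clock, self.tid)
                for peer in self.peers:
                    peer.receive(data, self.stamp)

            def receive(self, data, stamp):
                self.clock = max(self.clock, stamp[0])  # merge clock info
                if stamp > self.stamp:    # compare clock, then thread ID
                    self.page, self.stamp = data, stamp

        a, b = Node(1), Node(2)
        a.peers, b.peers = [b], [a]
        a.write("v")          # both nodes now hold thread 1's copy

      Python's tuple comparison performs the tiebreaking: when two writes carry equal clock values (concurrent modifications), the copy from the larger thread ID wins, matching the j > i rule above.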

    I was not in search of anything deep for this question -- it was mostly one to determine whether or not you had really been paying attention to this topic. Any mention of the use of message passing, read/write permission bits, and their relationship to invalidation got much of the credit. Many people were confused in their answer to the second part, revealing that they weren't quite sure just what a write/write conflict was, and what it would mean to prevent (or not to prevent) it.

  8. We want a solution to the problem of removing from a vector of integers all duplicate values. A sequential algorithm to solve this problem could perform a linear traversal of the input vector, using a hash table to determine whether or not each value had been seen before, and discarding those that had.

    1. Create a parallel algorithm for this problem based on the sequential one (loosely) described above. Use pseudocode and comments to express your algorithm. Be sure the points at which parallel computation is invoked are clearly expressed and noted.
    2. What are the work and depth of your parallel algorithm? Is its runtime faster than the sequential algorithm's? If you reach a point in this analysis where you cannot reduce the expression further (such as a recurrence relation), you may stop there. In other words, I want to see that you know how each component of your algorithm contributes to the work and depth.

    An algorithm to perform this task can be somewhat complex in its details. I will provide here a sketch of one approach that works, as well as a basic analysis of it. If you are interested in hearing about this solution in more detail, see me.

    In a sequential algorithm for this problem, each element would be examined, and its value looked up in a hash table. If the hash table contained the value, then it would be skipped, and if the hash table did not contain that value, then the value would be added to the table and to a list of non-duplicates to be returned.

    In a parallel version, we can use the hash table somewhat differently. In particular, we will consider a hash table with 1-element buckets and no collision resolution. A processor is assigned to each element in the input. Each processor attempts to write its value and the index of that value into the hash table. We need only to assume that when two processors try to write to the same location, one of them will succeed. Therefore, the write operations performed by some processors will not succeed, as some other processor will succeed in writing to the same location. Notice that two processors working with duplicate values will always write to the same location in the table.

    To determine which processors wrote successfully, we have each processor read from the table entry to which it just wrote. If it finds that its value is in that entry, then either it or some other processor with the same value succeeded. By also examining the index stored in the entry, a processor can determine whether or not it is the one responsible for the successful write. If it is, it must hold on to its value as one that must be returned as a non-duplicate. Those processors that failed to write that very same value (that is, the duplicates) will not try to return their values in the result, thus eliminating the duplicates.

    Those processors whose values were not in the hash table recursively begin the elimination procedure again. If the hashing function is a good one, then the number of processors that must recur will diminish by a constant fraction with each recursion. Eventually, all non-duplicates will be identified and returned.

    This particular approach requires O(n) processors, one per input element at the top level of the recursion. With each recursive call, there are a number of constant-time operations. However, there is also a need to record those values that are identified as non-duplicates at each level of recursion. This step can be done using a plus-scan to perform array packing (elimination of the null entries) in O(lg n) time. The number of recursions should be O(lg n) because of the reduction by a constant factor in input size with each recursion. Therefore, the depth is in O(lg^2 n), and the work in O(n lg^2 n).
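
    Here is a sequential Python simulation of this approach (my own sketch; the table sizing and the use of Python's built-in hash are arbitrary choices). A dictionary stands in for the one-element-bucket table, and dictionary overwrite stands in for the arbitrary winner among concurrent writes:

      def remove_duplicates(values):
          result, pending = [], list(enumerate(values))
          while pending:
              table = {}
              # "Concurrent" writes: the last write to a bucket wins.
              for idx, v in pending:
                  table[hash(v) % (2 * len(pending))] = (idx, v)
              survivors = []
              for idx, v in pending:
                  widx, wv = table[hash(v) % (2 * len(pending))]
                  if wv == v:
                      if widx == idx:
                          result.append(v)   # this processor's write won
                      # else: a duplicate of the winner; discard it
                  else:
                      survivors.append((idx, v))  # hash collision: recur
              pending = survivors
          return result

      print(remove_duplicates([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]))
      # one copy of each value (the order may differ from the input)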

    This problem was extremely difficult. I did not expect anyone to produce a complete answer. I wanted to give you an opportunity to try to create a substantial parallel algorithm for a non-trivial problem. Any steps taken in a reasonable direction to solve this problem were given credit.

  9. We attempted to parallelize insertion sort, only to find that our parallel version was work inefficient. Specifically, recall that sequential insertion sort requires time in O(n^2). Here is a copy of the parallel insertion sort that we outlined, written in NESL:

    
    function insertionSort (v) =
      if (#v < 2) then
        v                                % a 0- or 1-element sequence is sorted %
      else
        let
          element = v[0];                % the element to insert %
          rest = insertionSort(drop(v, 1));         % sort the rest recursively %
          index = count({x < element: x in rest})   % parallel compare, then sum %
        in
          take(rest, index) ++ [element] ++ drop(rest, index);
    	    

    We found that there would be n recursions (one per element), and that the limiting factor for each recursive call was the count() function, which had depth in O(lg n) to sum all of the true values produced by the parallel comparison between the element to be inserted and the sorted segment, rest. Therefore, D(n) = O(n lg n), and W(n) = n * D(n) = O(n^2 lg n), which is not in O(n^2).

    Change this parallel insertion sort algorithm to be work efficient. Describe (in prose or with pseudocode) the changes, and show the depth and work analysis to demonstrate that your new algorithm is work efficient.

    We would like to reduce the work by a factor of lg n, thus leaving us two basic options: either reduce the depth by this factor, or reduce the number of processors. We were unsuccessful in attempting to reduce the depth of the count() function by lg n, but we have not yet attempted to reduce the number of processors. Therefore, we will apply the local-work trick in standard fashion.

    For each insertion of a new element, consider dividing the sorted segment into n/lg n chunks of lg n elements each. One processor is allocated per chunk, and each of the n/lg n processors can locally compare its lg n elements against the new one to be inserted. This operation has O(lg n) depth. Notice that this depth is greater than the O(1) depth achieved by using n processors in the example above.

    We still must perform the count() operation on the results of the comparisons. However, we know how to sum values in O(lg n) depth using n/lg n processors. (Tree summation w/ local work trick again.)

    So, each recursive step of this algorithm would require O(lg n) depth and n/lg n processors. There are n such recursive calls for an input of n elements to be sorted. Therefore, the total depth of the algorithm is D(n) = n * O(lg n) = O(n lg n). Finally, the work is W(n) = n/lg n * O(n lg n) = O(n^2), which is work efficient.
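
    A sequential Python rendering of one such insertion step (my own sketch; it inserts each element into a growing sorted prefix rather than recurring from the end, but the per-insertion structure is the same). The chunked local comparisons and the summation of the per-chunk counts are made explicit:

      from math import ceil, log2

      def insert_position(rest, element):
          if not rest:
              return 0
          chunk = max(1, ceil(log2(max(len(rest), 2))))  # lg n per chunk
          # Each of the ~n/lg n "processors" counts, within its own chunk,
          # the elements less than the new one: O(lg n) depth locally.
          counts = [sum(x < element for x in rest[i:i + chunk])
                    for i in range(0, len(rest), chunk)]
          # Summing the n/lg n partial counts stands in for the tree
          # summation, which has O(lg n) depth with n/lg n processors.
          return sum(counts)

      def insertion_sort(v):
          rest = []
          for element in v:
              i = insert_position(rest, element)
              rest = rest[:i] + [element] + rest[i:]
          return rest

      print(insertion_sort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]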

    People took two basic approaches to this problem: Some tried to reduce the computation performed at the count() operation to something that required O(1) time, while others tried to reduce the number of processors used. For the first approach, a number of people succeeded in describing something reasonable that would use n processors to compare one value with its neighbor, and thus determine (in parallel) at which index the new element belonged in constant time. For this approach, however, I required that the answer be well specified, as some people attempted to describe something like this, and either dreamed up an approach that was much too complicated, or one that was incorrect.

    The latter approach was taken in a number of different forms. Many started heading in the correct direction, only to misunderstand how the local-work trick is applied. Significantly, some people confused the level at which the local-work trick was applicable, trying to limit where the recursion bottoms out, as though insertion sort were a divide-and-conquer algorithm. If you made this mistake, be sure that you understand why such a line of argument won't work with this non-divide-and-conquer algorithm.


Scott F. Kaplan
Last modified: Wed Jan 9 16:30:13 EST 2002