This page contains questions that people have asked via email while studying for the exam, and the answers that I provided. With a little luck, it will provide a useful study aid for everyone. Names have been removed to protect the guilty.
Q: Suppose we're using hubs and switches in our "more modern Ethernet" which creates the illusion of a shared medium while keeping traffic as local as possible. Can we still tap in to the "shared medium" by branching off a live Ethernet cable? How would the hub handle the traffic? Or does the revised protocol specifically prohibit this sort of thing?
A: Hubs and switches presuppose that the cables connecting individual hosts to networks are intended to be just that: individual. Even under the older Ethernet standards for twisted pair (that is, phone-wires-on-steroids), the wires themselves are not ones into which you can tap -- they're not actually shared. The illusion of a shared medium is created by the hub or (more selectively) by the switch. If you tried to connect two hosts to a single twisted-pair wire and plug it into a hub, it might work, if you can get the electrical properties of the line to behave, because hubs repeat all traffic. If you do such a thing on a switch, it isn't likely to work at all, because the switch has to assume that it knows the Ethernet address of the host on the other end of each connection.
Q: I read at http://www.rware.demon.co.uk/encoding.htm that Manchester Encoding (used in Ethernet) requires a preamble to synchronize the clock while Differential Manchester Encoding (used in Token Ring) does not. How does that work, and why is that so? (Since both encodings guarantee signal changes halfway through each time slice, both encodings can give the same clock sync info, so I'm not sure how Manchester still requires a preamble.) Or is it just a noise issue as you've indicated on the website?
A: I'm not sure that I can answer the question, ``How does that work?'' The on-off pattern used to synchronize sender and receiver is convenient because it is a simple square wave that is easy to generate. The reason it is used, however, is merely for synchronizing purposes -- to ensure that both sender and receiver agree on when real data starts to flow. Why this approach is used for one type of encoding and not for the other is unclear to me. There may be a disciplined reason, or it may be an historical remnant: The developers discovered that the preamble wasn't strictly necessary, and rather than try to remove it from the existing Manchester encoding, they removed it from the newly developed Differential Manchester encoding. (Note: The previous sentence is merely speculation!) It is a fine question, and I would be happy to know the answer, but it's not an issue with which you should be concerned for the exam.
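For the curious, here is a small Python sketch (my own illustration, not exam material) of how the two encodings turn bits into half-bit signal levels. Note that both produce a transition in the middle of every bit time, which is the point about clock recovery. Conventions differ between references; I'm assuming the IEEE 802.3 convention for Manchester (a 0 is sent as high-then-low) and the common "transition at the bit boundary means 0" convention for Differential Manchester.

    def manchester(bits):
        """IEEE 802.3 convention: 0 -> high,low and 1 -> low,high within each bit time."""
        out = []
        for b in bits:
            out += [0, 1] if b else [1, 0]     # two half-bit levels per data bit
        return out

    def differential_manchester(bits, level=0):
        """Always a mid-bit transition; a transition at the bit boundary encodes a 0."""
        out = []
        for b in bits:
            if b == 0:
                level ^= 1                     # 0: flip at the start of the bit time
            out.append(level)
            level ^= 1                         # mandatory mid-bit flip (the clock)
            out.append(level)
        return out

    print(manchester([1, 0, 1, 1]))                # [0, 1, 1, 0, 0, 1, 0, 1]
    print(differential_manchester([1, 0, 1, 1]))   # a mid-bit transition in every bit, either way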
Q: Why exactly are we stopping at (log n)-size chunks? When we used the definition work = (total processors) (total time), that sounded OK, because the algorithm could have used up to n/2 processors, etc. But later we seemed to be saying that an algorithm was work-efficient if the total number of parallel operations was of the same order as the total number of serial ops. And measuring work by number of operations makes much more sense to me. By that definition scanning is work-efficient even if we divide all the way down the tree, since scanning a (log n) block takes the same number of ops whether in serial or in parallel.
A: The second definition that you cite was used to provide a basic intuition to help ground the notion of ``work'', but it's not the technical definition, and for a good reason. Specifically, the work is (total processors) (total time) because those processors must *all* be occupied for the full runtime of the calculation. You could try to imagine that processors not used for particular portions of the calculation could be re-allocated to some other calculation. However, this kind of fine-grained scheduling of processors would be *extremely* difficult. Instead, the total set of processors is reserved for the calculation during its entire duration. So really, ``work'' is not the sum of all operations performed, but rather the potential number of operations that *could* be performed by the total set of processors within the given runtime. So, for this particular parallel algorithm, n/2 processors will be reserved at every step, even if some much smaller number are actually performing a calculation.
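As a quick numeric illustration (my own, using a simple tree-style reduction as a stand-in for the algorithm in question): the operations actually executed stay around n, while the processors-times-time product -- the definition of work used above -- comes out to about (n/2) lg(n).

    import math

    def ops_performed(n):
        """Operations actually executed by a tree-style reduction: n/2 + n/4 + ... + 1."""
        return n - 1

    def processors_times_time(n):
        """The 'work' definition above: processors reserved times parallel steps."""
        p = n // 2                      # n/2 processors held for the whole computation
        t = math.ceil(math.log2(n))     # about lg(n) parallel steps
        return p * t

    for n in (16, 1024, 2**20):
        print(n, ops_performed(n), processors_times_time(n))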
Q: I'm not sure I understand the concept of scalable algorithms - specifically the concept of efficiently scalable. I have the definitions and all, but have we seen any non-scalable algorithms? or non-efficiently scalable algorithms?
A: Firstly, a scalable algorithm is one that, basically, doesn't reach a limit on the number of processors that it can use. That is, as the input size gets larger, more processors can be applied to the larger inputs. A non-scalable algorithm will reach some limit such that, for a larger input size, a larger number of processors cannot be used to improve the runtime.
So, an efficiently-scalable algorithm is one that is both work-efficient and scalable. That is, as the input size grows, the number of processors can be made to grow at the same rate (within a constant factor) in order to keep the runtime down. If you had a parallel algorithm that, for every doubling of the input, required a squaring of the number of processors, it would be scalable -- more processors help -- but not efficiently scalable, because such an increase in the number of processors would imply a lack of work efficiency.
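If a toy calculation helps, here is a quick Python sketch (my own numbers, not from lecture) of why doubling the processor count per doubling of the input can preserve work efficiency while squaring it cannot. It assumes, for the sake of the comparison, that the parallel time stays around lg(n).

    import math

    # If the processor count grows linearly with n, the reserved work p*t stays
    # within a constant factor of n*lg(n); if it must be squared at every
    # doubling of n, it quickly dwarfs that bound.
    n, p_linear, p_squared = 8, 4, 4
    for _ in range(4):
        n, p_linear, p_squared = n * 2, p_linear * 2, p_squared ** 2
        t = math.ceil(math.log2(n))
        print(f"n={n:4d}  n*lg(n)={n * t:5d}  "
              f"p*t (linear)={p_linear * t:5d}  p*t (squared)={p_squared * t:14d}")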
Q: On 11/2 we did some asymptotic bound analysis of the NESL for-each construct. Do we need to understand that at all?
A: I'll give away part of the game and say, ``no''. It's good material to understand, but I don't want to put NESL-specific questions on the exam.
Q: I'm not clear on how we can parallelize the outer loop in the Sieve of Eratosthenes to find primes. We went over it in class, but is it actually possible?
A: Yes, it is possible, and understanding how we did it is important. Recall that to find all of the primes between 2 and n, the outer loop of the Sieve iterates over the integers between 2 and sqrt(n) in search of the primes within that range, while the inner loop eliminates the multiples of each such prime, all the way up to n. We saw how to parallelize the inner loop, but that left the outer loop as a source of serialization -- the algorithm had to iterate.
So, we observed that if we simply *had* a list of the primes between 2 and sqrt(n), we could, in parallel, eliminate all multiples of the primes between 2 and sqrt(n). That is, instead of eliminating all of the multiples of 2 and *then* eliminating the multiples of 3 and *then* eliminating the multiples of 5, etc., we could simply eliminate the multiples of those primes concurrently. The big question was, ``How do we obtain that list of primes between 2 and sqrt(n)?'' The answer is, ``Recursively!''
Therefore, the findPrimes() function would first be called with n, requesting that it find the primes between 2 and n. It would then be recursively called with sqrt(n), and then with sqrt(sqrt(n)), etc., until it was called with something sufficiently small to be a base case, such as '2' or '3'.
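If it helps to see that structure concretely, here is a rough sequential Python sketch of the recursion (my own code, not the version from class); the loops marked below are the ones that would run in parallel over all of the small primes and all of their multiples at once.

    import math

    def findPrimes(n):
        """Return the list of primes between 2 and n."""
        if n < 4:                                   # base case: small enough to answer directly
            return [p for p in (2, 3) if p <= n]
        small_primes = findPrimes(math.isqrt(n))    # recurse: the primes between 2 and sqrt(n)
        composite = [False] * (n + 1)
        for p in small_primes:                      # conceptually in parallel: every small prime...
            for m in range(p * p, n + 1, p):        # ...and every multiple of it, concurrently
                composite[m] = True
        return [i for i in range(2, n + 1) if not composite[i]]

    print(findPrimes(100))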
Q: Is there a work efficient parallel version of insertion sort?
A: We were unable to come up with one. If one exists, I'd be happy to know what it is, but I don't know of one right now.
Q: What is the work bound of Array In/LL-out merge sort and if possible what is the best approach to analyzing such an algorithm?
A: Well, since sequential mergesort has a runtime of O(n lg n), and the array-in/LL-out version of parallel mergesort is work-efficient, it must be the case that the work bound for that mergesort is O(n lg n).
I'm not sure I know how to answer the second part of the question. We worked through the analysis of array-only parallel mergesort, and observed that its limitation was the number of processors needed to copy values into a new vector to be returned. Noticing that a linked list would be much better for this particular step, we analyzed LL-only parallel mergesort in the hopes that it would fix things. Unfortunately, we found that LL-only parallel mergesort was not work-efficient either, but for a different reason. In this case, the lack of efficiency was due to the inherently sequential traversal of a linked list to find a median element, which increased the runtime too much for work efficiency.
So, we glued the two solutions together with the key observation: We need only to ``convert'' the array inputs into linked-list outputs at the very bottom level of the recursion. Since that conversion only has to be done once, and its asymptotic bounds are sufficiently small, we were able to put together the bounds for the array-in steps with the bounds for the LL-out steps, and get something that summed to a work-efficient algorithm.
This algorithm and its analysis serve as a good example of yet another way to find work efficiency. No local-work trick was involved; instead, a change of data structures to eliminate the causes of work inefficiency did the trick.
Q: Other than the number of bits, and their placement at powers of 2, and GENERALLY that each check bit is a parity bit for the bits "it's responsible for," do we need to know anything about Hamming Codes? If so, I suppose my question is: which bits is each check bit responsible for, especially the first one?
A: Firstly, yes, you should know (a) that Hamming codes are a general form for 1-bit error correction, and (b) how Hamming codes work, exactly.
So, to answer that second part of the question, consider a Hamming code for ASCII characters, where the number of message bits m = 7, and the number of check bits r = 4. (Note that these values satisfy the inequality derived on the mid-term regarding the number of check bits needed for 1-bit error correction.) Further, consider that we will number the bits of a codeword starting at 1, not 0. So, we want to send the message '0110100'. We lay out those bits at the non-power-of-two indices in the codeword:
    bits:      _   _   0   _   1   1   0   _   1   0   0
    indices:   1   2   3   4   5   6   7   8   9  10  11
Now we want to fill in the check bits. For which message bits is each check bit responsible? Consider a check bit whose index is 2^i: That check bit is really a parity bit for all message bits whose indices break down to include an additive factor of 2^i. In other words, let's lay out the additive factors for each index number here:
    index   factors
    -----   -------
      1     2^0
      2     2^1
      3     2^1 + 2^0
      4     2^2
      5     2^2 + 2^0
      6     2^2 + 2^1
      7     2^2 + 2^1 + 2^0
      8     2^3
      9     2^3 + 2^0
     10     2^3 + 2^1
     11     2^3 + 2^1 + 2^0
So, the check bit at index 1 = 2^0 is a parity bit for all message bits whose indices include 2^0 as an additive factor. In this case, that means the odd-indexed message bits: 3, 5, 7, 9, and 11. The check bit at index 2 = 2^1 is a parity bit for message bits at indices 3, 6, 7, 10, and 11. The bit at index 4 is responsible for bits 5, 6, and 7. The bit at index 8 is responsible for bits 9, 10, and 11.
The final Hamming codeword would therefore be:
    bits:      0   1   0   0   1   1   0   1   1   0   0
    indices:   1   2   3   4   5   6   7   8   9  10  11
Notice that no check bit is responsible for another check bit: Its only additive factor is itself. Also notice how errors can now be caught: Take any bit and flip it. Then verify the parity of the check bits. For each check bit that comes out incorrectly, write down its index value. When all of the check bits have been tested, sum up the indices of those check bits whose parity was incorrect. That sum will be the index of the bit that was flipped!
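If you want to check your own by-hand work, here is a short Python sketch (my own code, using even parity as in the example above) that builds the codeword and computes the syndrome just as described:

    def hamming_encode(message_bits, total_len=11):
        """Place message bits at non-power-of-two positions, then fill in the check bits."""
        code = [0] * (total_len + 1)                  # index 0 unused; positions 1..total_len
        bits = iter(message_bits)
        for pos in range(1, total_len + 1):
            if pos & (pos - 1) != 0:                  # not a power of two: a message bit
                code[pos] = next(bits)
        for i in range(total_len.bit_length()):
            check = 1 << i                            # check-bit position 2^i
            covered = [p for p in range(1, total_len + 1) if p & check and p != check]
            code[check] = sum(code[p] for p in covered) % 2     # even parity
        return code[1:]

    def syndrome(codeword):
        """Sum of the positions of the check bits whose parity comes out wrong."""
        n = len(codeword)
        s = 0
        for i in range(n.bit_length()):
            check = 1 << i
            if sum(codeword[p - 1] for p in range(1, n + 1) if p & check) % 2 == 1:
                s += check
        return s            # 0 means no single-bit error; otherwise the flipped position

    codeword = hamming_encode([0, 1, 1, 0, 1, 0, 0])
    print(codeword)              # [0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0], matching the table above
    codeword[6] ^= 1             # flip the bit at position 7
    print(syndrome(codeword))    # 7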
Q: Assume I have a generator function G(x) defined as 1101 and a 7-bit message M(x): 1010110. I'm trying to divide M(x) by G(x) and my question is, where do I "start" the division? That is, does G(x) 1101 "go into" the first 4 bits of M(x), 1010, since all we're doing is XORing, or does it truly have to "go into" such that I have to put G(x) into the first 5 bits of M(x) and then XOR?
A: In this context, for A to ``go into'' B requires only that B have as many bits (ignoring leading 0's) as A does. So, in your example 1101 does go into 1010, as they both have 4 bits. 1101 would *not*, however, go into 0110, as the latter has only 3 bits.
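To make the mechanics concrete, here is a small Python sketch (my own illustration) of the XOR long division: whenever the leading bit of the current chunk is a 1, the generator ``goes into'' it and gets XORed in; when the leading bit is a 0, it does not.

    def mod2_divide(message, generator):
        """Bitwise (modulo-2) long division on bit strings; returns (quotient, remainder)."""
        g = [int(b) for b in generator]
        work = [int(b) for b in message]
        quotient = []
        for i in range(len(work) - len(g) + 1):
            if work[i] == 1:                       # generator goes into this chunk
                quotient.append(1)
                for j, gb in enumerate(g):         # subtraction in this arithmetic is XOR
                    work[i + j] ^= gb
            else:                                  # leading 0: it does not go in
                quotient.append(0)
        remainder = work[len(work) - (len(g) - 1):]
        return ''.join(map(str, quotient)), ''.join(map(str, remainder))

    print(mod2_divide('1010110', '1101'))          # ('1101', '111')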