Here is a list of my publications, along with descriptions of each paper and links to downloadable copies. If you have questions or comments, please send me email. If you would like to find the traces, simulators, or other utilities used in the experiments described in these papers, visit my research page.
Thanks to the widespread adoption of languages such as Java and C#, garbage collection (GC) is broadly used and affects the performance of an increasing number of systems. Oddly, a GC must be assigned a static heap size at initialization time. This value is assigned without regard to the memory demands of the mutator and without awareness of the availability of main memory space. If main memory is abundant, the heap size may be too small, and excessive time may be spent on needless collection. Worse, if main memory is scarce, each full-heap collection, in an exhibition of horrendous locality, will cause heavy page swapping.
The heap size must be determined in response to the availability of main memory. In multiprogrammed systems, that availability changes dynamically, and so the heap size must do so as well. We present automatic heap sizing for garbage collectors.
Our system requires a modified virtual memory manager (VMM) within the operating system that monitors the reference patterns of both the mutator and the garbage collector. After each full-heap collection, the garbage collector obtains from the VMM the footprint of the process (that is, the space required to avoid significant paging) and the main memory space currently allocated to the process. The GC applies a linear function to this information to calculate a new heap size that will fit into the currently allocated space. We present simulations that demonstrate the efficacy of this approach.
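Purely as an illustration of the shape of such a rule (the paper specifies the exact relation it uses), here is a minimal sketch that assumes the footprint grows roughly linearly with heap size, estimates the non-heap component from the most recent collection, and solves for the largest heap whose predicted footprint fits the current allocation. The function name, parameters, and constants are hypothetical.

```python
def new_heap_size(footprint, heap_size, allocation, a=1.0,
                  min_heap=4 * 1024 * 1024):
    """footprint  -- space the process needs to avoid significant paging,
                     as reported by the modified VMM after a full collection
       heap_size  -- the heap size in effect during that collection
       allocation -- main memory currently granted to the process by the VMM
       a          -- assumed slope of the footprint-vs-heap-size relation"""
    # Estimate the non-heap component of the footprint from the most
    # recent observation: footprint ~= a * heap_size + b.
    b = footprint - a * heap_size
    # Invert the model: choose the heap whose predicted footprint just
    # fits the space currently allocated to the process.
    target = (allocation - b) / a
    return max(int(target), min_heap)
```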
A number of methods have been developed to collect reference traces, but most of them log the references of only a single process. Those that do log references from multiple processes do not also log file system operations and do not identify the use of shared space. In fact, previous work contains incorrect assertions about how to identify uses of shared space.
Laplace is a reference trace collector built on top of Bochs, the open-source x86 machine simulator. It logs not only virtual memory references at the simulated CPU level, but also many kernel-level events such as file system operations, memory mappings, and process fork, exit, and scheduling operations. This information is collected for all processes and the kernel. With this information, the Laplace post-processor can reconstruct the state of each process and kernel-level thread, identify all shared memory, and associate each reference with its thread.
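The sketch below is not Laplace's record format or post-processor; it is a deliberately simplified illustration of the attribution idea, ignoring separate address spaces, copy-on-write, and most other details. It shows how scheduling events identify the thread responsible for each reference and how mapping events let shared pages be flagged.

```python
from collections import defaultdict

PAGE_SHIFT = 12  # assume 4 KB pages

def attribute(events):
    """events: iterable of tuples such as
         ('sched', thread_id)                 -- thread scheduled onto the CPU
         ('map',   thread_id, start, length)  -- memory mapping established
         ('ref',   vaddr)                     -- virtual memory reference
    Yields (thread_id, page, shared) for each reference."""
    current = None                      # thread currently on the simulated CPU
    mappers = defaultdict(set)          # page -> threads that mapped it
    for ev in events:
        if ev[0] == 'sched':
            current = ev[1]
        elif ev[0] == 'map':
            _, tid, start, length = ev
            first = start >> PAGE_SHIFT
            last = (start + length - 1) >> PAGE_SHIFT
            for page in range(first, last + 1):
                mappers[page].add(tid)
        elif ev[0] == 'ref':
            page = ev[1] >> PAGE_SHIFT
            yield current, page, len(mappers[page]) > 1
```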
Previous reference trace collectors were frequently optimized to capture reference traces efficiently. However, we find that capturing each reference -- that is, interfering with normal execution to gain control at reference time -- is less important than handling the captured references -- that is, logging or processing the reference information. Even a slow reference trace collector produces so much data that the handling overhead, whether caused by writing to disk, compressing, or processing in an on-line simulation, is prohibitive. Any collector, like Laplace, that captures every virtual memory reference is bound to incur a slowdown factor of many hundreds.
A refined version of the EELRU page replacement algorithm. It includes new experimental results and proofs of EELRU properties.
Refined versions of the SAD algorithm, including the supplemental SAD2 algorithm. A proof of the optimality of OLR. Flexible uses of both algorithms for varying trace annotations, including tracking dirtiness and timestamps.
Because disk latencies are large but disk transfer rates are also large, demand prepaging -- fetching multiple pages at each page fault -- is an attractive idea. It is an idea that was evaluated decades ago, during the early stages of virtual memory (VM) research. The results of those studies were ambiguous: prepaging seemed to offer benefit only on larger memories and on systems with very small pages. Researchers observed that prepaging often did harm. When extra pages are fetched at a page fault, other resident pages must be evicted, and those evicted pages are often needed again soon after their eviction.
In spite of the ambiguous results, real systems seem to employ prepaging. Some do so in a conservative manner by allocating very little main memory space to prepaged pages, thus avoiding the harm that prepaging can do. Other systems treat prepaged pages as though they had been referenced, potentially polluting main memory with many prepaged pages.
This paper examines the problem of allocating main memory space for prepaged pages. We find that for some processes, allocating a modest fraction of main memory to prepaged pages can reduce paging by tens of percent. For other processes, allocating any space to prepaged pages only increases paging. We then propose a dynamically adaptive mechanism that, by observing references to both used and prepaged pages, selects on-line an appropriate fraction of memory for prepaged pages. This mechanism allows prepaging to obtain benefit where possible while preventing harm where prepaging is undesirable.
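As a rough sketch of the flavor of such an adaptive mechanism (under assumptions of our own, not the paper's exact policy): track how often prepaged pages are referenced before eviction versus evicted untouched, and nudge the fraction of memory reserved for prepaged pages up or down accordingly. All names and thresholds below are hypothetical.

```python
class PrepageAllocator:
    def __init__(self, frac=0.1, step=0.01, min_frac=0.0, max_frac=0.5):
        self.frac = frac          # fraction of memory reserved for prepaged pages
        self.step = step
        self.min_frac, self.max_frac = min_frac, max_frac
        self.prepage_hits = 0     # prepaged pages referenced before eviction
        self.prepage_wasted = 0   # prepaged pages evicted without any reference

    def record_hit(self):
        self.prepage_hits += 1

    def record_eviction_unused(self):
        self.prepage_wasted += 1

    def adapt(self):
        """Call periodically (e.g., every N faults) to adjust the allocation."""
        total = self.prepage_hits + self.prepage_wasted
        if total == 0:
            return self.frac
        hit_rate = self.prepage_hits / total
        # If prepaged pages are mostly used, they earn more space; if they are
        # mostly evicted untouched, give the space back to demand-fetched pages.
        if hit_rate > 0.5:
            self.frac = min(self.frac + self.step, self.max_frac)
        else:
            self.frac = max(self.frac - self.step, self.min_frac)
        self.prepage_hits = self.prepage_wasted = 0
        return self.frac
```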
This dissertation contains slightly expanded versions of the three papers listed below, as well as other associated material. In particular, it addresses the need for controlled experiments in VM systems through simulation. New methods of trace reduction for VM simulation are described, and empirical evidence of their efficacy is given. EELRU is described as a method of adaptive caching. Compressed caching is addressed as a beneficial caching strategy, and simulations demonstrate the reductions in paging time possible with this method.
In addition to the material from those papers, the dissertation includes a discussion of virtual memory time and timescale relativity. These concepts allow a system to decay information from past reference behavior at the rate dictated by virtual memory events. This form of decay is used throughout the ideas presented in the dissertation and is essential to their robust performance. These concepts also make it possible to view the problem of caching at the correct scale, and to understand how some policies are doomed to fail given phase changes and certain kinds of pathological reference behavior.
Finally, this dissertation provides a more in-depth presentation of the WK compression algorithms and the compression of in-memory data. Empirical evidence shows that very simple and strong regularities hold for in-memory data, making it quickly and effectively compressible.
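The dissertation describes the actual WK encoders; the fragment below is only a toy illustration of the word-level regularity they exploit. It models just the classification of 32-bit words against a small dictionary of recently seen words, not the packed output, and the dictionary size, hash, and bit split here are assumptions.

```python
DICT_SIZE = 16
LOW_BITS = 10          # assumed split between "upper" and "low" bits

def classify(words):
    """Class each word as a zero, an exact dictionary match, a partial
    match (upper bits equal, low bits differ), or a miss."""
    dictionary = [None] * DICT_SIZE
    tags = []
    for w in words:
        if w == 0:
            tags.append(('zero',))
            continue
        slot = (w >> LOW_BITS) % DICT_SIZE      # simple hash on the upper bits
        if dictionary[slot] == w:
            tags.append(('exact', slot))
        elif (dictionary[slot] is not None
              and dictionary[slot] >> LOW_BITS == w >> LOW_BITS):
            tags.append(('partial', slot, w & ((1 << LOW_BITS) - 1)))
        else:
            tags.append(('miss', w))
        dictionary[slot] = w
    return tags
```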
Compressed caching bridges the large and exponentially growing gap between RAM and disk speeds, effectively building a new level of memory hierarchy between uncompressed RAM and disk.
In this paper, we show that previous mixed results for compressed caching were largely due to an insufficiently large gap between CPU and disk speeds -- a gap that is now sufficiently large. Extensive, detailed simulations show that compressed caching should yield major improvements in paging performance now -- tens of percent -- and still more in the future, since processor speeds double relative to disk speeds every two or three years.
We also present novel algorithms for in-memory data compression and for the on-line adaptation of compressed cache size. Our compression algorithms exploit regularities common in in-memory data, as opposed to file data, and achieve excellent compression while being very fast. Our adaptive compressed cache sizing uses an online cost-benefit analysis to adjust the size of the compressed cache to suit the locality characteristics of a running program. Fast compression and proper adaptation significantly increase the effectiveness of compressed caching.
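As a schematic illustration of what an online cost/benefit test for compressed cache sizing can look like (the paper's actual analysis differs; the quantities and costs below are hypothetical), one can compare the disk faults a larger compressed region would absorb against the compression and decompression work imposed on references displaced from uncompressed memory.

```python
def worth_growing(avoided_disk_faults, displaced_hits,
                  compress_ms=0.05, decompress_ms=0.03, disk_fault_ms=8.0):
    """avoided_disk_faults -- recent references that hit in the compressed
                              region but would otherwise have gone to disk
       displaced_hits      -- recent references that would have hit in
                              uncompressed memory but now pay compression
                              and decompression because that region shrank"""
    benefit = avoided_disk_faults * (disk_fault_ms - decompress_ms)
    cost = displaced_hits * (compress_ms + decompress_ms)
    return benefit > cost
```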
LRU (or LRU-like) replacement is the nearly universal standard, because for 30 years it has seemed very hard to beat. Most actual replacement policies (e.g., Clock algorithms) attempt to approximate LRU, diverging from it primarily for implementation efficiency reasons. Many attempts to systematically beat LRU have largely failed, partly due to a basic lack of understanding of LRU's great strengths as well as its weaknesses. In this paper, we show how to beat LRU on average by emulating it in its good cases, and diverging radically from it in LRU's most common (and well-known) bad cases, which are very bad indeed.
Our EELRU algorithm adaptively approximates LRU or a modified MRU, as appropriate. It uses an online cost/benefit analysis to estimate costs and benefits of different eviction strategies, based on recent program behavior. Unlike many adaptive replacement policies, EELRU is well-behaved in two ways: (1) it keys off of features of program behavior that are directly relevant to caching decisions, at the appropriate timescale, and (2) it is competitive with LRU---it never does worse than LRU by more than a modest constant factor.
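To make the flavor of this cost/benefit analysis concrete, here is an illustrative sketch, much simplified from the published algorithm. Given a histogram of recent references by LRU recency position, a memory of M pages, and candidate early and late eviction points e < M < L, it compares the hits strict LRU would earn in positions e through M-1 against the misses beyond M that early eviction could avoid. The function and its arguments are our own simplification, not EELRU's actual bookkeeping.

```python
def prefer_early_eviction(hist, M, e, L):
    """hist -- counts of recent references, indexed by LRU recency position
       M    -- main memory size in pages, with e < M < L
       Returns True if evicting from position e (so that pages with recency
       up to L can be retained) is expected to beat strict LRU eviction."""
    sacrificed = sum(hist[e:M])   # hits LRU would get but early eviction gives up
    gained = sum(hist[M:L])       # misses LRU takes that early eviction can avoid
    return gained > sacrificed
```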
We show through simulation that EELRU works as it was designed to---it detects and exploits the same regularities LRU does, and others besides, yielding significant miss rate reductions. More importantly, it illustrates constraints on the behavior of any robust replacement algorithm: aggregate behavior of large sets of pages is often more important than distinguishing between the behaviors of individual pages, as some attempts at "smarter" replacement do, and attending to these large-scale regularities is typically sufficient for excellent performance. Furthermore, an online cost/benefit analysis can be simple and effective if it attends only to events at the timescale that matters to the decisions it must make.
There is a well-known problem with reference traces: size. These traces can grow to gigabytes for just a few seconds of program execution. Merely storing a reference trace can be difficult, and traces are rarely shared because transferring them is formidable. Worse, virtual memory experiments are limited in the number of variables and the range of those variables, as the time required to process a trace in simulation can be long. Reducing the trace size with a lossless compression algorithm is not enough; rather, lossy methods that discard reference information not needed for accurate simulations are required.
Most such lossy reduction methods are designed for hardware cache simulation and depend on sampling techniques. However, the simulation of virtual memory performance has become very important, specifically the management of main memory (RAM) and the avoidance of expensive accesses to the backing store (disk). The methods for lossy trace reduction that are appropriate for virtual memory simulation are old and discard more information than they should.
In this paper, we present Safely Allowed Drop (SAD) and Optimal LRU Reduction (OLR). These two methods allow the user to choose the smallest LRU memory to be simulated. The resulting reduced trace is then guaranteed not to affect the exact sequence of fetches and evictions for any LRU memory as large as or larger than the size chosen for reduction. (SAD additionally guarantees the exact sequence of fetches and evictions for the offline, optimal paging policy, OPT.) In other words, these reduction policies introduce absolutely no error for LRU (and, for SAD, OPT) simulations, provided the simulated memory is at least as large as the memory size chosen at reduction time.
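To make the guarantee concrete, a small checker like the following (our own sketch, not code from the paper) simulates strict LRU at a chosen memory size on both the original and the reduced trace and verifies that the two sequences of fetches and evictions are identical; for SAD and OLR this should hold for every memory size at least as large as the reduction size.

```python
from collections import OrderedDict

def lru_events(trace, mem_size):
    """Return the sequence of (fetched_page, evicted_page_or_None) events
    produced by strict LRU with mem_size page frames."""
    resident = OrderedDict()            # pages ordered from LRU to MRU
    events = []
    for page in trace:
        if page in resident:
            resident.move_to_end(page)  # hit: just update recency
            continue
        evicted = None
        if len(resident) >= mem_size:
            evicted, _ = resident.popitem(last=False)
        resident[page] = True
        events.append((page, evicted))
    return events

def same_lru_behavior(original, reduced, mem_size):
    return lru_events(original, mem_size) == lru_events(reduced, mem_size)
```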
Moreover, this paper demonstrates that most real policies that do not directly employ LRU are nonetheless very similar to LRU. We empirically demonstrate that if the simulated memory size is at least five times the reduction memory size, common policies such as clock and segmented queue, which approximate LRU, suffer little error -- less than 6% over 15 real reference traces, and for most memory sizes less than 1%. These error values compare favorably with those of Stack Deletion, the most commonly used and cited trace reduction method for virtual memory simulations.
Both algorithms can be used online, during trace collection time, such that the original reference trace is never stored. Their space overheads are not significant. Implementations are freely available from our trace reduction page.