PARALLEL DATA LAB

The TIP Application Suite

MCHF Davidson Algorithm

The Multi-Configuration Hartree-Fock (MCHF) package is a suite of computational-physics programs, obtained from Vanderbilt University, where it is used for atomic-physics calculations. The Davidson algorithm is an element of the suite that computes, by successive refinement, the extreme eigenvalue-eigenvector pairs of a large, sparse, real, symmetric matrix stored on disk. In our test, this matrix is 16.3 MB.

The Davidson algorithm iteratively improves its estimate of the extreme eigenpairs by computing the extreme eigenpairs of a much smaller, derived matrix. Each iteration computes a new derived matrix by a matrix-vector multiplication involving the large, on-disk matrix. Thus, the algorithm repeatedly accesses the same large file sequentially. Annotating this code to give hints was straightforward. At the start of each iteration, the Davidson algorithm discloses the whole-file, sequential read anticipated in the next iteration.
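
A minimal sketch of that annotation, in C, appears below. The tip_disclose() call is a hypothetical stand-in for TIP's hint interface, and the numerical work of the iteration is elided; only the structure of the disclosure is intended to be faithful.

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for TIP's disclosure interface: hint that the
 * byte range [offset, offset + length) of fd will be read soon. */
static void tip_disclose(int fd, off_t offset, off_t length)
{
    (void)fd; (void)offset; (void)length;   /* no-op in this sketch */
}

/* Each Davidson iteration re-reads the large on-disk matrix sequentially,
 * so the whole-file read of the coming iteration is disclosed up front. */
void davidson(const char *matrix_path, int max_iterations)
{
    int fd = open(matrix_path, O_RDONLY);
    struct stat st;
    char buf[64 * 1024];

    if (fd < 0 || fstat(fd, &st) < 0)
        return;

    for (int iter = 0; iter < max_iterations; iter++) {
        /* Disclose the whole-file, sequential read anticipated in
         * the coming iteration. */
        tip_disclose(fd, 0, st.st_size);

        lseek(fd, 0, SEEK_SET);
        for (ssize_t n; (n = read(fd, buf, sizeof buf)) > 0; ) {
            /* ... accumulate the matrix-vector product that builds the
             *     smaller, derived matrix ... */
        }

        /* ... solve the derived eigenproblem and test for convergence ... */
    }
    close(fd);
}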

XDataSlice

XDataSlice (XDS) is an interactive scientific visualization tool developed at the National Center for Supercomputing Applications at the University of Illinois. Among other features, XDS lets scientists view arbitrary planar slices through their 3-dimensional data with a false color mapping. The datasets may originate from a broad range of applications such as airflow simulations, pollution modelling, or magnetic resonance imaging, and tend to be very large.

It is often assumed that because disks are so slow, good performance is possible only when data is in main memory. Thus, many applications, including XDS, require that the entire dataset reside in memory. Because memory is still expensive, the amount available often constrains scientists who would like to work with higher-resolution images and therefore larger datasets. Informed prefetching invalidates the slow-disk assumption and makes out-of-core computing practical, even for interactive applications. To demonstrate this, we added an out-of-core capability to XDS.

To render a slice through an in-core dataset, XDS iteratively determines which data point maps to the next pixel, reads the datum from memory, applies false coloring, and writes the pixel into the output pixel array. To render a slice from an out-of-core dataset, XDS splits this loop in two. Both to manage its internal cache and to generate hints, XDS first maps all of the pixels to data-point coordinates and stores the mappings in an array. Having determined which data blocks will be needed to render the current slice, XDS ejects unneeded blocks from its cache, gives hints to TIP, and reads the needed blocks from disk. In the second half of the split loop, XDS reads the cached pixel mappings, reads the corresponding data from the cached blocks, and applies the false coloring.
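
In outline, the split loop might look like the following sketch. The cache-management and coordinate-mapping routines, and the tip_disclose_blocks() call, are hypothetical placeholders for XDS internals and for TIP's hint interface; only the two-phase structure is intended to be faithful.

#include <stdlib.h>

/* Hypothetical placeholders for XDS internals and the TIP hint call;
 * the names and signatures below are illustrative only. */
typedef struct { int x, y, z; } Coord;

extern Coord map_pixel_to_datum(int pixel);           /* slice geometry    */
extern int   coord_to_block(Coord c);                 /* 16x16x8 blocking  */
extern int   block_cached(int block);                 /* internal cache    */
extern void  evict_unneeded_blocks(const int *needed, int n);
extern void  tip_disclose_blocks(const int *blocks, int n);
extern void  read_block_into_cache(int block);
extern float datum_from_cache(Coord c);
extern void  write_pixel(int pixel, float value);     /* false coloring    */

void render_out_of_core_slice(int npixels)
{
    Coord *mapping = malloc(npixels * sizeof *mapping);
    int   *needed  = malloc(npixels * sizeof *needed);
    int    nneeded = 0;

    /* First half: map every pixel to a data-point coordinate, collect the
     * blocks the slice touches, eject what is no longer needed, disclose
     * the rest to TIP, and fetch the blocks that are not already cached. */
    for (int p = 0; p < npixels; p++) {
        mapping[p] = map_pixel_to_datum(p);
        int b = coord_to_block(mapping[p]);
        if (nneeded == 0 || needed[nneeded - 1] != b)
            needed[nneeded++] = b;            /* crude de-duplication */
    }

    evict_unneeded_blocks(needed, nneeded);
    tip_disclose_blocks(needed, nneeded);
    for (int i = 0; i < nneeded; i++)
        if (!block_cached(needed[i]))
            read_block_into_cache(needed[i]);

    /* Second half: read back the cached pixel mappings and color each
     * pixel from the now-cached data blocks. */
    for (int p = 0; p < npixels; p++)
        write_pixel(p, datum_from_cache(mapping[p]));

    free(mapping);
    free(needed);
}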

Our test dataset consists of 512³ 32-bit floating point values requiring 512 MB of disk storage. The dataset is organized into 8 KB blocks of 16x16x8 data points and is stored on the disk in Z-major order. Our test renders 25 random slices through the dataset.
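
For concreteness, the mapping from a data point to the byte offset of its 8 KB block might look like the sketch below. It assumes the 512³ geometry and 16x16x8 blocking described above, and it takes "Z-major" to mean that the Z block index varies slowest; that ordering, like the function itself, is an illustration rather than a description of XDS internals.

#include <sys/types.h>

/* Dataset geometry from the text: a 512x512x512 cube of 32-bit floats,
 * blocked into 8 KB bricks of 16x16x8 data points. */
enum {
    NX = 512, NY = 512, NZ = 512,
    BX = 16,  BY = 16,  BZ = 8,
    BLOCK_BYTES = BX * BY * BZ * 4          /* 8 KB per block */
};

/* Byte offset of the block containing data point (x, y, z), assuming the
 * Z block index varies slowest and the X block index fastest. */
off_t block_offset(int x, int y, int z)
{
    long bx = x / BX, by = y / BY, bz = z / BZ;
    long blocks_x = NX / BX;                /* 32 blocks along X */
    long blocks_y = NY / BY;                /* 32 blocks along Y */
    long block_index = (bz * blocks_y + by) * blocks_x + bx;
    return (off_t)block_index * BLOCK_BYTES;
}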

Sphinx

Sphinx is a high-quality, speaker-independent, continuous-voice speech-recognition system. In our experiments, Sphinx recognizes an 18-second recording commonly used in Sphinx regression testing.

Sphinx represents acoustics with Hidden Markov Models and uses a Viterbi beam search to prune unpromising word combinations from these models. To achieve higher accuracy, Sphinx uses a language model to effect a second level of pruning. The language model is a table of the conditional probability of word-pairs and word-triples. At the end of each 10 ms acoustical frame, the second-level pruner is presented with the words likely to have ended in that frame. For each of these potential words, the probability of it being recognized is conditioned by the probability of it occurring in a triple with the two most recently recognized words, or occurring in a pair with the most recently recognized word when there is no entry in the language model for the current triple. To further improve accuracy, Sphinx makes three similar passes through the search data structure, each time restricting the language model based on the results of the previous pass.
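
In outline, conditioning a candidate word on its two predecessors reduces to a lookup with fallback, roughly as sketched below; the table-lookup functions are hypothetical placeholders for the on-disk language model, and the probability handling is simplified.

#include <stdbool.h>

/* Hypothetical placeholders for the language-model tables; the real Sphinx
 * data structures and probability representation differ. */
extern bool trigram_prob(int w1, int w2, int w3, double *prob);
extern bool bigram_prob(int w2, int w3, double *prob);

/* Condition candidate word w3 on the two most recently recognized words:
 * use the triple (w1, w2, w3) if the language model has an entry for it,
 * otherwise fall back to the pair (w2, w3). */
double language_model_score(int w1, int w2, int w3)
{
    double p;
    if (trigram_prob(w1, w2, w3, &p))
        return p;
    if (bigram_prob(w2, w3, &p))
        return p;
    return 0.0;     /* no entry at all: effectively prunes the candidate */
}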

Sphinx, like XDS, came to us as an in-core-only system. Because it is commonly used with a 60,000-word dictionary, its language model is several hundred megabytes in size; together with Sphinx's internal caches and search data structures, this causes virtual-memory paging even on a machine with 512 MB of memory. We modified Sphinx to fetch the language model's word-pairs and word-triples from disk as needed. With this change, Sphinx with TIP runs on our 128 MB test machine 90% as fast as on a 512 MB machine.

We additionally modified Sphinx to disclose the word-pairs and word-triples that will be needed to evaluate each of the potential words offered at the end of each frame. Because the language model is sparsely populated, at the end of each frame there are about 100 byte ranges that must be consulted, of which all but a few are already in Sphinx's internal cache. However, the number of pairs and triples consulted and fetched varies widely from frame to frame, so storage parallelism is often employed.
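
The disclosure added at the end of each frame might look roughly like the sketch below; the range-computation and cache-lookup helpers are hypothetical, and tip_disclose_ranges() stands in for the actual hint call.

#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

typedef struct { off_t offset; size_t length; } ByteRange;

/* Hypothetical placeholders: where a word's pair/triple entries live in the
 * on-disk language model, whether they are already in Sphinx's internal
 * cache, and the TIP hint call itself. */
extern int  ranges_for_candidate(int w1, int w2, int w3, ByteRange *out);
extern bool range_cached(ByteRange r);
extern void tip_disclose_ranges(int lm_fd, const ByteRange *r, int n);

/* At the end of each 10 ms frame, disclose the language-model byte ranges
 * needed to score the candidate words, skipping ranges already cached. */
void disclose_frame_lookups(int lm_fd, int w1, int w2,
                            const int *candidates, int ncandidates)
{
    ByteRange hints[4096];
    int nhints = 0;

    for (int i = 0; i < ncandidates; i++) {
        ByteRange r[8];
        int n = ranges_for_candidate(w1, w2, candidates[i], r);
        for (int j = 0; j < n; j++)
            if (!range_cached(r[j]) && nhints < 4096)
                hints[nhints++] = r[j];
    }

    if (nhints > 0)
        tip_disclose_ranges(lm_fd, hints, nhints);
}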

Agrep

Agrep, a variant of grep, was written by Wu and Manber at the University of Arizona. It is a full-text pattern matching program that allows errors. Invoked in its simplest form, it opens the files specified on its command line one at a time, in argument order, and reads each sequentially.

Since the arguments to Agrep completely determine the files it will access, Agrep can issue hints for all accesses upon invocation. Agrep simply loops through the argument list and informs the file system of the files it will read. When searching data collections such as software header files or mail messages, hints from Agrep frequently specify hundreds of files too small to benefit from history-based readahead. In such cases, informed prefetching has the advantage of being able to prefetch across files and not just within a single file.
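
Because the command line fully determines the access pattern, the added hint code can be as simple as the sketch below; tip_disclose_file() is a hypothetical stand-in for TIP's whole-file disclosure, and the assumed argument layout (pattern first, then files) is that of Agrep's simplest invocation.

#include <stdio.h>

/* Hypothetical stand-in for TIP's whole-file disclosure call. */
static void tip_disclose_file(const char *path)
{
    (void)path;   /* no-op in this sketch */
}

int main(int argc, char **argv)
{
    /* Assumed layout: argv[1] is the search pattern; argv[2..] are the
     * files to search.  Disclose every file, in argument order, before
     * searching any of them. */
    for (int i = 2; i < argc; i++)
        tip_disclose_file(argv[i]);

    /* ... open and search each file sequentially, in the same order ... */
    return 0;
}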

In our benchmark, Agrep searches 1349 kernel source files occupying 2922 disk blocks for a simple string that does not occur in any of the files.

Gnuld

Gnuld version 2.5.2 is the Free Software Foundation's object-code linker; it supports ECOFF, the default object file format under OSF/1. Gnuld performs many passes over the input object files to produce the linked executable. In the first pass, Gnuld reads each file's primary header, a secondary header, and its symbol and string tables. Hints for the primary-header reads are easily given by replicating the loop that opens input files. The read of the secondary header, whose location is data dependent, is not hinted; its contents provide the location and size of the symbol and string tables for that file. A loop-splitting technique similar to that in XDataSlice is used to hint the symbol- and string-table reads. After verifying that it has all the data needed to produce a fully linked executable, Gnuld makes a pass over the object files to read and process debugging symbol information. This involves up to nine small, non-sequential reads from each file. Fortunately, the previously read symbol tables determine the addresses of these accesses, so Gnuld loops through these tables to generate hints for its second pass.
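
A rough sketch of the first-pass hinting follows: a loop that mirrors the file-opening loop discloses the primary-header reads, the secondary headers are read unhinted, and a second, split loop then discloses the symbol- and string-table reads. The structures, sizes, and the tip_disclose() call are illustrative, not Gnuld's actual code.

#include <sys/types.h>

/* Illustrative stand-ins; the real Gnuld data structures and TIP call differ. */
typedef struct {
    int    fd;
    off_t  symtab_off, symtab_len;     /* filled in from the secondary header */
    off_t  strtab_off, strtab_len;
} InputFile;

extern void tip_disclose(int fd, off_t offset, off_t length);
extern void read_secondary_header(InputFile *f);   /* data-dependent, unhinted */

enum { PRIMARY_HDR_LEN = 96 };   /* illustrative size only */

void hint_first_pass(InputFile *files, int nfiles)
{
    /* Replicate the loop that opens the input files: every primary-header
     * read is known in advance, so disclose them all up front. */
    for (int i = 0; i < nfiles; i++)
        tip_disclose(files[i].fd, 0, PRIMARY_HDR_LEN);

    /* The secondary header's location is data dependent, so it is read
     * without a hint; its contents locate the symbol and string tables. */
    for (int i = 0; i < nfiles; i++)
        read_secondary_header(&files[i]);

    /* Split loop, as in XDataSlice: now that the table locations are known,
     * disclose the symbol- and string-table reads before performing them. */
    for (int i = 0; i < nfiles; i++) {
        tip_disclose(files[i].fd, files[i].symtab_off, files[i].symtab_len);
        tip_disclose(files[i].fd, files[i].strtab_off, files[i].strtab_len);
    }
}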

During its second pass, Gnuld constructs up to five shuffle lists, which specify where in the executable file object-file debugging information should be copied. When the second pass completes, Gnuld finalizes the link order of the input files, and thus the organization of the non-debugging ECOFF segments in the executable file. Gnuld uses this order information and the shuffle lists to give hints for the final passes.
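
Once the link order is fixed, generating the final-pass hints amounts to walking the shuffle lists, roughly as below; the structures are illustrative rather than Gnuld's own, and tip_disclose() again stands in for the real hint call.

#include <sys/types.h>

/* Illustrative structures only.  Each shuffle-list entry records a piece of
 * an input object file (fd, offset, length) and where it will be copied in
 * the output executable; for hinting, only the read side matters. */
typedef struct {
    int   src_fd;
    off_t src_offset;
    off_t length;
    off_t dst_offset;        /* position in the executable (not hinted) */
} ShuffleEntry;

typedef struct {
    ShuffleEntry *entries;
    int           count;
} ShuffleList;

extern void tip_disclose(int fd, off_t offset, off_t length);

/* Disclose, in final link order, every object-file read the remaining
 * passes will perform. */
void hint_final_passes(ShuffleList *lists, int nlists)
{
    for (int i = 0; i < nlists; i++)
        for (int j = 0; j < lists[i].count; j++)
            tip_disclose(lists[i].entries[j].src_fd,
                         lists[i].entries[j].src_offset,
                         lists[i].entries[j].length);
}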

Our test links the 562 object files of our TIP kernel. These object files comprise approximately 64 MB and produce an 8.8 MB kernel.

Postgres

Postgres version 4.2 is an extensible, object-oriented relational database system from the University of California at Berkeley. In our test, Postgres executes a join of two relations. The outer relation contains 20,000 unindexed tuples (3.2 MB) while the inner relation has 200,000 tuples (32 MB) and is indexed (5 MB). We run two cases. In the first, 20% of the outer relation tuples find a match in the inner relation. In the second, 80% find a match. One output tuple is written sequentially for every tuple match.

To perform the join, Postgres reads the outer relation sequentially. For each outer tuple, Postgres checks the inner relation's index for a matching inner tuple and, if there is one, reads that tuple from the inner relation. From the perspective of storage, accesses to the inner relation and its index are random, defeating sequential readahead, and have poor locality, defeating caching. Thus, most of these inner-relation accesses incur the full latency of a disk read.

To disclose these inner-relation accesses, we employ a loop-splitting technique similar to that used in XDS. In the precomputation phase, Postgres reads the outer relation (disclosing its sequential access), looks up each outer-relation tuple address in the index (unhinted), and stores the addresses in an array. Postgres then discloses these precomputed block addresses to TIP. In the second pass, Postgres rereads the outer relation but skips the index lookup and instead directly reads the inner-relation tuple whose address is stored in the array.
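
The split join loop might be sketched as follows; the relation-access helpers and the two disclosure calls are hypothetical placeholders for the Postgres access methods and for TIP's hint interface.

#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical placeholders; the real Postgres access methods and the TIP
 * hint interface differ in detail. */
typedef struct Tuple Tuple;

extern Tuple *next_outer_tuple(void);                    /* sequential scan */
extern void   rewind_outer_relation(void);
extern bool   index_lookup(const Tuple *outer, long *inner_block);
extern Tuple *read_inner_tuple(long inner_block);
extern void   emit_joined_tuple(const Tuple *outer, const Tuple *inner);
extern void   tip_disclose_sequential_outer(void);
extern void   tip_disclose_blocks(const long *blocks, int n);

void hinted_nested_loop_join(int nouter)
{
    long *blocks  = malloc(nouter * sizeof *blocks);
    bool *matched = malloc(nouter * sizeof *matched);

    /* Precomputation pass: scan the outer relation (disclosed as sequential),
     * probe the index without hints, and remember the matching inner blocks. */
    tip_disclose_sequential_outer();
    for (int i = 0; i < nouter; i++) {
        Tuple *outer = next_outer_tuple();
        matched[i] = index_lookup(outer, &blocks[i]);
    }

    /* Disclose every inner-relation block the join will read, in order. */
    int nhinted = 0;
    for (int i = 0; i < nouter; i++)
        if (matched[i])
            blocks[nhinted++] = blocks[i];
    tip_disclose_blocks(blocks, nhinted);

    /* Second pass: re-scan the outer relation, skip the index lookups, and
     * read the inner tuples directly from the remembered block addresses. */
    rewind_outer_relation();
    int j = 0;
    for (int i = 0; i < nouter; i++) {
        Tuple *outer = next_outer_tuple();
        if (matched[i])
            emit_joined_tuple(outer, read_inner_tuple(blocks[j++]));
    }

    free(blocks);
    free(matched);
}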

The I/O Bottleneck

Disk arrays eliminate the I/O bottleneck, right? Wrong!

Many applications have serial I/O workloads that don't benefit from a disk array any more than single-threaded applications benefit from a parallel processor. Read latency dominates I/O performance for such serial I/O workloads, and disk arrays don't reduce latency. How can we help applications leverage disk array parallelism for low access latency?

In a larger context, the growth of distributed file systems, wide-area networks, and, yes, the Web has moved users farther from their data and added latency to data accesses. How can we help applications take full advantage of the available network bandwidth to minimize latency?

Informed Resource Management: Disclosure Hints and Cost-Benefit Analysis

We propose that applications should issue hints which disclose their future I/O accesses. Prefetching aggressively based on application disclosures could do more harm than good if it caused valuable pages to be prematurely evicted from the cache. Therefore, we need to determine when cache buffers should be used to hold prefetched data instead of data for reuse. To address this issue, we developed a framework for resource management based on cost-benefit analysis. It uses a system performance model to estimate the benefit of using a buffer for prefetching and the cost of taking a buffer from the cache. We implemented a system that computes these estimates dynamically and reallocates a buffer from the cache for prefetching when the benefit is greater than the cost. Look here for more information about TIP, our informed prefetching and caching system.
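
At its core, the allocation decision can be pictured as the comparison sketched below. The estimator functions are placeholders: in TIP they are derived from the system performance model, which is not reproduced here.

/* Hypothetical placeholders for the dynamic estimates: the expected
 * reduction in I/O stall from devoting one more buffer to hinted
 * prefetching, and the expected cost of taking the least valuable buffer
 * away from the cache.  The real TIP estimators come from the system
 * performance model. */
extern double benefit_of_prefetch_buffer(void);
extern double cost_of_taking_cache_buffer(void);
extern void   reallocate_one_buffer_to_prefetching(void);

/* Reallocate buffers from caching to prefetching only while the estimated
 * benefit exceeds the estimated cost. */
void balance_cache_and_prefetch(void)
{
    while (benefit_of_prefetch_buffer() > cost_of_taking_cache_buffer())
        reallocate_one_buffer_to_prefetching();
}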

Integrating Disk Management into the Cost-Benefit Analysis

The cost-benefit analysis depends on accurate estimators of the benefit of initiating an I/O, and the cost of evicting data from a buffer. We have developed a set of estimators that take into account the layout of data on the disks, the current state of the buffer cache, and the per-process upcoming I/O load (determined by hints if available, or by recent activity levels otherwise). These new estimators prefetch and cache more aggressively for disks that will be overloaded in the future, and more conservatively for disks whose bandwidth is sufficient to meet all demands. The resulting algorithm is called TIPTOE: TIP with Temporal Overload Estimators.
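
As an illustrative caricature only (the actual TIPTOE estimators follow from the performance model and are not reproduced here), the load-sensitive behavior amounts to something like the following:

/* Caricature of load-sensitive prefetch depth: disks predicted to be
 * overloaded by the upcoming (hinted or recently observed) workload are
 * prefetched for more aggressively; disks with bandwidth to spare are
 * handled conservatively.  The thresholds and multipliers are
 * illustrative, not TIPTOE's estimators. */
int prefetch_depth_for_disk(double predicted_utilization, int base_depth)
{
    if (predicted_utilization > 0.8)        /* likely overloaded */
        return base_depth * 4;
    if (predicted_utilization > 0.5)
        return base_depth * 2;
    return base_depth;                      /* bandwidth is sufficient */
}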

Informed Remote Prefetching

TIP evolved out of a desire to reduce read latency. When storage is behind a network interface (either a traditional networked file system or a NASD), there is even more latency for TIP to hide. We are investigating several variants of remote TIP: a client-only version that treats remote storage as if it were a disk with higher and potentially variable latency; a mostly-server version that runs the TIP system at the storage and attempts to ensure that all fetches from the client hit in the storage's cache; and a cooperative version that exploits intelligence at both client and server.

Automatic Hint Generation through Speculative Execution

The other half of the problem is figuring out how to modify applications so that they generate hints disclosing their future I/O accesses. To demonstrate the effectiveness of our system for informed resource management, we manually modified a suite of I/O-intensive applications to issue hints. Manual modification is not ideal, however, because it requires source code, and can require significant programming effort to ensure that hints are issued in a timely manner. Instead, we propose that a wide range of disk-bound applications could dynamically discover their own future data needs by opportunistically exploiting any unused processing cycles to perform speculative execution, an eager pre-execution of application code using the available, incomplete data state. Look here for more information about our speculative execution approach.
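
A very rough sketch of the shape of such a system: a speculative copy of the application code runs ahead whenever the real execution is stalled on I/O, and each read it would have performed is issued as a hint instead, with the speculation pressing on over whatever incomplete data it has. The wrapper and flag below are illustrative only; they are not the mechanism of the actual system.

#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for TIP's hint call. */
extern void tip_disclose(int fd, off_t offset, off_t length);

/* Set to true in the thread performing speculative pre-execution. */
static _Thread_local bool speculating = false;

/* Wrapper through which the (hypothetically transformed) application issues
 * all of its reads.  The real execution reads normally; the speculative
 * pre-execution discloses the access and continues without the data,
 * accepting that its state may be incomplete or wrong. */
ssize_t app_read(int fd, void *buf, size_t len)
{
    if (speculating) {
        tip_disclose(fd, lseek(fd, 0, SEEK_CUR), (off_t)len);
        return (ssize_t)len;            /* pretend the read succeeded */
    }
    return read(fd, buf, len);
}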