Many applications have serial I/O workloads that don't benefit from a disk array any more than single-threaded applications benefit from a parallel processor. Read latency dominates I/O performance for such serial I/O workloads, and disk arrays don't reduce latency. How can we help applications leverage disk array parallelism for low access latency?
In a larger context, the growth of distributed file systems, wide-area networks, and, yes, the Web has moved users farther from their data and added latency to data accesses. How can we help applications take full advantage of the available network bandwidth to minimize latency?
We propose that applications should issue hints that disclose their future I/O accesses. However, prefetching aggressively based on application disclosures could do more harm than good if it caused valuable pages to be prematurely evicted from the cache. Therefore, we need to determine when cache buffers should be used to hold prefetched data rather than data for reuse. To address this issue, we developed a framework for resource management based on cost-benefit analysis. It uses a system performance model to estimate the benefit of using a buffer for prefetching and the cost of taking a buffer from the cache. We implemented a system that computes these estimates dynamically and reallocates a buffer from the cache to prefetching when the benefit is greater than the cost. This system is TIP, our informed prefetching and caching system.
The cost-benefit analysis depends on accurate estimators of the benefit of initiating an I/O and of the cost of evicting data from a buffer. We
have developed a set of estimators that take into account the layout of
data on the disks, the current state of the buffer cache, and the per-process
upcoming I/O load (determined by hints if available, or by recent activity
levels otherwise). Guided by these estimators, the system prefetches and caches more aggressively for disks that will be overloaded in the future, and more conservatively for disks whose bandwidth is sufficient to meet all demands. The resulting
algorithm is called TIPTOE: TIP with Temporal Overload Estimators.
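The sketch below illustrates the flavor of a per-disk overload forecast: demand comes from hints when they are available and from recent activity otherwise, and disks whose expected demand exceeds their bandwidth are treated more aggressively. The struct, the load formula, and all of the figures are assumptions made for this example, not the TIPTOE estimators themselves.

    #include <stdio.h>

    #define NDISKS 4

    /* Per-disk forecast of upcoming demand, taken from hints when available
     * and from recent activity levels otherwise.  All figures are invented. */
    struct disk_forecast {
        double hinted_bytes;   /* demand disclosed by hints for the window  */
        double recent_bytes;   /* recent activity, used when no hints exist */
        double bandwidth;      /* deliverable bytes per second              */
        double window;         /* length of the forecast window, in seconds */
    };

    /* Expected utilization of the disk over the forecast window. */
    static double expected_load(const struct disk_forecast *d)
    {
        double demand = d->hinted_bytes > 0.0 ? d->hinted_bytes
                                              : d->recent_bytes;
        return demand / (d->bandwidth * d->window);
    }

    int main(void)
    {
        struct disk_forecast disks[NDISKS] = {
            { 80e6, 0.0, 20e6, 2.0 },  /* hinted demand exceeds bandwidth   */
            { 10e6, 0.0, 20e6, 2.0 },  /* hinted demand well within bandwidth */
            {  0.0, 5e6, 20e6, 2.0 },  /* no hints: fall back on recent activity */
            {  0.0, 0.0, 20e6, 2.0 },  /* idle */
        };

        for (int i = 0; i < NDISKS; i++) {
            double load = expected_load(&disks[i]);
            printf("disk %d: expected load %.2f -> %s prefetching\n",
                   i, load, load > 1.0 ? "aggressive" : "conservative");
        }
        return 0;
    }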
TIP evolved out of a desire to reduce read latency. When storage is behind a network interface (either a traditional networked file system or a NASD), there is even more latency for TIP to hide. We are investigating several variants of remote TIP: a client-only version that treats remote storage as if it were a disk with higher and potentially variable latency; a mostly-server version that runs the TIP system at the storage and attempts to ensure that all fetches from the client hit in the storage's cache; and a cooperative version that exploits intelligence at both client and server.
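A rough sketch of the client-only idea follows: remote storage is modeled as a disk with a larger (and possibly varying) access time, so the client simply prefetches deeper to hide it. The depth formula and the latency figures are illustrative assumptions, not measurements of our system.

    #include <stdio.h>

    /* Roughly how many blocks must be in flight to hide a given fetch
     * latency when the application consumes one block every consume_us
     * microseconds (a crude ceiling of the ratio). */
    static int prefetch_depth(double fetch_latency_us, double consume_us)
    {
        return (int)(fetch_latency_us / consume_us) + 1;
    }

    int main(void)
    {
        double local_disk_us = 10000.0;  /* assumed local disk access time      */
        double remote_us     = 45000.0;  /* assumed network + server fetch time */
        double consume_us    =  2000.0;  /* assumed per-block compute time      */

        printf("local prefetch depth:  %d blocks\n",
               prefetch_depth(local_disk_us, consume_us));
        printf("remote prefetch depth: %d blocks\n",
               prefetch_depth(remote_us, consume_us));
        return 0;
    }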
The other half of the problem is figuring out how to modify applications so that they generate hints disclosing their future I/O accesses. To demonstrate the effectiveness of our system for informed resource management, we manually modified a suite of I/O-intensive applications to issue hints. Manual modification is not ideal, however, because it requires source code and can require significant programming effort to ensure that hints are issued in a timely manner. Instead, we propose that a wide range of disk-bound applications could dynamically discover their own future data needs by opportunistically exploiting any unused processing cycles to perform speculative execution, an eager pre-execution of application code using the available, incomplete data state.
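The sketch below suggests the flavor of this approach: the same application code runs once in a speculative mode that discloses accesses as hints instead of performing them, and once in the normal mode that does the real reads. The give_hint function and the single-flag structure are hypothetical simplifications; the actual mechanism pre-executes a copy of the application against its incomplete data state whenever spare cycles are available.

    #include <stdio.h>
    #include <stdbool.h>

    /* Set while running the speculative copy of the application code. */
    static bool speculating = false;

    /* Hypothetical call that discloses a future access to the prefetching
     * system; the real hint interface differs. */
    static void give_hint(const char *file, long offset, long length)
    {
        printf("hint: %s offset %ld length %ld\n", file, offset, length);
    }

    /* One step of the application, shared by normal and speculative runs. */
    static void process_record(const char *file, long offset, long length)
    {
        if (speculating) {
            /* Speculative run: disclose the access instead of performing it. */
            give_hint(file, offset, length);
            return;
        }
        /* Normal run: read the data (now likely prefetched) and compute on it. */
        printf("read: %s offset %ld length %ld\n", file, offset, length);
    }

    int main(void)
    {
        /* Spare cycles: run ahead speculatively to discover future accesses. */
        speculating = true;
        for (long i = 0; i < 3; i++)
            process_record("data.db", i * 4096L, 4096L);

        /* The normal run then finds much of its data already on the way. */
        speculating = false;
        for (long i = 0; i < 3; i++)
            process_record("data.db", i * 4096L, 4096L);
        return 0;
    }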
Hugo Patterson
Andrew Tomkins
Dave Rochberg
Fay Chang
Nat Lanza
Jim Zelenka
Garth Gibson
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg, Datadog, Google, Honda, Intel Corporation, IBM, Jane Street, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.