by Greg Ganger & Joan Digney
(reprinted from the PDL Packet,
Fall 2001)
Independently, both device firmware and OS software engineers aggressively utilize their knowledge and resources to mitigate disk performance problems. At the same time, the disk firmware folks complain of short request queues, which frustrate them because they can do efficiency scheduling better than host software, while the file system people have given up on detailed data placement, focusing instead on just trying to use "large" requests to amortize positioning times over more data transfer. An overall goal of some recent PDL projects is to increase the cooperation between these two sets of engineers, significantly increasing the end-to-end performance and robustness of the system as a whole.
The fundamental problem is that the storage interface hides details from both sides and prevents communication. For example, storage devices can schedule requests to maximize efficiency, but host software tends not to expose much request concurrency because the device firmware does not know about host priorities and considers only efficiency. Likewise, host software can place data and thus affect request locality in a variety of ways, but currently does so with only a crude understanding of device strengths and weaknesses, because detailed device knowledge is not available.
All of these difficulties could be avoided by allowing the host software
and device firmware to exchange information. The host software knows
the relative importance of requests and has some ability to manipulate
the locations that are accessed. The device firmware knows what the
device hardware is capable of in general and what would be most efficient
at any given point. Thus, the host software knows what is important
and the device firmware knows what is fast. By exploring new storage
interfaces and algorithms for exchanging and exploiting the collection
of knowledge, and developing cooperation between devices and applications,
we hope to eliminate redundant, guess-based optimization. The result
would be storage systems that are simpler, faster, and more manageable.
Figure 1: Both systems allow the host OS to make simultaneous requests to the disk, but cooperative interfaces also allow the host to tell the disk what is important and what options are acceptable (e.g., read these 10 blocks in any order or write to one of these 3 places). With this information, the disk can better specialize its actions to host needs. Cooperative interfaces also allow the disk to tell the host OS about data layout and access patterns that will work particularly well or particularly badly. The host OS can then tune its policies to match storage device strengths and avoid weaknesses.
For the past 15 years or so, the most common storage interfaces (SCSI and IDE) have consisted mainly of the same simple read and write commands. This consistent high-level interface has enabled great portability, interoperability, and flexibility for storage devices and their vendors. In particular, the resulting flexibility has allowed such very different devices as disk drives, solid state stores, and caching RAID systems to all appear the same to host operating systems.
In the continuing struggle to keep storage devices from becoming a bottleneck to system performance and thus functionality, system designers have developed many mechanisms for both storage device firmware and host OSes. These mechanisms have almost exclusively been restricted to only one side of the storage interface or the other. In fact, evolution on the two sides of the storage interface has reached the point where each has little idea of or input on the detailed operation of the other. We believe that this separation, which once enabled great advances, is now hindering the development of more cooperative mechanisms that consist of functionality on both sides of the interface.
The goal of cooperation between host software and device firmware raises a number of questions: (1) what should change in the host to better match device characteristics? (2) what should change in device firmware to better match what the host views as important? (3) how should the storage interface be extended to make these changes possible and effective? (4) how much device-specific information can be obtained and used from outside the device? and (5) how much complexity and overhead is involved with extending disk firmware functionality?
Over the years, the host-level software that manages storage devices has lost touch with the details of device mechanics and firmware algorithms. Unlike with many other components, however, these details can have dramatic, order-of-magnitude effects on system performance. Identifying specific examples where the host-level software can change in relatively small ways to better match device characteristics will represent one important step towards actually realizing greater cooperation. One example that we are exploring is Track-Aligned Extents. Several disk drive advances in recent years have conspired to make the track a sweet spot in terms of access unit, but it only works when accesses are aligned on and sized to track boundaries. Accomplishing track-aligned extents in file systems will require several changes, including techniques for identifying the boundaries and file system support for variable sized allocation units. Our upcoming paper on this topic [Schindler01] shows that track-aligned extents can provide significant performance and predictability benefits for large file and video applications.
Aggressive cache management and request scheduling policies available in today's systems, which typically go largely unused, could be active participants in scheduling if the firmware could differentiate between efficiency and system priorities. With minor extensions to the current storage interface, it should be possible to convey simple priority information to the device firmware. Freeblock Scheduling is one mechanism being explored for using this information in the firmware. By accurately predicting the rotational latencies of high-priority requests, it becomes possible to make progress on background activity with little or no impact on foreground request access times. Preliminary results [Lumb00] indicate that this can increase media bandwidth utilization by an order of magnitude and provide significant end-to-end performance improvements.
Our future research will explore interfaces and algorithms that allow even more cooperation between host software and device firmware. For example, we envision an interface that would allow the host system to direct the device to write a block to any of several locations (whichever is most efficient); the device would then return the resulting location, which would be recorded in the host's metadata structures. Such an interface would allow the host (e.g., database or file system) and the device to collaborate when making allocation decisions, resulting in greater efficiency but no loss of host control.
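To make the envisioned "write to any of these locations" request concrete, here is a minimal sketch in Python. It is not an existing storage interface: the ToyDisk class, the write_any call, and the host_write helper are hypothetical names, and the toy device considers only rotational distance on a single track (seeks are ignored).

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class ToyDisk:
    """Hypothetical toy device: a single track with a known head position."""
    sectors_per_track: int
    head_sector: int = 0  # sector currently under the head

    def rotational_distance(self, lbn: int) -> int:
        """Sectors that must rotate past before `lbn` reaches the head."""
        target = lbn % self.sectors_per_track
        return (target - self.head_sector) % self.sectors_per_track

    def write_any(self, candidates: Sequence[int], data: bytes) -> int:
        """Write to whichever candidate is cheapest and return the LBN used."""
        if not candidates:
            raise ValueError("need at least one candidate location")
        chosen = min(candidates, key=self.rotational_distance)
        # A real device would transfer `data` here; the toy only moves the head.
        self.head_sector = (chosen + 1) % self.sectors_per_track
        return chosen


def host_write(disk: ToyDisk, metadata: Dict[str, int], name: str,
               free_lbns: Sequence[int], data: bytes) -> int:
    """Host side: offer acceptable locations, then record the device's choice."""
    lbn = disk.write_any(free_lbns, data)
    metadata[name] = lbn
    return lbn


disk = ToyDisk(sectors_per_track=100, head_sector=40)
metadata: Dict[str, int] = {}
chosen = host_write(disk, metadata, "blockA", [10, 45, 90], b"payload")
print(chosen, metadata)
```

The host keeps control because it alone decides which locations are acceptable; the device only picks among them and reports back.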
Another important practical research question that must be answered in pursuing this vision is how much device-specific knowledge and low-level control is available from outside the device firmware. In particular, if complete knowledge and control is available externally, then it may be unnecessary for host software to cooperate with device firmware -- instead, the host software can directly do exactly what needs to be done, bypassing the firmware engineers entirely. Although we do not believe complete control to be possible, it is important to work on understanding how close one can get. The paper Freeblock Scheduling Outside of Disk Firmware [Lumb01] describes our experiences with OS-level freeblock scheduling.
An equally important question relates to the practical limitations
involved with working within disk firmware, for example those related
to ASIC interactions, timeliness requirements, and limited runtime support.
In some sense, this is not deep research, since disk manufacturers have
been extending firmware for years. However, it is a critical practical
consideration for any work that proposes extensions. We are working
with Seagate to experiment with freeblock scheduling inside their disk
firmware.
Modern host software (e.g., file systems or databases) performs aggressive on-disk placement and request coalescing, but generally does so with only a vague view of device characteristics -- the focus is on the notion that "bigger is better, because positioning delays are amortized over more data transfer." However, there do exist circumstances where considering specific boundaries can make a big difference.
Track-aligned extents is a new approach to matching system workloads to device characteristics and firmware enhancements. By placing and accessing largish data objects on track boundaries, one can avoid most rotational latency and track crossing overheads. Specifically, accessing a full track's data has two significant benefits: avoiding track switches and thus positioning delays, and eliminating rotational latency by accessing sectors on the media in the order that they pass under the read/write head instead of ascending LBN order. Thus, all sectors on a track can be accessed in a single rotation, regardless of which sector passes under the head first.
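The single-rotation property can be illustrated with a small sketch. The function names and the 8-sector track are assumptions made for illustration; the model counts time from the moment the head arrives on the track and ignores seeks and track switches.

```python
def zero_latency_order(sectors_per_track, head_at):
    """Order in which a full track is read when the transfer starts with
    whichever sector is under the head, rather than with sector 0."""
    return [(head_at + i) % sectors_per_track for i in range(sectors_per_track)]


def rotations_needed(sectors_per_track, head_at, zero_latency):
    """Rotations needed to read the whole track after the head arrives."""
    if zero_latency:
        return 1.0  # every sector passes under the head exactly once
    # In-order access must first wait for sector 0 to come around.
    wait = (-head_at) % sectors_per_track
    return wait / sectors_per_track + 1.0


print(zero_latency_order(8, 5))       # [5, 6, 7, 0, 1, 2, 3, 4]
print(rotations_needed(8, 5, True))   # 1.0
print(rotations_needed(8, 5, False))  # 1.375
```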
Combined, these benefits can increase large access efficiency by 25 to 55%, depending on seek penalties. They can also make disk access times much more predictable, which is important to real-time applications (e.g., video servers). However, these benefits are fully realized only for accesses that are track-sized and track-aligned. We believe that such accesses could be made much more common if file systems were modified to use track-aligned extents, specifically sized and aligned to match track boundaries.
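As a rough sketch of what sizing and aligning to track boundaries involves, the Python below derives variable-sized extents from a list of track-start LBNs and splits a request so that no piece crosses a boundary. The boundary values are made up for illustration, the function names are hypothetical, and how a file system would discover the real boundaries is outside the sketch.

```python
from bisect import bisect_right


def track_extents(track_starts, disk_end):
    """Return one (first_lbn, length) extent per track.

    `track_starts` is a sorted list of the first LBN on each track;
    `disk_end` is one past the last LBN. Lengths may differ per track.
    """
    bounds = list(track_starts) + [disk_end]
    return [(bounds[i], bounds[i + 1] - bounds[i])
            for i in range(len(track_starts))]


def split_at_tracks(track_starts, start, length):
    """Split [start, start + length) into pieces that stay within one track."""
    pieces = []
    end = start + length
    while start < end:
        i = bisect_right(track_starts, start)  # index of next boundary
        next_boundary = track_starts[i] if i < len(track_starts) else end
        stop = min(end, next_boundary)
        pieces.append((start, stop - start))
        start = stop
    return pieces


starts = [0, 334, 668, 1000]  # hypothetical track-start LBNs
print(track_extents(starts, 1330))
print(split_at_tracks(starts, 300, 400))
```

In this example only the middle piece, (334, 334), covers a whole track; the two partial pieces at either end would not see the full benefit.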
Track-aligned extents are most valuable for workloads that involve
many large requests to distinct portions of the disk. Video servers
represent an ideal application. Although video is often thought of as
"streaming media," video storage access patterns consist of requests for
relatively large segments of data, read individually at a rate that allows the video
to be displayed smoothly. A video server interleaves the segment fetches
for several videos such that they all keep up. The result is non-contiguous
requests, where the discontinuities are due to timeliness requirements
rather than allocation decisions. The number of videos that can be played
simultaneously depends both on the average-case performance and the
bounds that can be placed on response times. Track-aligned extents can
substantially increase video server throughput by making video segment
fetches both more efficient and more predictable.
Disk firmware includes support for aggressive scheduling of media accesses in order to maximize efficiency. In its current form, however, this scheduling concerns itself only with overall throughput. As a result, host software does not entrust disk firmware with scheduling decisions for large sets of mixed-priority requests. Freeblock scheduling is a new approach to media bandwidth utilization and request scheduling that uses the accurate predictions needed for aggressive scheduling to combine minimized response times for high-priority requests with improved efficiency and steady forward progress for lower-priority requests. Specifically, by interleaving low-priority disk activity with the normal workload, freeblock scheduling replaces the rotational latency delays of high-priority requests with background media transfers. With appropriate freeblock scheduling, background tasks can receive 20 to 50% of a disk's potential media bandwidth without any increase in foreground request service times.
Fundamentally, the only time the disk head cannot be transferring data sectors to or from the media is during a seek. In fact, in most modern disk drives, the firmware will transfer a large request's data to or from the media "out of order" to minimize wasted time; this feature is sometimes referred to as zero-latency or immediate access. While seeks are unavoidable costs associated with accessing desired data locations, rotational latency is an artifact of not doing something more useful with the disk head. Since disk platters rotate constantly, a given sector will rotate past the disk head at a given time, independent of what the disk head is doing up until that time, offering an opportunity for something more useful to be done. Freeblock scheduling consists of predicting how much rotational latency will occur before the next foreground media transfer, squeezing some additional media transfers into that time, and still getting to the destination track in time for the foreground transfer.
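The three steps (predict the rotational latency, fill it with background transfers, still arrive on time) can be sketched in a deliberately simplified model. Assumptions, none of which come from the text above: time is measured in sector-times, tracks are rotationally aligned with no skew, the seek time is known exactly, and background reads are taken only from the track the head is already on. All function names are hypothetical.

```python
def rotational_latency(n, head, seek, target):
    """Sector-times the head would idle on the destination track if it
    seeked immediately. `n` is sectors per track, `head` is the sector
    under the head now, `seek` is the seek time in sector-times."""
    return (target - (head + seek)) % n


def foreground_start(n, head, seek, target):
    """Time at which the foreground transfer can begin."""
    return seek + rotational_latency(n, head, seek, target)


def freeblock_picks(n, head, seek, target, wanted):
    """Background sectors that can be read on the current track before
    the seek has to begin, without delaying the foreground request."""
    slack = rotational_latency(n, head, seek, target)
    passing = [(head + i) % n for i in range(slack)]
    return [s for s in passing if s in wanted]


# Head at sector 10, seek costs 20 sector-times, foreground target is 50.
print(rotational_latency(100, 10, 20, 50))                 # 20
print(freeblock_picks(100, 10, 20, 50, {12, 25, 40, 95}))  # [12, 25]
print(foreground_start(100, 10, 20, 50))                   # 40
```

In this model the foreground transfer starts at time 40 whether the head idles for 20 sector-times on the destination track or spends them reading sectors 10 through 29 before seeking, which is why the background reads are "free."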
Anticipated applications for freeblock scheduling include scanning
large portions of disk contents (e.g., data mining of an active
transaction processing system, where it has been shown that over 47
full scans per day of a 9GB disk can be made with no impact on OLTP
performance) and internal storage optimization (e.g., placing related
data contiguously for sequential disk access, or segment cleaning in
log-structured file systems, which resulted in a 300% speedup for
application benchmarks).
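For a sense of scale, the scan figure can be converted into a sustained background transfer rate. The calculation below assumes that "9GB" means 9 x 10^9 bytes and that the scans are spread evenly over the day; both are assumptions, not statements from the cited results.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def background_bandwidth(scans_per_day, disk_bytes):
    """Average bytes per second needed to complete the given scans."""
    return scans_per_day * disk_bytes / SECONDS_PER_DAY


rate = background_bandwidth(47, 9 * 10**9)
print(f"{rate / 1e6:.1f} MB/s")  # 4.9 MB/s
```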
[Schindler01] Track-aligned Extents: Matching Access Patterns to Disk Drive Characteristics. Jiri Schindler, John Linwood Griffin, Christopher R. Lumb, Gregory R. Ganger. To appear, Conference on File and Storage Technologies (FAST), January 28-30, 2002, Monterey, CA.
[Lumb00] Towards Higher Disk Head Utilization: Extracting "Free" Bandwidth From Busy Disk Drives. Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger, David F. Nagle and Erik Riedel. Proc. of the 4th Symposium on Operating Systems Design and Implementation, 2000.
[Lumb01] Freeblock Scheduling Outside of Disk Firmware. Christopher R. Lumb, Jiri Schindler, Gregory R. Ganger. To appear, Conference on File and Storage Technologies (FAST), January 28-30, 2002, Monterey, CA.