An important trend in the design of storage subsystems is a move toward
direct network attachment and increased intelligence at storage devices.
Network-attached storage offers the opportunity to offload file system
and storage management functionality from dedicated server machines
and execute many requests directly at storage devices without server
intervention. Raising the level of the storage interface above the simple
linear address space of SCSI allows more efficient operation at the
device and promises more scalable subsystems. This work takes this interface
one step further and suggests that allowing application-specific code
to be executed at storage devices on behalf of clients/servers can make
more effective use of device, client and interconnection resources and
considerably improve application I/O performance. Remote execution of
code directly at storage devices allows filter operations to be performed
close to the data; allows optimization of timing-sensitive transfers
by taking advantage of application-specific knowledge at the storage
device; allows management functions to be customized and updated without
requiring firmware upgrades; and makes possible complex or specialized
operations than a general-purpose storage interface would normally support.
The processing power available
on disk drives is rapidly increasing. Modern SCSI disks contain microprocessors
that are only three or four generations behind top-of-the-line host
processors. A high-end drive available from Quantum today is driven
by a 25 MHz Motorola 68020 along with a single specialized chip for
handling SCSI, servo, and disk functions. Improvements in chip technology
make it conceivable to include an integrated 100 MHz RISC core in the
same die space as the current ASIC and still leave room for the additional
cryptographic and network processing required by network-attached disks.
The graphic below shows how the electronics of the drive on the left
have shrunk into a single ASIC that includes all the basic drive functions.
By moving from the current 0.68 micron to a 0.35 micron silicon process,
we free up die area for additional networking and security functions
and make room for an embedded 200 MHz StrongARM to replace the microcontroller
in the original drive. All drive control is now combined into a single
chip with a significant amount of additional computing power.
This diagram was created by photo-reducing an image of the existing
Trident ASIC, but chips similar to the one shown have been announced
by Siemens (Tri-Core -
100 MHz RISC core, up to 2 MB of memory, up to 500 MIPS within 2 years),
Cirrus Logic (3CI - ARM7 core,
moving to 200 MHz ARM9 core in the second generation), and Texas Instruments
(C27x -
150 MIPS in the first generation, 16 MB address space). |
This microprocessor is not involved
in the balance of the fastpath processing on the drive and will have
cycles to spare in normal operation. We propose Active Disks as a way
to take advantage of these cycles to provide value-added processing
directly at disks. Active Disks allow application-specific code to execute
inside the disk in order to reduce load on the network, offload processing
currently done by clients/servers, and enable novel functions that can
take advantage of closer knowledge of the disks' internal state than
a general storage interface can provide.
Table 1: If we estimate that current disk drives have the equivalent of 25 MHz of host processing speed available, large database systems today already contain more processing power on their combined disks than at the server processors. Assuming a reasonable 10 MB/s for sequential scans, we also see that the aggregate storage bandwidth is more than twice the backplane bandwidth of the machine in almost every case. Data from [TPC98] and [Barclay97]. |
With drive media rates at 20 MB/s
today (e.g. Seagate's Cheetah) and expected to reach 30 MB/s by the end of the decade,
the limiting factor in disk I/O will no longer be the drive mechanics,
but the “network” latency to access the devices and the processing
required at clients/servers to manage the I/O. Processing that can be
performed cheaply at the disks will offload the other system components
and can significantly improve overall user-visible performance. The
table above shows that even with relatively low-powered processors (assuming
the 25 MHz microprocessors already in drives today) and low drive bandwidths
(a modest 10 MB/s) the aggregate processing power available on the disks
attached to most large database servers already exceeds that of the
server CPUs. Even more importantly, the I/O backplanes of these machines
cannot keep up with the total throughput available from the storage
devices. Allowing processing directly at disks greatly increases total
computational power and allows application-level throughput at the level
of what the storage devices can provide.
The most promising candidate applications for Active Disks will be able to leverage the parallelism in highly concurrent workloads by striping across a large number of drives. The ability to leverage the processing power of tens or 100s of disks can more than compensate for the lower relative MIPS of single drives compared to host processors. On-drive computations should be localized to small amounts of data, essentially performing a small amount of processing as data streams past from the disk media on its way to the network. Remote functions should have small code/cycle footprint per byte processed in order to keep data moving at near media rates and allow scheduling of remote computation with normal drive activity. The ability to access internal drive state and take advantage of on-drive scheduling mechanisms enables a range of storage management and real time functions that are not possible with today's interfaces.
We have identified a set of five categories of applications that may benefit from Active Disks, each of which take advantage of a different set of on-drive features:
We have experimented with a number of applications in data mining and multimedia that could benefit from an Active Disk architecture.
The first application we looked at is association rule discovery in
point-of-sale data. The purpose of the application is to extract rules
of the form if a customer purchases item A and B, then they are
also likely to purchase item X which can be used for store layout
or inventory decisions. The computation is done in several passes, first
determining the items that occur most often in the transactions (the 1-itemsets) and then using this information to generate pairs
of items that occur often (2-itemsets) and larger groupings (k-itemsets).
For the Active Disks system, the counting portion of each phase is performed
directly at the drives. The server produces the list of candidate k-itemsets
and provides this list to each of the disks. Each disk counts its portion
of the transactions locally, and returns these counts to the server.
The server then combines these counts and produces a list of candidate
(k+1)-itemsets which are sent back to the disks. This application reduces
an arbitrarily large number of transactions in a database into a single,
variably-sized set of summary statistics - the itemset counts - that
can be used to determine relationships in the data.
|
Our second application is an implementation of nearest- neighbor search
in a high- dimensionality database. We determine the k items
in a database of loan records that are closest to a particular input
item. For the Active Disk system, all the comparisons are done directly
at the drives. The server sends the target record to each of the disks
which determine the k closest records in their portions of the
database. These lists are then combined to determine the overall closest
records. Again, we see the traditional server bottleneck at a low number
of disks while the Active Disk system continues to scale to more than
2x faster with only 10 disks. For high- dimensionality data, traditional
indices lose much of their effectiveness and “brute force”sequential
scanning, which Active Disks are particularly good at, is competitive
with more complex methods using high-dimensional indices. |
For image processing, we looked at an application that detects edges
in a set of grayscale images. We use real images from IBM Almaden's
CattleCam and attempt to detect cows in the landscape above San Jose.
The application processes a set of 256 KB images and returns only the
edges found in the data using a fixed 37 pixel mask. The intent is to
model a class of image processing applications where only a particular
set of features (e.g. the edges) in an image are important, rather than
the entire image. This includes tracking, feature extraction, and positioning
applications that operate on only a small subset of the original images
data. This application is significantly more computation-intensive than
the comparisons and counting of the data mining applications.
|
Using the Active Disks system, edge detection for each image is performed
directly at the drives and only the edges are returned to the server.
A request for the raw image at the left returns only the data on the right,
which can be represented much more compactly. |
Our second image processing application performs the image registration
portion of the processing of an MRI brain scan analysis. Image registration
determines the set of parameters necessary to register (rotate and translate)
an image with respect to a reference image in order to compensate for
movement of the subject during scanning. The application processes a
set of images and returns the registration parameters for each image.
This application is the most computationally intensive of the ones studied
because the algorithm includes two FFT computations. For the Active
Disks system, this application operates similarly to the edge detection.
The reference image is provided to all the drives and the registration
for each image is calculated directly at the drives with only the final
parameters returned to the server.
|
The scaling is as before, with the server becoming bottlenecked after
about four disks, while the Active Disk system continues to scale. We
also see that the image processing applications are much more computationally
intensive than the data mining applications, achieving only 4.5 MB/s
and 650 KB/s of aggregate throughput with ten Active Disks. This is
far below the aggregate bandwidth possible from the media, but the computation
power of the Active Disks still moves it ahead of the server by a margin
of more than 2x. |
The existence of an execution environment at the drive makes it possible to provide management functions that are either more complex than drive firmware would normally allow or that are customized to the environment in which the drive is installed. For example, a backup function that took into account the configuration and workload patterns of a specific environment might be more efficient than a function provided in the drive firmware that had to support all possible environments and requirements. Having such management functions operate as remote programs also allows them to be updated to extended without rewriting the entire drive firmware. Possible management functions include backup; layout optimization based on filesystem- or application- specific knowledge, or on usage patterns observed at the drive; defragmentation; and reconfiguration.
As the use of multimedia becomes more widespread, there are a number of applications that require strict performance guarantees from storage systems. In existing systems, the accepted way to guarantee bandwidth is by having significant over-capacity in the system to ensure that peak requirements can be met. Applications which have soft real-time requirements, such as a streaming video display seeking to minimize jitter, require that specific deadlines be met, but also have properties that allow some flexibility. For example, an MPEG compressed movie has variable bandwidth requirements (due to non-uniform compression) which must usually be specified at the maximum rate to ensure delivery. It is also possible to drop frames - particularly I and B intermediate frames without affecting the display quality, as long as enough frames get through. Active Disks allow a video application to take advantage of these properties to schedule its use of storage and smooth its requirements, thereby allowing more efficient use of resources.
Specialized functions that require
specific semantics not normally provided by drives can be provided by
remote functions on Active Disks. This allows functionality specialized
to a particular environment or usage pattern to be executed where the
semantics are most efficiently implemented, rather than requiring additional
overhead in the higher levels of the system. Examples include a READ/MODIFY/WRITE operation or an atomic CREATE that would
both create a new file object and update the corresponding directory
object, for optimization of higher-level filesystems such as NFS on
NASD [Gibson97a]
One of the important questions for a remote execution system is what programming model is provided for user-defined functions. A well-defined and limited set of interfaces such as those provided by packet filters or SQL allow control over the safety and (to some extent) the efficiency of user-provided functions, but also limit the richness of functions that can be implemented. Providing a type-safe programming language and depending on a combination of compile-time and run-time checks provides greater flexibility, but requires careful design of the system interfaces allowed to user-defined functions to prevent holes in the safety mechanism [McGraw97] and may put a significant cost on the run-time system (e.g. Java ). Object-level editing allows the insertion of run-time safety checks into compiled programs from any source language, but imposes a translation cost, a possibly significant run-time cost and again requires very careful design of the system interfaces [Lucco93, Software Fault Isolation]. A system such as proof-carrying code moves the burden of ensuring safety from the run-time system to the code producer [Necula96] , but may limit the complexity of the programs that can be expressed.
Our work on Active Disks builds on our previous work in Network-Attached Secure Disks (NASD) which proposes making disks first-class citizens on the general-purpose network [Gibson97] and raising the storage interface above the simple block-level protocol of SCSI [Gibson97a] . Both the object oriented (rather than block oriented) interface to storage and the security system of NASD provide a solid base for Active Disk functions. These allow access and control at a coarse enough granularity that drives and on-drive functions can operate relatively autonomously, while retaining control and basic policy decisions in a central set of file managers.
Beyond this basic interface, there are a there are a number of functions
that make Active Disks more powerful and flexible. These must include
this basic filesystem API, as provided to regular clients, as well as
a form of input/output with the host application. More advanced functions
might benefit from asynchronous “callbacks” with the host,
some form of long-term state at the drive, the ability to specify processing
deadlines, including perhaps real-time guarantees, and admission control
to manage drive functions. In order to take full advantage of optimizations
at the drive, the ability for remote functions to inquire about the
state of the cache and block layout, and to control caching and layout
through a local interface would all be beneficial. Finally, some applications
(storage management in particular) might desire the capability to open
communication with 3rd parties (e.g. a tape device). The relative costs
and benefits of these different functions are a central open issue in
the design of an Active Disk environment.
We thank the members and companies of the PDL Consortium: Amazon, Datadog, Google, Honda, Intel Corporation, IBM, Jane Street, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.