Database systems are among the most important application domains for
storage systems. Currently, we are approaching database research on
several fronts, with a common goal of identifying innovative ways to
improve system performance by better matching database systems to the
underlying hardware.
Traditional database system architectures face a rapidly evolving operating
environment, where millions of users store and access terabytes of data.
To cope with increasing performance demands, high-end DBMSs employ
parallel processing techniques coupled with a plethora of sophisticated
features. However, the widely adopted, work-centric, thread-parallel
execution model entails several shortcomings that limit server performance
when executing workloads with changing requirements. Moreover, the monolithic
approach to DBMS software has led to complex, difficult-to-extend
designs. This project introduces a staged design for a high-performance,
evolvable DBMS that is easy to tune and maintain. We propose to break
the database system into modules and to encapsulate them into self-contained
stages connected to each other through queues. The staged, data-centric
design remedies the weaknesses of modern DBMSs by providing solutions
at both the hardware and the software engineering levels.
The Staged Database System design: Each stage has its own queue and thread support. New queries queue up in the first stage, are encapsulated into a "packet," and pass through the five stages shown at the top of the figure. A packet carries the query's "backpack": its state and private data. Inside the execution engine, a query can issue multiple packets to increase parallelism.
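The control flow is easiest to see in code. The following is a minimal, single-threaded C sketch of the pipeline, with illustrative stage names and a toy fixed-size queue; in the actual design each stage runs its own threads and scheduling is considerably more sophisticated.

    #include <stdio.h>
    #include <string.h>

    #define QUEUE_CAP 64
    #define NSTAGES   5

    typedef struct {           /* the query "packet" and its backpack */
        int  query_id;
        char state[128];       /* private data carried between stages */
    } Packet;

    typedef struct {           /* one stage: a FIFO queue of packets */
        const char *name;
        Packet queue[QUEUE_CAP];
        int head, tail;
    } Stage;

    static void enqueue(Stage *s, Packet p) {
        s->queue[s->tail++ % QUEUE_CAP] = p;
    }

    static int dequeue(Stage *s, Packet *out) {
        if (s->head == s->tail) return 0;
        *out = s->queue[s->head++ % QUEUE_CAP];
        return 1;
    }

    int main(void) {
        /* Five front-end stages, as in the figure (names illustrative). */
        Stage stages[NSTAGES] = {
            {"connect"}, {"parse"}, {"optimize"}, {"execute"}, {"output"}
        };

        Packet q = { .query_id = 1 };
        strcpy(q.state, "SELECT ...");
        enqueue(&stages[0], q);   /* a new query queues up in stage 0 */

        /* Each stage drains its queue and hands packets to the next. */
        for (int i = 0; i < NSTAGES; i++) {
            Packet p;
            while (dequeue(&stages[i], &p)) {
                printf("stage %-8s: query %d\n", stages[i].name, p.query_id);
                if (i + 1 < NSTAGES) enqueue(&stages[i + 1], p);
            }
        }
        return 0;
    }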
When optimizing cache utilization, data placement is extremely important.
In commercial DBMSs, data misses in the cache hierarchy are a key memory
bottleneck. The traditional data placement scheme used in DBMSs, the
N-ary Storage Model (NSM), stores records contiguously starting from
the beginning of each disk page and uses an offset (slot) table at
the end of the page to locate the beginning of each record; this layout
offers high intra-record locality. Recent research, however, indicates
that cache utilization and performance are becoming increasingly important
on modern platforms. This project first demonstrates that in-page data
placement is the key to high cache performance and that NSM exhibits
low cache utilization on modern platforms. Therefore, we propose a new
data organization model called PAX (Partition Attributes Across), which
significantly improves cache performance by grouping together all values
of each attribute within each page. Because PAX only affects layout
inside the pages, it incurs no storage penalty and does not affect I/O
behavior. Our experimental results indicate that, compared to NSM,
(a) PAX exhibits superior cache and memory bandwidth utilization, saving
at least 75% of NSM's stall time due to data cache accesses, (b) range
selection queries and updates on memory-resident relations execute 17-25%
faster, and (c) TPC-H queries involving I/O execute 11-48% faster. We
have also found that PAX performs well across different memory system
designs.
PAX/NSM speedup for DSS queries: Shows PAX/NSM speedups when running range selections and four TPC-H queries against 100-, 200-, and 500-MB TPC-H databases on top of the Shore storage manager. Decision-support systems are especially processor- and memory-bound, and PAX outperforms NSM in all of these experiments. The speedups are not constant across experiments because of differing amounts of I/O and interactions between the hardware and the algorithms used.
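To make the layout difference concrete, here is a minimal C sketch contrasting an NSM-style page with a PAX-style page for two fixed-width attributes. The structures are illustrative only; real PAX pages also manage variable-length values and per-minipage metadata.

    #include <stdio.h>

    #define PAGE_RECORDS 4

    /* NSM: whole records stored contiguously (high intra-record locality). */
    struct nsm_page {
        struct { int id; double balance; } records[PAGE_RECORDS];
    };

    /* PAX: the same records, but all values of each attribute grouped into
       a "minipage", so a scan of one attribute touches dense cache lines. */
    struct pax_page {
        int    id_minipage[PAGE_RECORDS];
        double balance_minipage[PAGE_RECORDS];
    };

    int main(void) {
        struct pax_page p = {
            .id_minipage      = { 1, 2, 3, 4 },
            .balance_minipage = { 10.0, 20.0, 30.0, 40.0 },
        };

        /* A selection on one attribute reads only that minipage; under NSM
           the same scan would drag every record's other fields through the
           cache as well. */
        double sum = 0;
        for (int i = 0; i < PAGE_RECORDS; i++)
            sum += p.balance_minipage[i];
        printf("sum(balance) = %.1f\n", sum);
        return 0;
    }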
We are exploring two latency-hiding approaches, cache prefetching and I/O prefetching, to improve database performance. We have used cache prefetching to improve main-memory B+-Tree performance (SIGMOD'01), studied fractal prefetching B+-Trees that optimize both disk and cache performance with a single index structure (SIGMOD'02), and improved hash join performance with prefetching (submitted). We are now focusing on prediction schemes that use history information to improve general join algorithms.
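As a flavor of the cache-prefetching technique, the C sketch below hides miss latency by prefetching hash buckets a fixed distance ahead of the probe currently being processed. It assumes GCC/Clang's __builtin_prefetch and a toy hash table; the published schemes (group and software-pipelined prefetching) are considerably more refined.

    #include <stdio.h>
    #include <stdlib.h>

    #define N        (1 << 20)
    #define DISTANCE 8          /* how far ahead to issue prefetches */

    int main(void) {
        int *bucket_of = malloc(N * sizeof *bucket_of);  /* probe targets */
        int *buckets   = calloc(N, sizeof *buckets);     /* toy hash table */
        for (int i = 0; i < N; i++) bucket_of[i] = rand() % N;

        long matches = 0;
        for (int i = 0; i < N; i++) {
            /* Issue the prefetch for a future probe while working on the
               current one, overlapping the cache miss with useful work. */
            if (i + DISTANCE < N)
                __builtin_prefetch(&buckets[bucket_of[i + DISTANCE]], 0, 1);
            matches += (buckets[bucket_of[i]] == 0);  /* all zero here */
        }
        printf("matches: %ld\n", matches);
        free(bucket_of); free(buckets);
        return 0;
    }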
Fractal Prefetching B+-Trees (fpB+-Trees) are a type of B+-Tree that optimizes both cache and I/O performance by embedding "cache-optimized" trees within "disk-optimized" trees. This improves the CPU cache performance of traditional B+-Trees that index disk-resident data, and the I/O performance of B+-Trees optimized for the cache. At a coarse granularity, an fpB+-Tree contains disk-optimized nodes that are roughly the size of a disk page; at a fine granularity, it contains cache-optimized nodes that are roughly the size of a cache line. The fpB+-Tree is called "fractal" because of its self-similar "tree within a tree" structure.
Recently, researchers have proposed new types of B+-Trees optimized
for CPU cache performance in main memory environments, where the tree
node sizes are one or a few cache lines. Unfortunately, due primarily
to this large discrepancy in optimal node sizes, existing disk-optimized
B+-Trees suffer from poor cache performance while cache-optimized B+-Trees
exhibit poor disk performance. We propose fractal prefetching B+-Trees
(fpB+-Trees), which embed "cache-optimized" trees within "disk-optimized"
trees, in order to optimize both cache and I/O performance. We design
and evaluate two approaches to breaking disk pages into cache-optimized
nodes: disk-first and cache-first. These approaches are somewhat biased
in favor of maximizing disk and cache performance, respectively, as
demonstrated by our results. Both implementations of fpB+-Trees achieve
dramatically better cache performance than disk-optimized B+-Trees:
a factor of 1.1-1.8 improvement for search, up to a factor of 4.2 improvement
for range scans, and up to a 20-fold improvement for updates, all without
significant degradation of I/O performance. In addition, fpB+-Trees
accelerate I/O performance for range scans by using jump-pointer arrays
to prefetch leaf pages, thereby achieving a speed-up of 2.5-5 on IBM's
DB2 Universal Database.
The fpB+-Tree's self-similar "tree within a tree" structure.
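Below is a minimal C sketch of the two granularities, with illustrative sizes and fanout. It shows only how cache-line-sized nodes can be packed inside a page-sized node; the disk-first and cache-first packing and splitting logic from the paper is omitted.

    #include <stdio.h>

    #define CACHE_LINE 64
    #define DISK_PAGE  8192
    #define FANOUT     7        /* keys per cache node (illustrative) */

    /* Fine granularity: a node sized to roughly a cache line or two. */
    struct cache_node {
        short nkeys;
        int   keys[FANOUT];
        short children[FANOUT + 1];  /* in-page offsets of child nodes */
    };

    /* Coarse granularity: a disk-page-sized node that is itself a small
       tree of cache-optimized nodes rooted at cache_nodes[0]. */
    struct page_node {
        int  npage_children;
        long page_children[1];       /* on-disk page ids (truncated here) */
        struct cache_node cache_nodes[
            (DISK_PAGE - 2 * sizeof(long)) / sizeof(struct cache_node)];
    };

    int main(void) {
        printf("cache node: %zu bytes (~%zu cache lines)\n",
               sizeof(struct cache_node),
               (sizeof(struct cache_node) + CACHE_LINE - 1) / CACHE_LINE);
        printf("page node:  %zu bytes (target %d)\n",
               sizeof(struct page_node), DISK_PAGE);
        return 0;
    }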
Today's commodity disk drives, the basic unit of storage for computer systems large and small, are actually small computers, with a processor, memory, and 'network' connection, along with the spinning magnetic material that permanently stores the data. As more and more of the information in the world becomes digitally available, and more and more of our daily activities are recorded and stored, people are increasingly finding value in analyzing, rather than simply storing and forgetting, these large masses of data. Sadly, advances in I/O performance have lagged the development of commodity processor and memory technology, putting pressure on systems to deliver data fast enough for these types of data-intensive analysis. Active Disks is a system that takes advantage of the processing power on individual disk drives to run application-level code. Moving portions of an application's processing directly to the disk drives can dramatically reduce data traffic and take advantage of the parallelism already present in large storage systems, providing a new point of leverage to overcome the I/O bottleneck.
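As a toy illustration of the idea (the function below is hypothetical, not the project's actual interface), an application-level filter running on the drive's processor would return only matching records, so far less data crosses the interconnect to the host:

    #include <stdio.h>

    typedef struct { int key; char payload[60]; } Record;

    /* Application code that would execute on the drive's processor,
       scanning local data and emitting only records that match. */
    static int on_drive_filter(const Record *in, int n, Record *out, int key) {
        int m = 0;
        for (int i = 0; i < n; i++)
            if (in[i].key == key) out[m++] = in[i];
        return m;
    }

    int main(void) {
        Record disk_data[4] = { {1, "a"}, {2, "b"}, {1, "c"}, {3, "d"} };
        Record result[4];
        /* The host receives 2 records instead of all 4: less data traffic,
           and each drive filters its own data in parallel. */
        int m = on_drive_filter(disk_data, 4, result, 1);
        printf("host received %d of 4 records\n", m);
        return 0;
    }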
For more information on Active Disks, please see the Active Disks project page.
FACULTY
Anastassia Ailamaki
Todd Mowry
STUDENTS
Stavros Harizopoulos
Minglong Shao
Stratos Papadomanolakis
Shimin Chen
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg, Datadog, Google, Honda, Intel Corporation, IBM, Jane Street, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.