Database systems are among the most important application domains for storage systems. We are currently approaching database research on several fronts, with a common goal of identifying innovative ways to improve system performance by better matching database systems to the underlying hardware.
                        
                    
Traditional database system architectures face a rapidly evolving operating environment, where millions of users store and access terabytes of data. To cope with increasing performance demands, high-end DBMSs employ parallel processing techniques coupled with a plethora of sophisticated features. However, the widely adopted, work-centric, thread-parallel execution model entails several shortcomings that limit server performance when executing workloads with changing requirements. Moreover, the monolithic approach in DBMS software has led to complex and difficult-to-extend designs. This project introduces a staged design for high-performance, evolvable DBMSs that are easy to tune and maintain. We propose to break the database system into modules and to encapsulate them into self-contained stages connected to each other through queues. The staged, data-centric design remedies the weaknesses of modern DBMSs by providing solutions at both the hardware and the software engineering level.
  
Figure: The Staged Database System design. Each stage has its own queue and thread support. New queries queue up in the first stage, are encapsulated into a "packet", and pass through the five stages shown at the top of the figure. A packet carries the query's "backpack": its state and private data. Inside the execution engine, a query can issue multiple packets to increase parallelism.
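To make the stage-and-queue flow concrete, here is a minimal, single-threaded C sketch of packets moving through a staged pipeline. The stage names, packet fields, and queue capacity are hypothetical simplifications; a real staged DBMS would give each stage its own thread pool and scheduling policy.

```c
/* Minimal sketch of a staged pipeline. A "packet" carries the query's
 * "backpack": its state and private data. All names are illustrative. */
#include <stdio.h>

#define QUEUE_CAP 8

typedef struct {                 /* the query's "backpack" */
    int  query_id;
    char sql[64];
    char plan[64];               /* filled in by the optimizer stage */
} packet_t;

typedef struct {                 /* each stage owns a queue of packets */
    const char *name;
    packet_t    queue[QUEUE_CAP];
    int         head, tail;      /* overflow checks omitted for brevity */
    void      (*work)(packet_t *);
} stage_t;

static void enqueue(stage_t *s, packet_t p) { s->queue[s->tail++ % QUEUE_CAP] = p; }
static int  dequeue(stage_t *s, packet_t *p) {
    if (s->head == s->tail) return 0;
    *p = s->queue[s->head++ % QUEUE_CAP];
    return 1;
}

static void parse(packet_t *p)    { printf("parse   q%d: %s\n", p->query_id, p->sql); }
static void optimize(packet_t *p) { snprintf(p->plan, sizeof p->plan, "scan+filter"); }
static void execute(packet_t *p)  { printf("execute q%d with plan %s\n", p->query_id, p->plan); }

int main(void) {
    stage_t stages[] = {
        { "parser",    {{0}}, 0, 0, parse },
        { "optimizer", {{0}}, 0, 0, optimize },
        { "executor",  {{0}}, 0, 0, execute },
    };
    packet_t p = { 1, "SELECT * FROM t WHERE a > 10", "" };
    enqueue(&stages[0], p);

    /* Drain each stage in turn, forwarding packets downstream. A real
     * staged server runs stages concurrently on their own threads. */
    for (int i = 0; i < 3; i++) {
        packet_t cur;
        while (dequeue(&stages[i], &cur)) {
            stages[i].work(&cur);
            if (i + 1 < 3) enqueue(&stages[i + 1], cur);
        }
    }
    return 0;
}
```

The property the sketch preserves is data-centric batching: each stage drains its own queue, which is what allows a staged server to schedule related work together for locality.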
Data placement is extremely important when optimizing cache utilization. In commercial DBMSs, data misses in the cache hierarchy are a key memory bottleneck. The traditional data placement scheme used in DBMSs, the N-ary Storage Model (NSM), stores records contiguously starting from the beginning of each disk page and uses an offset (slot) table at the end of the page to locate the beginning of each record, offering high intra-record locality. Recent research, however, indicates that cache utilization and performance are becoming increasingly important on modern platforms. This project first demonstrates that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. We therefore propose a new data organization model called PAX (Partition Attributes Across), which significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. When compared to NSM, our experimental results indicate that (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM's stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster. We have also found that PAX performs well across different memory system designs.
  
Figure: PAX/NSM speedup for DSS queries. Shows PAX/NSM speedups when running range selections and four TPC-H queries against 100-, 200-, and 500-MB TPC-H databases on top of the Shore storage manager. Decision-support systems are especially processor- and memory-bound, and PAX outperforms NSM in all of these experiments. The speedups obtained are not constant across the experiments due to a combination of differing amounts of I/O and interactions between the hardware and the algorithms used.
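A minimal C sketch of the layout difference, assuming a toy page of four two-attribute records; the field names and sizes are illustrative, and real pages also carry headers and offset tables.

```c
/* NSM vs. PAX in-page layout for records of (int id, float amount). */
#include <stdio.h>

#define NRECS 4

/* NSM: whole records stored contiguously, so a scan of one attribute
 * touches every record and drags unused fields into the cache. */
struct nsm_page {
    struct { int id; float amount; } recs[NRECS];
};

/* PAX: the same records, but the values of each attribute grouped
 * into their own "minipage", so a scan of `amount` reads one dense array. */
struct pax_page {
    int   id[NRECS];       /* minipage for attribute 1 */
    float amount[NRECS];   /* minipage for attribute 2 */
};

int main(void) {
    struct nsm_page n = {{ {1, 9.5f}, {2, 3.0f}, {3, 7.25f}, {4, 1.0f} }};
    struct pax_page p;

    /* Converting NSM to PAX only rearranges values inside the page,
     * which is why PAX adds no storage and leaves I/O unchanged. */
    for (int i = 0; i < NRECS; i++) {
        p.id[i]     = n.recs[i].id;
        p.amount[i] = n.recs[i].amount;
    }

    /* Predicate evaluation on one attribute now has perfect spatial
     * locality: each cache line holds only `amount` values. */
    float sum = 0;
    for (int i = 0; i < NRECS; i++)
        if (p.amount[i] > 2.0f) sum += p.amount[i];
    printf("sum of amounts > 2.0: %.2f\n", sum);
    return 0;
}
```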
We are exploring two latency-hiding approaches, cache prefetching and I/O prefetching, to improve database performance. We have used cache prefetching to improve main-memory B+-Tree performance (SIGMOD'01), studied fractal prefetching B+-Trees to optimize both disk and cache performance with a single index structure (SIGMOD'02), and improved hash join performance with prefetching (submitted). We are now focusing on prediction schemes that use history information to improve general join algorithms.
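As an illustration of the cache-prefetching idea, here is a small C sketch of a leaf-level scan that fetches the next node while processing the current one. The node layout is a hypothetical simplification, and __builtin_prefetch is the GCC/Clang intrinsic, standing in for whatever mechanism a real implementation would use.

```c
/* Software cache prefetching during a linked leaf scan. */
#include <stdio.h>

#define KEYS_PER_LEAF 8

typedef struct node {
    int          keys[KEYS_PER_LEAF];
    struct node *next;               /* next leaf in key order */
} node_t;

static long scan_sum(node_t *leaf) {
    long sum = 0;
    while (leaf) {
        /* Start fetching the next leaf now, so its memory latency
         * overlaps with the work done on the current leaf. */
        if (leaf->next)
            __builtin_prefetch(leaf->next, /*rw=*/0, /*locality=*/1);
        for (int i = 0; i < KEYS_PER_LEAF; i++)
            sum += leaf->keys[i];
        leaf = leaf->next;
    }
    return sum;
}

int main(void) {
    node_t leaves[3];
    for (int n = 0; n < 3; n++) {
        for (int i = 0; i < KEYS_PER_LEAF; i++)
            leaves[n].keys[i] = n * KEYS_PER_LEAF + i;
        leaves[n].next = (n < 2) ? &leaves[n + 1] : NULL;
    }
    printf("sum = %ld\n", scan_sum(leaves));   /* 0 + 1 + ... + 23 = 276 */
    return 0;
}
```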
Fractal Prefetching B+-Trees (fpB+-Trees) are a type of B+-Tree that optimizes both cache and I/O performance by embedding "cache-optimized" trees within "disk-optimized" trees. This improves CPU cache performance in traditional B+-Trees that index disk-resident data, and I/O performance in B+-Trees optimized for the cache. At a coarse granularity, an fpB+-Tree contains disk-optimized nodes that are roughly the size of a disk page; at a fine granularity, it contains cache-optimized nodes that are roughly the size of a cache line. The fpB+-Tree is called "fractal" because of its self-similar "tree within a tree" structure.
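The "tree within a tree" layout can be sketched in C as follows; all sizes and field choices here are illustrative assumptions, not the actual fpB+-Tree format.

```c
/* A disk-page-sized outer node holding a small tree of
 * cache-line-sized inner nodes. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64
#define DISK_PAGE  8192
#define CL_NODES   (DISK_PAGE / CACHE_LINE)

/* Cache-optimized node, sized to exactly one cache line. Children are
 * indices of other cache-line nodes within the same page, or (at the
 * bottom of the in-page tree) slots referring to other disk pages. */
typedef struct {
    int32_t keys[7];
    int16_t child[8];    /* in-page node index or child-page slot */
    int16_t nkeys;
    char    pad[CACHE_LINE - 7 * sizeof(int32_t)
                - 8 * sizeof(int16_t) - sizeof(int16_t)];
} cl_node_t;

_Static_assert(sizeof(cl_node_t) == CACHE_LINE,
               "cache-optimized node must fill one cache line");

/* Disk-optimized node: one page holding a small tree of cl_node_t.
 * A search pays one disk read for the page, then walks only a few
 * cache lines inside it instead of binary-searching the whole page. */
typedef struct {
    cl_node_t nodes[CL_NODES];   /* nodes[0] is the in-page root */
} fp_page_t;

int main(void) {
    printf("%zu-byte cache nodes, %d per %zu-byte page\n",
           sizeof(cl_node_t), CL_NODES, sizeof(fp_page_t));
    return 0;
}
```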
Recently, researchers have proposed new types of B+-Trees optimized for CPU cache performance in main memory environments, where the tree node sizes are one or a few cache lines. Unfortunately, due primarily to this large discrepancy in optimal node sizes, existing disk-optimized B+-Trees suffer from poor cache performance while cache-optimized B+-Trees exhibit poor disk performance. We propose fractal prefetching B+-Trees (fpB+-Trees), which embed "cache-optimized" trees within "disk-optimized" trees, in order to optimize both cache and I/O performance. We design and evaluate two approaches to breaking disk pages into cache-optimized nodes: disk-first and cache-first. These approaches are somewhat biased in favor of maximizing disk and cache performance, respectively, as demonstrated by our results. Both implementations of fpB+-Trees achieve dramatically better cache performance than disk-optimized B+-Trees: a factor of 1.1-1.8 improvement for search, up to a factor of 4.2 improvement for range scans, and up to a 20-fold improvement for updates, all without significant degradation of I/O performance. In addition, fpB+-Trees accelerate I/O performance for range scans by using jump-pointer arrays to prefetch leaf pages, thereby achieving a speedup of 2.5-5x on IBM's DB2 Universal Database.
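A minimal sketch of the jump-pointer idea for range scans, with posix_fadvise standing in for the storage manager's asynchronous prefetch call; the prefetch depth, page size, and file layout are all hypothetical.

```c
/* Jump-pointer I/O prefetching: an auxiliary array of leaf-page
 * numbers lets a range scan request pages several steps ahead. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define PAGE_SIZE      8192
#define PREFETCH_DEPTH 4          /* how many leaves to stay ahead */

static void range_scan(int fd, const long *jump, int nleaves) {
    char page[PAGE_SIZE];
    for (int i = 0; i < nleaves; i++) {
        /* Hint an upcoming leaf before blocking on the current one,
         * so its transfer overlaps with processing. */
        if (i + PREFETCH_DEPTH < nleaves)
            posix_fadvise(fd, jump[i + PREFETCH_DEPTH] * PAGE_SIZE,
                          PAGE_SIZE, POSIX_FADV_WILLNEED);
        if (pread(fd, page, PAGE_SIZE, jump[i] * PAGE_SIZE) == PAGE_SIZE)
            printf("processed leaf page %ld\n", jump[i]);
    }
}

int main(void) {
    /* Build a toy file of 8 "leaf pages" to scan. */
    int fd = open("/tmp/fp_leaves.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return 1;
    char zero[PAGE_SIZE] = {0};
    for (int i = 0; i < 8; i++)
        if (write(fd, zero, PAGE_SIZE) != PAGE_SIZE) return 1;
    long jump[8] = {0, 1, 2, 3, 4, 5, 6, 7};  /* leaves happen to be in order */
    range_scan(fd, jump, 8);
    close(fd);
    return 0;
}
```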
  
Figure: The fpB+-Tree's self-similar "tree within a tree" structure.
  
Today's commodity disk drives, the basic unit of storage for computer systems large and small, are actually small computers, with a processor, memory, and 'network' connection, along with the spinning magnetic material that permanently stores the data. As more and more of the information in the world becomes digitally available, and more and more of our daily activities are recorded and stored, people are increasingly finding value in analyzing, rather than simply storing and forgetting, these large masses of data. Sadly, advances in I/O performance have lagged the development of commodity processor and memory technology, putting pressure on systems to deliver data fast enough for these types of data-intensive analysis. Active Disks is a system that takes advantage of the processing power on individual disk drives to run application-level code. Moving portions of an application's processing directly to the disk drives can dramatically reduce data traffic and take advantage of the parallelism already present in large storage systems, providing a new point of leverage to overcome the I/O bottleneck.
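A toy C sketch of the idea, with the drive simulated by a function: shipping a predicate to where the data lives means only matching records cross the interconnect. The record format and predicate are hypothetical.

```c
/* Pushing a filter down to the drive so only matches are returned. */
#include <stdio.h>

typedef struct { int key; int value; } record_t;

/* Application-supplied predicate, run by the drive's own processor. */
static int hot_record(const record_t *r) { return r->value > 90; }

/* Runs "on the drive": scans local data and emits only matches, so
 * data traffic shrinks from all records to the selected few. */
static int drive_scan(const record_t *data, int n,
                      int (*pred)(const record_t *), record_t *out) {
    int m = 0;
    for (int i = 0; i < n; i++)
        if (pred(&data[i])) out[m++] = data[i];
    return m;
}

int main(void) {
    record_t disk_data[] = { {1, 42}, {2, 95}, {3, 17}, {4, 99} };
    record_t matches[4];
    int m = drive_scan(disk_data, 4, hot_record, matches);
    printf("drive returned %d of 4 records\n", m);  /* only 2 cross the wire */
    return 0;
}
```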
For more information on Active Disks, please see the Active Disks project page.
FACULTY
Anastassia Ailamaki
Todd Mowry

STUDENTS
Stavros Harizopoulos
Minglong Shao
Stratos Papadomanolakis
Shimin Chen
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg LP, Datadog, Google, Intel Corporation, Jane Street, LayerZero Research, Meta, Microsoft Research, Oracle Corporation, Oracle Cloud Infrastructure, Pure Storage, Salesforce, Samsung Semiconductor Inc., and Western Digital for their interest, insights, feedback, and support.