Carnegie Mellon University School of Computer Science Ph.D. Dissertation CMU-CS-23-109. March 2023.
Thomas Kim
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
http://www.pdl.cmu.edu
As Moore’s law slows, persistent storage hardware has continued to scale, but at the cost of exposing hardware-level write idiosyncrasies to software. A key challenge for systems developers is therefore to reason about and design around these idiosyncrasies to create replicated storage systems that can effectively leverage these new technologies. Two examples of such new and emerging persistent storage technologies are Intel Optane non-volatile main memory and Zoned Namespace (ZNS) solid-state drives. Intel Optane provides persistent byte-addressable storage with throughput and latency rivaling DRAM, but per-DIMM write throughput is significantly lower than read throughput. This imbalance makes it difficult to provide high availability in replicated storage systems, because the ability to bulk-ingest data is severely limited. ZNS is a new interface for NVMe-based SSDs that eliminates the flash translation layer, thus preventing garbage collection-related performance degradation and reducing the need for overprovisioned flash hardware. A consequence of these benefits is the loss of overwrite semantics for blocks in a ZNS device, which requires flash-based replicated storage systems to be redesigned for ZNS compatibility.
Based on our experiences and setbacks when designing, implementing, and evaluating systems based on Optane and ZNS, we propose three guidelines to assist developers in designing storage systems on new and emerging persistent storage technologies: (1) systems, even those expected to serve read-heavy workloads, should prioritize optimizing write performance, (2) systems should set and fulfill performance, durability, and fault tolerance guarantees but not exceed them, as doing so may result in excessive write overheads, and (3) systems can overcome limitations of write-constrained persistent hardware by optimizing data placement and internal data flows based on assumptions about temporal and spatial locality of the expected client workload.
The first system we present is CANDStore, a highly available, cost-effective, replicated key-value store that uses Intel Optane for primary storage and solves the challenge of bottlenecked data ingestion during primary failure recovery through a novel online workload-guided recovery protocol. The second system we present is RAIZN, which provides RAID-like striping and redundancy for arrays of ZNS SSDs and solves the various challenges that arise from the lack of overwrite semantics in ZNS. We describe how the above guidelines arose from the setbacks and successes during the development of these two systems, then apply the guidelines to extend the functionality of RAIZN to create RAIZN+. The final part of this thesis details exactly how we applied these guidelines to achieve near-zero write amplification when serving RocksDB workloads in RAIZN+.
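To make the loss of overwrite semantics mentioned above concrete, the following is a minimal sketch of a single zone as a toy in-memory model. It is not RAIZN's implementation and not a real NVMe or ZNS API: the names `Zone` and `update_block` are hypothetical and introduced only for illustration. The sketch assumes just the basic zoned-storage rule that a zone is written sequentially at its write pointer and must be reset before any block in it can be rewritten.

```python
class Zone:
    """Toy model of one zone (hypothetical, for illustration only)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = []  # len(self.blocks) plays the role of the write pointer

    @property
    def write_pointer(self):
        return len(self.blocks)

    def write(self, lba, data):
        # Writes are accepted only at the write pointer: no in-place overwrite.
        if lba != self.write_pointer:
            raise ValueError("non-sequential write: zone must be reset first")
        if self.write_pointer >= self.capacity:
            raise ValueError("zone is full")
        self.blocks.append(data)

    def reset(self):
        # Discards every block in the zone and rewinds the write pointer.
        self.blocks.clear()


def update_block(zone, lba, data):
    """Naively emulate an in-place update by rewriting the whole zone.

    Returns the number of blocks physically written.
    """
    contents = list(zone.blocks)
    contents[lba] = data
    zone.reset()
    for i, block in enumerate(contents):
        zone.write(i, block)
    return len(contents)
```

In this toy model, changing one block in a zone that already holds N blocks costs N block writes under the naive approach, which gives a rough sense of why systems built around overwrites need to be redesigned for ZNS and why write amplification is a central concern.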
KEYWORDS: Storage systems, replicated storage, distributed storage, persistent memory, nonvolatile main memory, zoned namespaces, ZNS, SSD, flash memory
FULL THESIS: pdf