HeART (Heterogeneity-Aware Redundancy Tuner) is an online tool that guides the exploitation of reliability heterogeneity among disks to reduce the space overhead (and hence the cost) of data reliability. HeART uses failure data observed over time to empirically quantify each disk group’s reliability characteristics and to determine minimum-capacity redundancy settings that achieve specified target data reliability levels. The HeART project is exploring potential space savings, online approaches to annualized failure rate (AFR) and change-point determination, data placement and redistribution schemes for minimizing and bounding performance overheads, and best-practice integration into existing distributed storage systems (e.g., HDFS).
Large cluster storage systems almost always include a heterogeneous mix of storage devices, even when using devices that are all of the same technology type. Commonly, this heterogeneity arises from incremental deployment and per-acquisition optimization of the makes/models acquired. As a result, a given cluster storage system can easily include several makes/models, each in substantial quantity. Different makes/models can have substantially different reliabilities, in addition to the well-known differences in capacity and performance. For example, Fig. 2 shows the average annualized failure rates (AFRs) during the useful life (stable operation period) of the 6 HDD makes/models that make up more than 90% of the cluster storage system used for the Backblaze backup service [1]. The highest failure rate is over 3.5X greater than the lowest, and no two are the same. Another recent study has shown that different Flash SSD makes/models similarly exhibit substantial failure rate differences.
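As a rough illustration of how per-make/model AFRs can be estimated from observed failure data, the sketch below uses the common definition of AFR as failures per disk-year of operation, expressed as a percentage. The function names and the counts in `observations` are hypothetical placeholders, not Backblaze data or HeART code.

```python
def afr_percent(failures, disk_days):
    """AFR (%) = failures per disk-year of observed operation, times 100."""
    if disk_days <= 0:
        raise ValueError("disk_days must be positive")
    return 100.0 * failures * 365.0 / disk_days


def afr_by_group(observations):
    """Map each disk group (e.g. a make/model) to its estimated AFR in percent.

    `observations` maps a group name to a (failures, disk_days) pair.
    """
    return {
        group: afr_percent(failures, disk_days)
        for group, (failures, disk_days) in observations.items()
    }


# Hypothetical counts for three make/model groups (illustrative only).
observations = {
    "model-A": (10, 365_000),   # 10 failures over 1,000 disk-years
    "model-B": (35, 365_000),   # 35 failures over 1,000 disk-years
    "model-C": (12, 219_000),   # 12 failures over 600 disk-years
}

afrs = afr_by_group(observations)
spread = max(afrs.values()) / min(afrs.values())
for group, afr in sorted(afrs.items()):
    print(f"{group}: AFR = {afr:.2f}%")
print(f"highest/lowest AFR ratio: {spread:.1f}x")
```

An estimate like this is only meaningful once a group has accumulated enough disk-days; the sketch does not attempt confidence intervals or detection of the stable operation period.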
Despite such differences, cluster storage redundancy is generally configured as if all of the devices have the same reliability. Unfortunately, this approach leads to configurations that are overly resource-consuming, overly risky, or both. For example, if redundancy settings are configured to achieve a given data reliability target (e.g., a specific mean time to data loss (MTTDL)) based on the highest annualized failure rate (AFR) of any device make/model of any allowed age, then too much space will be used for redundancy associated with data that is stored fully on lower-AFR makes/models. If redundancy settings for all data are instead based on lower AFRs, then data stored fully on higher-AFR devices is not sufficiently protected to achieve the data reliability target. By robustly estimating per-disk-group AFRs and selecting the best redundancy settings for each group, HeART enables more cost-effective data reliability for cluster storage systems and avoids the space inefficiency of one-size-fits-all redundancy schemes, offering large potential cost savings.
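The trade-off described above can be sketched with a simple model. The example below is not HeART's actual method: it assumes the textbook Markov-model approximation for the MTTDL of a single n-disk stripe that tolerates m failures, MTTF^(m+1) / (n(n-1)...(n-m) * MTTR^m), treats MTTF as roughly 1/AFR years, and uses a hypothetical 24-hour repair time, a hypothetical per-stripe target, and a hypothetical list of candidate (k, n) erasure-coding schemes.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def mttdl_years(afr, n, m, repair_hours):
    """Approximate per-stripe MTTDL in years (textbook approximation).

    afr is a fraction per year (0.01 means 1%); MTTF is taken as 1/afr years.
    """
    mttf_hours = HOURS_PER_YEAR / afr
    # n * (n-1) * ... * (n-m): orderings of the m+1 failures that lose data.
    combos = prod(n - i for i in range(m + 1))
    return mttf_hours ** (m + 1) / (combos * repair_hours ** m) / HOURS_PER_YEAR


def cheapest_scheme(afr, target_mttdl_years, schemes, repair_hours=24.0):
    """Return the (k, n) scheme with the lowest space overhead n/k that still
    meets the target, or None if no candidate is reliable enough."""
    feasible = [
        (k, n) for (k, n) in schemes
        if mttdl_years(afr, n, n - k, repair_hours) >= target_mttdl_years
    ]
    return min(feasible, key=lambda s: s[1] / s[0]) if feasible else None


# Hypothetical candidates as (data chunks k, total chunks n).
schemes = [(6, 9), (10, 14), (12, 15)]
target = 1e10  # hypothetical per-stripe MTTDL target, in years

print(cheapest_scheme(0.01, target, schemes))  # low-AFR group  -> (12, 15)
print(cheapest_scheme(0.04, target, schemes))  # high-AFR group -> (10, 14)
```

Under these assumptions, the 1% AFR group meets the target with 1.25x space overhead, while the 4% AFR group needs 1.4x. Applying the 1.4x setting to all data would waste space on the more reliable disks, and applying the 1.25x setting to all data would leave data on the less reliable disks under-protected.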
FACULTY
GRAD STUDENTS
Sai Kiriti Badam
Jiaan Dai
Saurabh Kadekodi
Francisco Maturana
Juncheng (Jason) Yang
Jiongtao Ye
Xuren Zhou
Jiaqi Zuo
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg, Datadog, Google, Honda, Intel Corporation, IBM, Jane Street, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.