Parallel Data Laboratory

PDL Abstract

Data Caching for Enterprise-Grade Petabyte-Scale OLAP

2024 USENIX Annual Technical Conference. July 10–12, 2024 • Santa Clara, CA, USA.

Chunxu Tang¹, Bin Fan¹, Jing Zhao², Chen Liang², Yi Wang¹, Beinan Wang¹, Ziyue Qiu³,², Lu Qiu¹, Bowen Ding¹, Shouzhuo Sun¹, Saiguang Che¹, Jiaming Mai¹, Shouwei Chen¹, Yu Zhu¹, Jianjian Xie¹, Yutian (James) Sun⁴, Yao Li², Yangjun Zhang², Ke Wang⁴, and Mingmin Chen²

¹Alluxio, Inc.,
²Uber, Inc.,
³Carnegie Mellon University,
⁴Meta, Inc.

http://www.pdl.cmu.edu/

With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these challenges, this paper introduces the Alluxio local (edge) cache, a highly effective architectural optimization tailored for such environments. This embeddable cache, optimized for petabyte-scale data analytics, leverages local SSD resources to alleviate network I/O and API call pressures, significantly improving data transfer efficiency. Integrated with OLAP systems like Presto and storage services like HDFS, the Alluxio local cache has demonstrated its effectiveness in handling large-scale, enterprise-grade workloads over three years of deployment at Uber and Meta. We share insights and operational experiences in implementing these optimizations, providing valuable perspectives on managing modern, massive-scale OLAP workloads.
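
As an illustration only, the read-through idea behind a compute-side local cache can be sketched in Python as follows. This is not the Alluxio implementation: the class name `LocalPageCache`, the `remote_read` callable, the fixed page size, and the LRU eviction policy are all assumptions made for this example, and a temporary directory stands in for the local SSD. The sketch only shows how repeated reads of hot data can be served locally so that fewer calls reach remote storage.

```python
import hashlib
import os
import tempfile
from collections import OrderedDict


class LocalPageCache:
    """Hypothetical read-through cache that stores fixed-size pages as local files."""

    def __init__(self, cache_dir, remote_read, page_size=4, capacity_pages=2):
        self.cache_dir = cache_dir          # stand-in for a local SSD mount
        self.remote_read = remote_read      # callable(path, offset, length) -> bytes
        self.page_size = page_size
        self.capacity_pages = capacity_pages
        self.pages = OrderedDict()          # (path, page_index) -> local file, LRU order
        self.remote_calls = 0               # how many reads went to remote storage

    def _local_name(self, path, page_index):
        # Hash the remote path so any path maps to a flat, safe file name.
        digest = hashlib.sha1(path.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, f"{digest}.{page_index}")

    def _get_page(self, path, page_index):
        key = (path, page_index)
        if key in self.pages:
            # Hit: mark as most recently used and serve from local disk.
            self.pages.move_to_end(key)
            with open(self.pages[key], "rb") as f:
                return f.read()
        # Miss: fetch the whole page remotely, then persist it locally.
        self.remote_calls += 1
        data = self.remote_read(path, page_index * self.page_size, self.page_size)
        local = self._local_name(path, page_index)
        with open(local, "wb") as f:
            f.write(data)
        self.pages[key] = local
        if len(self.pages) > self.capacity_pages:
            # Evict the least recently used page and delete its file.
            _, victim = self.pages.popitem(last=False)
            os.remove(victim)
        return data

    def read(self, path, offset, length):
        """Return `length` bytes starting at `offset`, assembled from cached pages."""
        if length <= 0:
            return b""
        first = offset // self.page_size
        last = (offset + length - 1) // self.page_size
        blob = b"".join(self._get_page(path, i) for i in range(first, last + 1))
        start = offset - first * self.page_size
        return blob[start:start + length]


if __name__ == "__main__":
    # In-memory dictionary standing in for a remote store such as HDFS.
    remote = {"/warehouse/part-0": b"abcdefghijkl"}

    def remote_read(path, offset, length):
        return remote[path][offset:offset + length]

    with tempfile.TemporaryDirectory() as cache_dir:
        cache = LocalPageCache(cache_dir, remote_read)
        cache.read("/warehouse/part-0", 2, 4)   # cold read: pages fetched remotely
        cache.read("/warehouse/part-0", 0, 8)   # warm read: served from local files
        print("remote calls:", cache.remote_calls)
```

Caching at page granularity rather than whole files is a design choice of this sketch: it lets a small, fragmented read populate only the pages it touches, which suits the skewed and fragmented access patterns the abstract describes.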

