Parallel Data Laboratory

Talks by Recent PDL Departures - Summer 2022

Speakers

SPEAKER	TALK/ABSTRACT
Benjamin Berg, Assistant Professor, University of North Carolina	A New Methodology for Parallel Job Scheduling
Huaicheng Li, Assistant Professor, Virginia Tech	Towards Predictable and Efficient Datacenter Storage
Lin Ma, Assistant Professor, University of Michigan	Putting Your Database on Autopilot: Self-driving Database Management Systems

SPEAKER: Benjamin Berg, Assistant Professor, University of North Carolina

A New Methodology for Parallel Job Scheduling - Talk Video
Modern computer systems allow resources to be dynamically allocated to parallelizable jobs. When a job is parallelized across many servers or cores, it will complete more quickly. However, jobs typically receive diminishing returns from being allocated additional resources. Hence, given a fixed number of cores, it is not obvious how to dynamically allocate cores to a stream of incoming jobs in order to minimize the overall mean response time across jobs.

For example, an optimal allocation policy must favor shorter jobs, but favoring any single job too heavily can cause the system to operate very inefficiently. Additionally, an optimal policy must decide how much, if at all, to favor more parallelizable jobs over less parallelizable jobs. In a variety of settings, we show how to derive an optimal allocation policy which minimizes mean response time. We then show that policies inspired by our theoretical results can be implemented in a modern database to reduce mean response time by a factor of 2.
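The tradeoff described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the method from the talk: all jobs are present at time 0, job sizes are known, cores can be split fractionally, and a job running on k cores proceeds at rate k^p for some 0 < p <= 1. The speedup function, the two policies, and all function names are hypothetical choices made for illustration.

```python
def speedup(k, p):
    # Assumed speedup model: rate k**p on k cores (sublinear when p < 1).
    return k ** p if k > 0 else 0.0

def all_to_shortest(remaining, cores):
    # Favor one job heavily: every core goes to the job with least work left.
    shortest = min(remaining, key=remaining.get)
    return {j: (cores if j == shortest else 0.0) for j in remaining}

def equal_split(remaining, cores):
    # Divide the cores evenly among all unfinished jobs.
    return {j: cores / len(remaining) for j in remaining}

def mean_response_time(sizes, cores, p, policy):
    # Event-driven simulation: re-allocate cores at each job completion.
    remaining = dict(enumerate(sizes))
    now, total = 0.0, 0.0
    while remaining:
        alloc = policy(remaining, cores)
        rates = {j: speedup(alloc[j], p) for j in remaining}
        # Time until the next job finishes under the current allocation.
        dt = min(remaining[j] / rates[j] for j in remaining if rates[j] > 0)
        now += dt
        for j in list(remaining):
            remaining[j] -= rates[j] * dt
            if remaining[j] <= 1e-12:
                total += now
                del remaining[j]
    return total / len(sizes)

if __name__ == "__main__":
    sizes, cores = [1.0, 2.0, 4.0], 16
    for p in (0.5, 1.0):
        a = mean_response_time(sizes, cores, p, all_to_shortest)
        b = mean_response_time(sizes, cores, p, equal_split)
        print(f"p={p}: all-to-shortest={a:.4f}, equal-split={b:.4f}")
```

Under these assumptions, giving every core to the shortest job wins when jobs parallelize perfectly (p = 1), while the even split wins when returns diminish sharply (p = 0.5). This is the tension between favoring short jobs and keeping the system efficient.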

BIO: I am a recently graduated Ph.D. student in the Computer Science Department at Carnegie Mellon University, where I was advised by Mor Harchol-Balter. I was a recipient of the Facebook Graduate Fellowship. I am thrilled to announce that I will be starting as an Assistant Professor in the Computer Science Department at the University of North Carolina at Chapel Hill in the Fall of 2022. I plan to continue to root for the Duke Blue Devils (my alma mater), and to troll my students whenever possible.

SPEAKER: Huaicheng Li, Assistant Professor, Virginia Tech

Towards Predictable and Efficient Datacenter Storage - Talk Video
The increasing complexity of storage software and hardware brings new challenges to achieving predictable performance and efficiency. On the one hand, emerging hardware breaks long-held system design principles and is held back by aged and inflexible system interfaces and usage models, requiring a radical rethinking of the software stack to leverage new hardware capabilities for optimal performance. On the other hand, the computing landscape is becoming increasingly heterogeneous and complex, demanding explicit systems-level support to manage hardware-associated complexity and idiosyncrasy, which is unfortunately still largely missing.

In this talk, I will discuss my efforts to build low-latency and cost-efficient datacenter storage systems. By revisiting existing storage interface/abstraction designs and software/hardware responsibility divisions, I will present holistic storage stack designs for cloud datacenters, which deliver orders of magnitude of latency improvement and significantly improved cost-efficiency.

BIO: I am an Assistant Professor in the Computer Science department at Virginia Tech (since Fall 2022). My group focuses on fundamental Computer Systems research in the areas of Operating Systems, Storage Systems, Memory Systems, and Systems Architecture. We analyze/benchmark, hack, design, and build systems to explore better systems support for modern/emerging {compute,storage,memory} x {hardware,interfaces,applications} for improved {performance,resource efficiency,programmability}. Prior to this, I was a postdoc at CMU in the Parallel Data Lab (PDL). I received my Ph.D. from University of Chicago. My interests are mainly in Operating Systems and Storage Systems, with a focus on building high-performance and cost-efficient storage infrastructure for datacenters. My research has been recognized by two best paper nominations at FAST (2017 and 2018) and has also made real impact, with production deployment in datacenters, code integration to Linux, and a storage research platform widely used by the research community.

SPEAKER: Lin Ma, Assistant Professor, University of Michigan

Putting Your Database on Autopilot: Self-driving Database Management Systems - Talk Video
Database management systems (DBMSs) are essential for modern data-driven applications. However, they are notoriously difficult to deploy and administer because they have many aspects that one can change that affect their performance, including database physical design and system configuration. There are existing methods that recommend how to change these aspects of databases for an application. But most of them still require humans to make final decisions on what changes to apply and when to apply them. Thus, DBMS administration today remains onerous and costly.

In this talk, I present a self-driving DBMS architecture that enables automatic system management and removes the administration impediments. Our approach consists of three frameworks inspired by self-driving car architectures: (1) workload forecasting, (2) behavior modeling, and (3) action planning. The workload forecasting framework predicts the query arrival rates under varying database workload patterns using an ensemble of time-series forecasting models. The behavior modeling framework constructs fine-grained machine learning models that predict the runtime behavior of the DBMS. Lastly, the action planning framework generates a sequence of optimization actions based on these forecasted workload patterns and behavior model estimations. It uses receding horizon control and Monte Carlo tree search to effectively approximate the complex optimization problem.

Our forecasting-modeling-planning architecture enables an autonomous DBMS that proactively plans for optimization actions without expensive testing. It automatically applies the actions at proper times, holistically controls all system aspects, and provides explanations on its decisions.
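The receding horizon control idea mentioned in the abstract can be sketched in a few lines. This is a toy illustration under loud assumptions, not the system from the talk: the forecast is simply given as a list of arrival rates (no forecasting ensemble), the behavior model is a made-up cost formula rather than a learned model, the only tunable is a single integer "level" (for example, a worker count), and the planner enumerates action sequences exhaustively in place of Monte Carlo tree search. All names and constants are hypothetical.

```python
from itertools import product

ACTIONS = (-1, 0, 1)   # hypothetical actions: lower, keep, or raise the knob
CHANGE_COST = 0.2      # assumed fixed cost of applying a change

def predicted_cost(rate, level):
    # Stand-in for a behavior model: load-driven latency plus resource cost.
    return rate / level + 0.5 * level

def plan(level, forecast, horizon):
    # Search every action sequence over the next `horizon` intervals and
    # return the one with the lowest total predicted cost.
    window = forecast[:horizon]
    best_cost, best_seq = float("inf"), None
    for seq in product(ACTIONS, repeat=len(window)):
        lvl, cost = level, 0.0
        for action, rate in zip(seq, window):
            lvl = max(1, lvl + action)
            cost += predicted_cost(rate, lvl) + CHANGE_COST * abs(action)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq

def run(level, forecast, horizon=3):
    # Receding horizon loop: plan over the window, apply only the first
    # action, then re-plan from the new state at the next interval.
    levels = []
    for t in range(len(forecast)):
        first_action = plan(level, forecast[t:], horizon)[0]
        level = max(1, level + first_action)
        levels.append(level)
    return levels

if __name__ == "__main__":
    print(run(1, [8.0] * 6))
```

Re-planning at every interval is what makes the horizon "recede": only the first action of each plan is ever applied, so later steps can adapt if the forecast changes. Exhaustive search grows as 3^horizon, which is one reason a real system would need an approximate search method.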

BIO: I am an incoming assistant professor in the Computer Science and Engineering Division at the University of Michigan, Ann Arbor, starting Fall 2023. I am currently working on the Delta Lake/Lakehouse at Databricks. I graduated from Carnegie Mellon University (CMU) with a PhD in Computer Science, fortunately advised by Andy Pavlo. My research interests lie at the intersection of database management systems (DBMSs) and machine learning (ML), especially using ML/AI techniques to automate database administration/tuning to remove human impediments. My PhD research focused on the architecture design of autonomous DBMSs, implemented in an in-memory relational DBMS built at CMU. I finished my Bachelor’s degree in Computer Science and Technology at Peking University, where I worked on data and information management with Prof. Bin Cui.

CONTACTS

PDL Co-Director
RMCIC 2311

PDL Co-Director
(412) 268-3064
GHC 9109

Executive Director, Parallel Data Lab
VOICE: (412) 268-5485

PDL Administrative Manager
VOICE: (412) 268-6716