PARALLEL DATA LAB 

PDL Talk Series

August 7, 2024


TIME: 12:00 noon to approximately 1:00 pm EDT
PLACE: Virtual; a Zoom link will be emailed closer to the seminar


SPEAKER: Suhas Jayaram Subramanya
PhD Student, Carnegie Mellon

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
Large GPU clusters are becoming increasingly heterogeneous due to advances in GPU design and the incremental deployment of a mix of GPU types over time. Deep learning (DL) training jobs running on these clusters can see widely varying job completion times depending on the resources allocated by the cluster scheduler and the job hyper-parameters configured by users at submission time. Sia is a cluster scheduler that (1) efficiently assigns heterogeneous GPU resources to elastic, resource-adaptive DL training jobs, and (2) configures job hyper-parameters to maintain high training efficiency for all running jobs without sacrificing the quality of trained models.

We will discuss the challenges of optimizing resource adaptivity for deep learning training (DLT) jobs on large clusters with many GPU types, and introduce a new scheduling formulation that efficiently matches DLT jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time. On job traces derived from real datacenters, Sia improves job completion times by 30-93% while using 12-60% fewer GPU-hours. Furthermore, its scheduling policy is quick to evaluate and scales easily to clusters with many GPU types and thousands of GPUs.
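The core matching idea can be illustrated with a toy sketch: different jobs run at different speeds on different GPU types, so a heterogeneity-aware scheduler should pick the assignment that maximizes aggregate throughput under capacity constraints. All job names, GPU types, and throughput numbers below are hypothetical, and this brute-force search stands in for Sia's actual (much richer) policy, which also adapts GPU counts and job configurations over time.

```python
from itertools import product

# Hypothetical per-job training throughput (samples/sec) on each GPU type.
# These numbers are illustrative only, not measurements from the talk.
throughput = {
    "jobA": {"V100": 100, "A100": 250},  # jobA speeds up a lot on A100
    "jobB": {"V100": 90,  "A100": 120},  # jobB benefits much less
}
capacity = {"V100": 1, "A100": 1}  # one GPU of each type available


def best_assignment(throughput, capacity):
    """Brute-force the job-to-GPU-type matching that maximizes total
    throughput while respecting per-type GPU counts. A toy stand-in for
    a heterogeneity-aware scheduling policy."""
    jobs = list(throughput)
    gpu_types = list(capacity)
    best, best_score = None, -1
    for assign in product(gpu_types, repeat=len(jobs)):
        # Skip assignments that oversubscribe any GPU type.
        used = {g: 0 for g in gpu_types}
        for g in assign:
            used[g] += 1
        if any(used[g] > capacity[g] for g in gpu_types):
            continue
        score = sum(throughput[j][g] for j, g in zip(jobs, assign))
        if score > best_score:
            best, best_score = dict(zip(jobs, assign)), score
    return best, best_score


# The heterogeneity-aware choice gives the A100 to the job that gains
# most from it: jobA -> A100, jobB -> V100 (total 340 samples/sec),
# beating the reverse assignment (total 220 samples/sec).
print(best_assignment(throughput, capacity))
```

A real scheduler cannot enumerate all assignments at cluster scale; part of what makes a formulation like Sia's interesting is that it remains quick to evaluate as the number of GPU types, GPUs, and jobs grows.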

BIO: Suhas is a final-year PhD student in the CS Department, advised by Prof. Greg Ganger. His primary research area is deep learning systems.


CONTACTS


Director, Parallel Data Lab
VOICE: (412) 268-1297


Executive Director, Parallel Data Lab
VOICE: (412) 268-5485


PDL Administrative Manager
VOICE: (412) 268-6716