Aurick Qiao1,2, Abutalib Aghayev2, Weiren Yu1,3, Haoyang Chen1, Qirong Ho1, Garth A. Gibson2,4,
Eric P. Xing1,2
1 Petuum, Inc.
2 Carnegie Mellon University
3 Beihang University
4 Vector Institute
Machine Learning (ML) is an increasingly popular class of applications in the cloud and the data center, inspiring new algorithmic and systems techniques that leverage unique properties of ML applications to improve their distributed performance by orders of magnitude. However, applications built using these techniques tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of multi-tenant environments. Existing distributed frameworks are either inelastic or offer programming models that are incompatible with the techniques employed by high-performance ML applications.
Motivated by these trends, we present Litz, an elastic framework supporting distributed ML applications. We categorize the wide variety of techniques employed by these applications into three general themes, namely stateful workers, model scheduling, and relaxed consistency, which are collectively supported by Litz's programming model. Our implementation of Litz's execution system transparently provides elasticity while imposing low overhead.
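To make the three themes concrete, the following is a minimal, hypothetical sketch of how they might fit together in code. It is not Litz's actual API: all names (ParamServer, Worker, run_microtask, schedule) are invented for illustration, and the "gradients" are toy computations. The sketch shows a worker that retains local state across micro-tasks (stateful workers), a driver-side function that decides which model partition each step updates (model scheduling), and a parameter store whose reads are allowed to lag behind writers by a bounded number of clock ticks (relaxed consistency).

```python
# Hypothetical sketch of the three themes; all names are invented for
# illustration and are NOT Litz's actual API.
import random


class ParamServer:
    """Toy parameter store with bounded staleness (relaxed consistency):
    a reader at clock t may observe values as old as t - staleness."""

    def __init__(self, dim, staleness=2):
        self.params = [0.0] * dim
        self.clock = 0
        self.staleness = staleness

    def get(self, reader_clock):
        # A real system would block until the server clock catches up to
        # reader_clock - staleness; here we simply assert the bound holds.
        assert reader_clock - self.clock <= self.staleness, "staleness bound violated"
        return list(self.params)

    def update(self, grads, lr=0.1):
        for i, g in enumerate(grads):
            self.params[i] -= lr * g
        self.clock += 1


class Worker:
    """Stateful worker: keeps its data shard and per-worker state
    (here, an iteration clock) across micro-tasks."""

    def __init__(self, shard):
        self.shard = shard  # local training data, retained across tasks
        self.clock = 0      # this worker's iteration count

    def run_microtask(self, ps, param_idx):
        params = ps.get(self.clock)
        # Toy "gradient": pull the scheduled parameter toward the shard mean.
        mean = sum(self.shard) / len(self.shard)
        grads = [0.0] * len(params)
        grads[param_idx] = params[param_idx] - mean
        ps.update(grads)
        self.clock += 1


def schedule(num_params, step):
    """Model scheduling: the driver picks which model partition to update
    each step, e.g., rotating so consecutive steps touch different parameters."""
    return step % num_params


if __name__ == "__main__":
    ps = ParamServer(dim=4)
    workers = [Worker([random.gauss(1.0, 0.1) for _ in range(10)]) for _ in range(2)]
    for step in range(8):
        for w in workers:
            w.run_microtask(ps, schedule(4, step))
    print("params:", [round(p, 3) for p in ps.params])
```

Decomposing work into short micro-tasks over explicitly scheduled model partitions is also what would make elasticity natural in a design like this: a worker's state can be checkpointed and its pending micro-tasks reassigned when resources are added or revoked.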
We implement several popular ML applications using Litz and show that they can scale in and out quickly to adapt to changing resource availability, and that a scheduler can leverage this elasticity for faster job completion and more efficient resource allocation. Lastly, we show that Litz enables elasticity without compromising performance, remaining competitive with state-of-the-art non-elastic ML frameworks.