TierML: Using Tiers of Reliability for Agile Elasticity in Machine Learning

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-16-102. May 2016.

Aaron Harlap, Gregory R. Ganger, Phillip B. Gibbons

Carnegie Mellon University


The TierML parameter server system for machine learning (ML) enables aggressive exploitation of transient revocable resources to complete model training cheaper and/or faster. Many shared computing clusters allow users to utilize excess idle resources at lower cost or priority, with the proviso that some or all may be taken away at any time (e.g., the Amazon EC2 spot market often provides such resources at a 90% discount). Unlike other parameter server systems, TierML exploits such transient resources, using minimal non-transient resources to efficiently adapt to bulk additions and revocations of transient machines. Our evaluations show that TierML reduces cost by ≈75% relative to non-transient pricing and by 46%-50% relative to using transient resources with checkpointing to address bulk changes, while nearly matching or decreasing running times.
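
As a rough illustration of why a small reliable tier plus a large transient tier can be cheap, the sketch below computes the cost of a fixed-length job on a mixed cluster. It is not the report's cost model: the function names, the machine counts, the unit price, and the job length are all hypothetical, and only the 90% discount figure comes from the abstract. It ignores revocations, re-execution, and any change in running time, so it does not reproduce the ≈75% result reported above.

```python
def cluster_cost(n_reliable, n_transient, hours, on_demand_price, transient_discount):
    """Cost of running a mixed cluster for a fixed number of hours.

    Assumption: reliable machines pay the full on-demand price, while
    transient machines pay that price reduced by a flat discount.
    """
    transient_price = on_demand_price * (1.0 - transient_discount)
    return hours * (n_reliable * on_demand_price + n_transient * transient_price)


def savings_vs_on_demand(n_reliable, n_transient, hours, on_demand_price, transient_discount):
    """Fractional savings relative to running every machine at the on-demand price."""
    total_machines = n_reliable + n_transient
    baseline = cluster_cost(total_machines, 0, hours, on_demand_price, transient_discount)
    mixed = cluster_cost(n_reliable, n_transient, hours, on_demand_price, transient_discount)
    return 1.0 - mixed / baseline


if __name__ == "__main__":
    # Hypothetical setup: 4 reliable + 60 transient machines, 10 hours,
    # 1.0 cost unit per machine-hour, 90% transient discount.
    print(cluster_cost(4, 60, 10, 1.0, 0.9))
    print(savings_vs_on_demand(4, 60, 10, 1.0, 0.9))
```

Under these assumed inputs the savings depend mainly on how few reliable machines are kept, which is consistent with the abstract's emphasis on using minimal non-transient resources.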

KEYWORDS: Big Data infrastructure, Big Learning systems

FULL TR: pdf




© 2017. Last updated 18 May, 2016