Aaron Harlap, Gregory R. Ganger, Phillip B. Gibbons
Carnegie Mellon University
The TierML parameter server system for machine learning (ML) enables aggressive exploitation of transient, revocable resources to complete model training more cheaply and/or faster. Many shared computing clusters allow users to utilize excess idle resources at lower cost or priority, with the proviso that some or all may be taken away at any time (e.g., the Amazon EC2 spot market often provides such resources at a 90% discount). Unlike other parameter server systems, TierML exploits such transient resources, using minimal non-transient resources to efficiently adapt to bulk additions and revocations of transient machines. Our evaluations show that TierML reduces cost by ≈75% relative to non-transient pricing and by 46%-50% relative to using transient resources with checkpointing to address bulk changes, while nearly matching or decreasing running times.
KEYWORDS: Big Data infrastructure, Big Learning systems
FULL TR: pdf