SMPFRAME: A Distributed Framework for Scheduled Model Parallel Machine Learning

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-15-103. April 2015.

Jin Kyu Kim, Qirong Ho*, Seunghak Lee, Xun Zheng, Wei Dai, Garth Gibson, Eric Xing

Carnegie Mellon University,
* Institute for Infocomm Research A*STAR


When machine learning (ML) is applied to big data, existing distributed systems commonly share and update all ML model parameters at each machine using a partition of the data, a strategy known as data-parallelism. An alternative and complementary strategy, model-parallelism, partitions the model parameters for non-shared parallel access and update, periodically repartitioning them to facilitate communication. Model-parallelism is motivated by two challenges that data-parallelism does not usually address: (1) parameters may be dependent, so naive concurrent updates can introduce errors that slow convergence or even cause algorithm failure; (2) model parameters converge at different rates, so a small subset of parameters can bottleneck ML algorithm completion. We propose scheduled model parallelism (SMP), a programming approach in which the selection of parameters to be updated (the schedule) is explicitly separated from the parameter update logic. The schedule can improve ML algorithm convergence speed by planning for parameter dependencies and uneven convergence. To support SMP at scale, we develop an archetype software framework, SMPFRAME, which optimizes the throughput of SMP programs, and we benchmark four common ML applications written as SMP programs: LDA topic modeling, matrix factorization, sparse least-squares (Lasso) regression, and sparse logistic regression. By improving ML progress per iteration through SMP programming while improving iteration throughput through SMPFRAME, we show that SMP programs running on SMPFRAME outperform non-model-parallel ML implementations: for example, SMP LDA and SMP Lasso achieve 10x and 5x faster convergence, respectively, than recent, well-established baselines.
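
The separation of schedule and update logic described above can be illustrated with a minimal single-process sketch, using Lasso coordinate descent as the example. This is not SMPFRAME's API: the function names (`schedule`, `update`, `run_smp_lasso`), the cosine-similarity dependency check, and the use of each parameter's most recent change as its priority are all assumptions made for illustration. The parallel workers are simulated by computing every update in a batch from the same parameter snapshot.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used by Lasso coordinate descent."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def correlation(u, v):
    """Cosine similarity of two feature columns (assumed dependency measure).

    Assumes neither column is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def schedule(priorities, columns, batch_size, max_corr):
    """Schedule step: choose which parameters to update this round.

    Visits parameters from highest to lowest priority and skips any
    parameter whose column is strongly correlated with one already
    chosen, so that dependent parameters are not updated concurrently.
    """
    order = sorted(range(len(priorities)), key=lambda j: -priorities[j])
    chosen = []
    for j in order:
        if all(abs(correlation(columns[j], columns[k])) <= max_corr
               for k in chosen):
            chosen.append(j)
        if len(chosen) == batch_size:
            break
    return chosen


def update(j, beta, columns, y, lam):
    """Update step: coordinate-descent value for parameter j.

    Minimizes 0.5 * ||y - X beta||^2 + lam * ||beta||_1 over beta[j]
    with all other parameters held at their values in `beta`.
    """
    n = len(y)
    partial_residual = [
        y[i] - sum(columns[k][i] * beta[k]
                   for k in range(len(beta)) if k != j)
        for i in range(n)
    ]
    rho = sum(columns[j][i] * partial_residual[i] for i in range(n))
    col_norm_sq = sum(x * x for x in columns[j])
    return soft_threshold(rho, lam) / col_norm_sq


def run_smp_lasso(columns, y, lam, rounds=50, batch_size=2, max_corr=0.1):
    """Alternate schedule and update steps for a fixed number of rounds."""
    beta = [0.0] * len(columns)
    priorities = [1.0] * len(columns)  # every parameter starts equally urgent
    for _ in range(rounds):
        batch = schedule(priorities, columns, batch_size, max_corr)
        # All updates in a batch read the same snapshot of beta,
        # imitating workers that run in parallel without sharing state.
        new_values = {j: update(j, beta, columns, y, lam) for j in batch}
        for j, value in new_values.items():
            # Parameters that moved a lot are revisited sooner.
            priorities[j] = abs(value - beta[j]) + 1e-6
            beta[j] = value
    return beta
```

Because `schedule` is the only place that decides which parameters run together, a different selection policy (for example round-robin, or a different dependency test) can be substituted without touching `update`.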

KEYWORDS: Big Data infrastructure, Big Machine Learning systems, Model-Parallel Machine Learning

FULL TR: pdf




© 2017. Last updated 3 May, 2015