Elie Krevat*, Tomer Shiran*, Eric Anderson†, Joseph Tucek†, Jay J. Wylie†, Gregory R. Ganger*
*Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
†HP Labs
New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the theoretic ideal performance of a parallel dataflow system. The model exposes the inefficiency of popular scale-out systems, which take 3–13× longer to complete jobs than the hardware should allow, even in well-tuned systems used to achieve record-breaking benchmark results. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model's ideal can be approached (i.e., that it is not wildly optimistic), coming within 10–14% of the model's prediction. Moreover, guided by the model, we present analysis of inefficiencies which exposes issues in both the disk and networking subsystems that will be faced by any DISC system built atop standard OS and networking services.
KEYWORDS: data-intensive computing, cloud computing, performance and efficiency
FULL TR: pdf