PARALLEL DATA LAB 

PDL Abstract

Applying Idealized Lower-Bound Runtime Models to
Understand Inefficiencies in Data-Intensive Computing
(Extended Abstract)

SIGMETRICS'11, June 7–11, 2011, San Jose, California, USA.

Elie Krevat*, Tomer Shiran*, Eric Anderson†, Joseph Tucek†, Jay J. Wylie†, Gregory R. Ganger*

*Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213

†HP Labs

http://www.pdl.cmu.edu/

"Data-intensive scalable computing" (DISC) refers to a rapidly growing style of computing characterized by its reliance on large and expanding datasets [3]. Driven by the desire and capability to extract insight from such datasets, DISC is quickly emerging as a major activity of many organizations. Map-reduce style programming frameworks such as MapReduce [4] and Hadoop [1] support DISC activities by providing abstractions and frameworks to more easily scale data-parallel computations over commodity machines.

In the pursuit of scale, popular map-reduce frameworks neglect efficiency as an important metric. Anecdotal experience indicates that they achieve neither balance nor full goodput of hardware resources, effectively wasting a large fraction of the computers over which jobs are scaled. If these inefficiencies are real, the same work could be completed at much lower cost. An ideal run would provide maximum scalability for a given computation without wasting resources. Given the widespread use and scale of DISC systems, it is important that we move closer to frameworks that are "hardware-efficient," in which the framework provides sufficient parallelism to keep the bottleneck resource fully utilized and makes good use of all I/O components.
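To make this notion concrete, an idealized lower-bound runtime model of the kind named in the title can be sketched as a maximum over per-resource service times. The form below is an illustrative sketch under simple assumptions, and the symbols are ours rather than necessarily the paper's notation: a job that reads D_in bytes, shuffles D_shuffle bytes, and performs W_cpu units of compute on n identical nodes can finish no faster than its most heavily loaded resource allows:

    T_{ideal} = \max\left( \frac{D_{in}}{n \, B_{disk}}, \; \frac{D_{shuffle}}{n \, B_{net}}, \; \frac{W_{cpu}}{n \, R_{cpu}} \right)

where B_disk, B_net, and R_cpu denote per-node disk bandwidth, network bandwidth, and compute rate. Under such a model, the ratio of a job's measured runtime to T_ideal indicates how far a framework falls short of keeping its bottleneck resource fully utilized.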

EXTENDED ABSTRACT: pdf