MapReduce'11, June 8, 2011, San Jose, California, USA
Kai Ren, Julio López, Garth A. Gibson
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Frameworks for large-scale data-intensive applications, such as Hadoop and Dryad, have gained tremendous popularity. Understanding the resource requirements of these frameworks and the performance characteristics of distributed applications is inherently difficult. We present an approach, based on resource attribution, that aims at facilitating performance analyses of distributed data-intensive applications. This approach is embodied in Otus, a monitoring tool to attribute resource usage to jobs and services in Hadoop clusters. Otus collects and correlates performance metrics from distributed components and provides views that display time-series of these metrics filtered and aggregated using multiple criteria. Our evaluation shows that this approach can be deployed without incurring major overheads. Our experience with Otus in a production cluster suggests its effectiveness at helping users and cluster administrators with application performance analysis and troubleshooting.
KEYWORDS: Resource Attribution, Metrics Correlation, Data-Intensive Systems, Monitoring.