Proceedings of the 12th IEEE/IFIP Network Operations and Management Symposium (NOMS) 2010, Osaka, Japan, Apr 2010.
Jiaqi Tan, Xinghao Pan, DSO National Laboratories, Singapore
Eugene Marinelli, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan, Carnegie Mellon Univ., USA
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
We present Kahuna, an approach for diagnosing performance problems in MapReduce systems. Kahuna rests on our observation of peer-similarity: in the absence of performance problems, nodes behave alike, so a node that behaves differently is the likely culprit of a performance problem. We apply this insight in techniques, and their algorithms, that statistically compare black-box (OS-level performance metrics) and white-box (Hadoop log statistics) data across the nodes of a MapReduce cluster in order to identify the faulty node(s). We also present empirical evidence for our peer-similarity observations from the 4000-processor Yahoo! M45 Hadoop cluster. In addition, we demonstrate Kahuna’s effectiveness through an experimental evaluation of two algorithms on a number of reported performance problems, using four different workloads in a 100-node Hadoop cluster running on Amazon’s EC2 infrastructure.
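To make the peer-similarity idea concrete, the following is a minimal sketch of one way a single metric could be compared across nodes. It is not one of the two algorithms evaluated in the paper, which the abstract does not specify. The function name `flag_outlier_nodes`, the median/MAD scoring rule, the threshold of 3.0, and the sample CPU values are all assumptions chosen for illustration.

```python
from statistics import median


def flag_outlier_nodes(metric_by_node, threshold=3.0):
    """Return the nodes whose metric deviates strongly from their peers.

    Hypothetical helper: scores each node by its distance from the
    cluster-wide median, scaled by the median absolute deviation (MAD).
    This is an assumed scoring rule, not the paper's algorithm.
    """
    values = list(metric_by_node.values())
    center = median(values)
    # MAD: the typical distance of a node from the cluster median.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the nodes sit exactly on the median, so any
        # node that differs at all is treated as deviating.
        return [n for n, v in metric_by_node.items() if v != center]
    return [
        n for n, v in metric_by_node.items()
        if abs(v - center) / mad > threshold
    ]


# Made-up CPU utilisation samples (percent) for a five-node cluster.
cpu_percent = {
    "node1": 41.0,
    "node2": 39.5,
    "node3": 40.2,
    "node4": 93.7,
    "node5": 40.8,
}
suspects = flag_outlier_nodes(cpu_percent)
print(suspects)
```

With these sample values only `node4` is flagged, since the other four nodes lie close to the cluster median. Median and MAD are used here instead of mean and standard deviation so that one faulty node does not distort the baseline it is compared against.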