20th IEEE International Symposium on Software Reliability Engineering (ISSRE), Industrial Track, Mysuru, India, Nov 2009.
Xinghao Pan*, Jiaqi Tan*, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan
Parallel Data Laboratory
School of Computer Science & Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
*DSO National Laboratories Singapore
Google’s MapReduce framework enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller (Map and Reduce) tasks and a massive data-set into smaller partitions, such that each task processes a different partition in parallel. However, performance problems in a distributed MapReduce system can be hard to diagnose and to localize to a specific node or set of nodes. On the other hand, a system in which a large number of nodes perform similar tasks naturally affords opportunities to observe it from multiple viewpoints.
We present a “Blind Men and the Elephant” (Blimey) framework that exploits this structure, and demonstrate how problems in a MapReduce system can be diagnosed by corroborating the multiple viewpoints. More specifically, we present algorithms within the Blimey framework based on OS-level performance counters, on white-box metrics extracted from logs, and on application-level heartbeats. We show that our Blimey algorithms capture a variety of faults, including resource hogs and application hangs, and localize each fault to a subset of slave nodes in the MapReduce system. In addition, we discuss how the outcomes of the diagnostic algorithms can be further synthesized through a repeated application of the Blimey approach. Finally, we present a simple supervised learning technique that identifies a fault if it has been previously observed.
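To make the idea of corroborating viewpoints concrete, the following is a minimal sketch and not the algorithms from the paper. It assumes each viewpoint yields one numeric reading per slave node, flags a node when its reading sits far from the peer median, and reports only nodes flagged by at least two viewpoints. The function names, metric names, node names, and thresholds are all hypothetical.

```python
from statistics import median

def flag_outliers(metric_by_node, threshold=3.0):
    """Flag nodes whose reading is far from the peer median.

    Distance is measured in units of the median absolute deviation (MAD).
    This peer-comparison rule is an assumption made for illustration.
    """
    values = list(metric_by_node.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = mad if mad > 0 else 1e-9  # avoid dividing by zero when peers agree exactly
    return {n for n, v in metric_by_node.items() if abs(v - med) / scale > threshold}

def corroborate(viewpoints, min_votes=2):
    """Return nodes flagged by at least `min_votes` independent viewpoints."""
    votes = {}
    for metric_by_node in viewpoints.values():
        for node in flag_outliers(metric_by_node):
            votes[node] = votes.get(node, 0) + 1
    return {node for node, count in votes.items() if count >= min_votes}

# Hypothetical readings from three viewpoints on four slave nodes.
viewpoints = {
    "cpu_user_pct": {"n1": 40, "n2": 42, "n3": 41, "n4": 95},       # OS-level counter
    "map_task_secs": {"n1": 30, "n2": 31, "n3": 29, "n4": 120},     # extracted from logs
    "heartbeat_gap_secs": {"n1": 3, "n2": 3, "n3": 9, "n4": 3},     # application heartbeat
}
suspects = corroborate(viewpoints)
```

In this made-up data, n4 is an outlier in two viewpoints and is reported, while n3 is an outlier in only one and is not. Requiring agreement between viewpoints is one simple way to trade some sensitivity for fewer false alarms.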
FULL TR: pdf