HotDep '10. October 3, 2010, Vancouver, BC, Canada.
Michael P. Kasick, Rajeev Gandhi, Priya Narasimhan
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
We present a behavior-based problem-diagnosis approach for PVFS that analyzes a novel source of instrumentation (CPU instruction-pointer samples and function-call traces) to localize the faulty server and to enable root-cause analysis of the resource at fault. We validate our approach by injecting realistic storage and network problems into three different workloads (dd, IOzone, and PostMark) on a PVFS cluster.