USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).
Soila Pertet, Rajeev Gandhi and Priya Narasimhan
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
Replicated systems are often hosted over underlying
group communication protocols that provide totally ordered,
reliable delivery of messages. In the face of a
performance problem at a single node, these protocols
can cause correlated performance degradations even at
non-faulty nodes, leading to potential red herrings in failure
diagnosis. We propose a fingerpointing approach that
performs node-level (local) anomaly detection, followed
by system-wide (global) fingerpointing. The local anomaly
detection relies on threshold-based analyses of system
metrics, while global fingerpointing is based on the
hypothesis that the root cause of the failure is the node
with an “odd-man-out” view of the anomalies. We compare
the results of applying three classifiers – a heuristic
algorithm, an unsupervised learner (k-means clustering),
and a supervised learner (k-nearest-neighbor) – to fingerpoint
the faulty node.
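
The two-stage idea above can be illustrated with a minimal sketch. It is not the heuristic, k-means, or k-nearest-neighbor classifier evaluated in this work. It assumes that a node's "view" is simply the set of metrics that exceed a fixed threshold on that node, and that the faulty node is the single node whose view differs from a view shared by all other nodes. The function names, metric names, and threshold values are hypothetical.

```python
from collections import Counter

def local_anomalies(metrics, thresholds):
    """Local stage: return the set of metric names whose value exceeds its threshold."""
    return frozenset(name for name, value in metrics.items()
                     if value > thresholds[name])

def odd_man_out(views):
    """Global stage: return the one node whose view differs from the view
    shared by every other node, or None if no such single node exists."""
    counts = Counter(views.values())
    if len(counts) != 2:
        # Either all nodes agree, or there are too many distinct views to decide.
        return None
    odd = [node for node, view in views.items() if counts[view] == 1]
    return odd[0] if len(odd) == 1 else None

# Hypothetical thresholds and one sample of metrics per node (assumed values).
thresholds = {"cpu": 0.9, "msg_latency_ms": 50}
samples = {
    "node1": {"cpu": 0.95, "msg_latency_ms": 80},
    "node2": {"cpu": 0.40, "msg_latency_ms": 75},
    "node3": {"cpu": 0.35, "msg_latency_ms": 90},
}

views = {node: local_anomalies(m, thresholds) for node, m in samples.items()}
suspect = odd_man_out(views)
print(suspect)
```

In this sample every node shows a latency anomaly, which is the correlated degradation described above, but only node1 also shows a CPU anomaly, so it is the one flagged. With only two nodes that disagree, the sketch returns None because neither view is in the minority.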