USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA (April 2007).
Soila Pertet, Rajeev Gandhi and Priya Narasimhan
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
http://www.pdl.cmu.edu/
                      
Replicated systems are often hosted over underlying group communication
protocols that provide totally ordered, reliable delivery of messages. When a
single node has a performance problem, these protocols can cause correlated
performance degradations even at non-faulty nodes, leading to potential red
herrings in failure diagnosis. We propose a fingerpointing approach that
performs node-level (local) anomaly detection, followed by system-wide (global)
fingerpointing. The local anomaly detection relies on threshold-based analyses
of system metrics, while global fingerpointing is based on the hypothesis that
the root cause of the failure is the node with an "odd-man-out" view of the
anomalies. We compare the results of applying three classifiers to fingerpoint
the faulty node: a heuristic algorithm, an unsupervised learner (k-means
clustering), and a supervised learner (k-nearest-neighbor).
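
The two-stage idea can be illustrated with a minimal sketch. This is not the paper's algorithm: the metric names, thresholds, and the use of Hamming distance to pick the "odd-man-out" are all assumptions made for illustration. Each node flags the metrics that exceed a threshold (local detection), and the node whose anomaly view differs most from the others is fingerpointed (global step).

```python
from typing import Dict, List


def local_anomalies(metrics: Dict[str, float],
                    thresholds: Dict[str, float]) -> List[int]:
    """Local step: one 0/1 flag per metric, 1 when the value exceeds its threshold.

    Metrics are visited in sorted-name order so every node yields a
    comparable vector. Names and thresholds are hypothetical.
    """
    return [1 if metrics[name] > thresholds[name] else 0
            for name in sorted(thresholds)]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of positions at which two anomaly vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def odd_man_out(views: Dict[str, List[int]]) -> str:
    """Global step: the node whose anomaly view differs most from all others.

    Assumed heuristic: largest summed Hamming distance to the other nodes.
    Ties are resolved in favor of the first node in iteration order.
    """
    def total_distance(node: str) -> int:
        return sum(hamming(views[node], other)
                   for name, other in views.items() if name != node)
    return max(views, key=total_distance)


# Hypothetical thresholds and per-node metric samples.
THRESHOLDS = {"cpu": 0.9, "ctx_switches": 1000, "latency_ms": 50}
SAMPLES = {
    "node1": {"cpu": 0.40, "ctx_switches": 500, "latency_ms": 80},
    "node2": {"cpu": 0.50, "ctx_switches": 600, "latency_ms": 75},
    "node3": {"cpu": 0.97, "ctx_switches": 1500, "latency_ms": 90},
}

views = {node: local_anomalies(m, THRESHOLDS) for node, m in SAMPLES.items()}
suspect = odd_man_out(views)
```

In this made-up data all three nodes see elevated latency (the correlated degradation), but only node3 also exceeds the CPU and context-switch thresholds, so its view is the odd one out. A clustering or nearest-neighbor classifier could replace `odd_man_out` while operating on the same anomaly vectors.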