IEEE/IFIP Conference on Dependable Systems and Networks (DSN), June 2012.
Soila P. Kavulya, Scott Daniels (AT&T), Kaustubh Joshi (AT&T), Matti Hiltunen (AT&T), Rajeev Gandhi, Priya Narasimhan
Parallel Data Laboratory
Carnegie Mellon University
Pittsburgh, PA 15213
Chronics are recurrent problems that often fly under the radar of operations teams because they do not affect enough users or service invocations to set off alarm thresholds. In contrast with major outages that are rare, often have a single cause, and as a result are relatively easy to detect and diagnose quickly, chronic problems are elusive because they are often triggered by complex conditions, persist in a system for days or weeks, and coexist with other problems active at the same time. In this paper, we present Draco, a scalable engine to diagnose chronics that addresses these issues by using a "top-down" approach that starts by heuristically identifying user interactions that are likely to have failed, e.g., dropped calls, and drills down to identify groups of properties that best explain the difference between failed and successful interactions by using a scalable Bayesian learner. We have deployed Draco in production for the VoIP operations of a major ISP. In addition to providing examples of chronics that Draco has helped identify, we show via a comprehensive evaluation on production data that Draco provided 97% coverage, had fewer than 4% false positives, and outperformed state-of-the-art diagnostic techniques by up to 56% for complex chronics.
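The top-down idea described in the abstract can be illustrated with a minimal sketch. It is not Draco's actual learner: it assumes each interaction is a set of attribute strings plus a failed/successful label, and it scores single attributes (not groups) by the difference between their smoothed occurrence probabilities in failed and successful interactions. The function name `rank_attributes`, the Beta(1, 1) prior, and the sample call records are all hypothetical.

```python
from collections import Counter


def rank_attributes(interactions, alpha=1.0, beta=1.0):
    """Rank attributes by how much more often they occur in failed
    interactions than in successful ones.

    interactions: list of (attributes: set[str], failed: bool).
    alpha, beta: Beta prior pseudo-counts (an assumption of this sketch).
    """
    n_fail = sum(1 for _, failed in interactions if failed)
    n_ok = len(interactions) - n_fail

    fail_counts, ok_counts = Counter(), Counter()
    for attrs, failed in interactions:
        (fail_counts if failed else ok_counts).update(attrs)

    scores = {}
    for attr in set(fail_counts) | set(ok_counts):
        # Posterior mean of P(attribute present | outcome) under a Beta prior.
        p_fail = (fail_counts[attr] + alpha) / (n_fail + alpha + beta)
        p_ok = (ok_counts[attr] + alpha) / (n_ok + alpha + beta)
        scores[attr] = p_fail - p_ok

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical call records: (attributes, failed?)
calls = [
    ({"server=s1", "phone=A"}, True),
    ({"server=s1", "phone=B"}, True),
    ({"server=s2", "phone=A"}, False),
    ({"server=s2", "phone=B"}, False),
    ({"server=s1", "phone=A"}, True),
    ({"server=s2", "phone=A"}, False),
]

ranked = rank_attributes(calls)
for attr, score in ranked:
    print(f"{attr}: {score:+.2f}")
```

In this toy data, `server=s1` ranks first because it appears in every failed call and no successful one, while the phone attributes score zero. A fuller drill-down would then condition on the top attribute and repeat the ranking to build up a group of properties.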