Practical Experiences with Chronics Discovery in Large Telecommunications Systems
Workshop on System Logs and the Application of Machine Learning Techniques (SLAML), Cascais, Portugal, October 2011.
Soila P. Kavulya, Kaustubh Joshi*, Matti Hiltunen*, Scott Daniels*, Rajeev Gandhi, Priya Narasimhan
Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
*AT&T Labs, Research
Chronics are recurrent problems that fly under the radar of operations teams because they do not perturb the system enough to set off alarms or violate service-level objectives. The discovery and diagnosis of never-before-seen chronics pose new challenges: traditional threshold-based techniques do not detect them, and many chronics can be present in a system at once, all starting and ending at different times. In this paper, we describe our experiences diagnosing chronics using server logs on a large telecommunications service. Our technique uses a scalable Bayesian distribution learner coupled with an information-theoretic measure of distance (KL divergence) to identify the attributes that best distinguish failed calls from successful calls. Our preliminary results demonstrate the usefulness of our technique through actual instances where we helped operators discover and diagnose chronics.
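To make the idea of ranking attributes by KL divergence concrete, the following is a minimal sketch and not the implementation described in the paper. It assumes each call record has been reduced to a dictionary of categorical attributes plus a failed/successful label. As a stand-in for the paper's Bayesian learner, it estimates how often each attribute value occurs in failed and in successful calls using a Beta(1, 1) prior (posterior mean), then scores each attribute value by the KL divergence between the two resulting Bernoulli distributions. The function names, the attribute names ("server", "codec"), and the sample data are all hypothetical.

```python
import math
from collections import Counter


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats; requires 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def rank_attributes(calls, alpha=1.0, beta=1.0):
    """Rank (attribute, value) pairs by how well they separate failed calls.

    calls: iterable of (attrs, failed), where attrs is a dict of categorical
    attributes and failed is a bool. alpha and beta are the Beta prior
    parameters (an assumption of this sketch; both must be positive).
    """
    failed_counts, success_counts = Counter(), Counter()
    n_failed = n_success = 0
    for attrs, failed in calls:
        pairs = set(attrs.items())
        if failed:
            n_failed += 1
            failed_counts.update(pairs)
        else:
            n_success += 1
            success_counts.update(pairs)

    scores = {}
    for pair in set(failed_counts) | set(success_counts):
        # Posterior mean of the occurrence probability in each class.
        p = (failed_counts[pair] + alpha) / (n_failed + alpha + beta)
        q = (success_counts[pair] + alpha) / (n_success + alpha + beta)
        scores[pair] = bernoulli_kl(p, q)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample: failures concentrate on server "s2".
calls = (
    [({"server": "s1", "codec": "g711"}, False)] * 6
    + [({"server": "s3", "codec": "g711"}, False)] * 5
    + [({"server": "s2", "codec": "g711"}, False)] * 1
    + [({"server": "s2", "codec": "g711"}, True)] * 4
)
ranked = rank_attributes(calls)
for (name, value), score in ranked:
    print(f"{name}={value}: KL={score:.3f}")
```

In this toy data the attribute value shared by all calls (the codec) scores near zero, while the server that carries most of the failures scores highest. The prior keeps the estimates away from 0 and 1, so the divergence stays finite even for values seen in only one class.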