Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-11-109, April 2011.
Soila P. Kavulya†, Kaustubh Joshi§, Matti Hiltunen§, Scott Daniels§, Rajeev Gandhi†, Priya Narasimhan†
†Carnegie Mellon University
§AT&T Labs - Research
Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Large-scale integrated services such as VoIP running over IP networks are the future of telecommunications. The high availability requirements of such services demand scalable techniques for rapid diagnosis and localization of user-visible failures. However, state-of-the-art network event correlation techniques often produce alarms that cannot easily be correlated to customer-visible impacts because they work in a "bottom-up" fashion, starting from device-level events and working upwards. In this paper, we develop a contrasting "top-down" approach to problem diagnosis that starts from user-visible defects such as call drops and works downwards by identifying the network-level elements that are the most suggestive of the defects. Our prototype, called Draco, uses statistical comparisons between good and bad system behavior to identify the underlying causes of problems without the need for any expert-provided rules or models, and without any prior training. This allows Draco to localize the causes of problems that have never been seen before. We have deployed Draco at scale for a portion of the VoIP operations of a major ISP. We demonstrate Draco's usefulness by providing examples of actual instances in which Draco helped operators diagnose service issues.
KEYWORDS: diagnosis, distributed systems, scalable, VoIP networks
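The "top-down" idea in the abstract, comparing good and bad calls to find the network-level elements most suggestive of a defect, can be illustrated with a minimal sketch. This is not Draco's actual algorithm, which the abstract does not specify. The scoring rule (the difference between how often an attribute appears in failed calls and in successful calls), the function name `rank_suspects`, and the attribute strings such as `gateway=gw1` are all assumptions made for illustration.

```python
from collections import Counter


def rank_suspects(calls):
    """Rank attributes by how much more often they appear in failed calls.

    `calls` is an iterable of (attributes, failed) pairs, where `attributes`
    is a collection of strings and `failed` is a bool. The score is an
    assumed stand-in for a statistical comparison:
        P(attribute | failed) - P(attribute | successful)
    """
    failed_counts = Counter()
    ok_counts = Counter()
    n_failed = 0
    n_ok = 0
    for attributes, failed in calls:
        if failed:
            n_failed += 1
            failed_counts.update(set(attributes))
        else:
            n_ok += 1
            ok_counts.update(set(attributes))

    # Without both good and bad calls there is nothing to compare.
    if n_failed == 0 or n_ok == 0:
        return []

    scores = {}
    for attr in set(failed_counts) | set(ok_counts):
        scores[attr] = failed_counts[attr] / n_failed - ok_counts[attr] / n_ok

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical call records: every failed call traverses gateway gw1.
example_calls = [
    ({"gateway=gw1", "server=s1"}, True),
    ({"gateway=gw1", "server=s2"}, True),
    ({"gateway=gw1", "server=s1"}, True),
    ({"gateway=gw2", "server=s1"}, False),
    ({"gateway=gw2", "server=s2"}, False),
    ({"gateway=gw2", "server=s1"}, False),
]

ranked = rank_suspects(example_calls)
for attr, score in ranked:
    print(f"{attr}: {score:+.2f}")
```

In this made-up data the gateway attribute separates failed from successful calls, while the server attributes appear equally often in both groups and so score zero. No rules, models, or training data are involved; the ranking comes only from contrasting the two groups of calls.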