Chapter in "Resilience Assessment and Evaluation". Editors: Katinka Wolter, Alberto Avritzer, Marco Vieira, Aad van Moorsel. Springer Verlag, December 2012.
Soila P. Kavulya, Kaustubh Joshi (AT&T), Felicita Di Giandomenico (ISTI-CNR, Pisa, Italy), Priya Narasimhan
Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Failure diagnosis is the process of identifying the causes of impairment in a system's function based on observable symptoms, i.e., determining which fault led to an observed failure. Since multiple faults can often lead to very similar symptoms, failure diagnosis is often the first line of defense when things go wrong, a prerequisite before any corrective actions can be undertaken. The results of diagnosis also provide data about a system's operational fault profile for use in offline resilience evaluation. While diagnosis has historically been a largely manual process requiring significant human input, techniques to automate as much of the process as possible have grown significantly in importance in many industries, including telecommunications, internet services, automotive systems, and aerospace. This chapter presents a survey of automated failure diagnosis techniques, including both model-based and model-free approaches. Industrial applications of these techniques in the above domains are presented, and finally, future trends and open challenges in the field are discussed.