Invited contribution to the Workshop on Reliability Analysis of System Failure Data (RAF'07), MSR Cambridge, UK, March 2007.
Bianca Schroeder, Garth A. Gibson
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
System reliability is a major challenge in system design. Unreliable systems are not only a major source of user frustration, they are also expensive. The cost of avoiding downtime and the cost of actual downtime together make up more than 40% of the total cost of ownership for modern IT systems. Unfortunately, with the large component count in today’s large-scale systems, failures are quickly becoming the norm rather than the exception.
This submission describes an effort currently underway at CMU to create a public Computer Failure Data Repository (CFDR), sponsored by USENIX. The goal of the repository is to accelerate research on system reliability by filling the nearly empty collection of public data with detailed failure data from a variety of large production systems. We give a brief overview of the data sets we have collected so far, and discuss our ongoing efforts and the long-term goals of the CFDR.