DATE: Thursday, November 4, 2004
TIME: Noon - 1 pm
PLACE: Hamerschlag Hall D-210
SPEAKER:
Armando Fox
Stanford University
TITLE:
Recovery as Rapid Adaptation:
Combining Fast Microrecovery with Statistical Monitoring
ABSTRACT:
We began the Recovery-Oriented Computing (ROC) project with the goal of
increasing Internet server availability by reducing time to recovery.
Building on the observation that rebooting or restarting is a well-known
and simple form of recovery that returns systems or subsystems to a
"clean slate", we proposed to design systems specifically so that the
only shutdown method is crashing and the only recovery method is fast
reboot; we called this approach crash-only software. Having designed
three crash-only systems, we find that cheap recovery, while indeed good
for its own sake in improving availability, also enables
"micro-recovery" as a first line of defense: rather than complex error
unwinding, coerce any observed error to a (micro-)crash, then
(micro-)recover. If micro-recovery is sufficiently cheap in performance
and does not impact correctness, there's no reason to avoid trying it
first, even if it does not always solve the problem. This in turn
enables the use of automated, aggressive detection techniques that have
nontrivial false positive rates or, equivalently, the deployment of
multiple overlapping detectors/alarms in order to be conservative. Fast, cheap
micro-recovery also allows more liberal use of rejuvenation, such as
so-called "rolling reboots", without worrying about the "best" time to
do it. We have also found that cheap recovery allows some
maintenance operations such as incremental scaling of storage to be
recast as failure plus recovery, exploiting the same mechanisms as
recovery to achieve online scaling without service interruption.
In this talk I'll describe highlights and design lessons from three
crash-only systems we've built, including experiments using statistical
anomaly detection techniques (with nontrivial false positive rates) as a
complementary monitoring strategy. I'll also discuss how this approach
might provide a scientific basis for designing tolerant applications in
the face of imperfect detection and localization techniques.
More at http://crash.stanford.edu and http://swig.stanford.edu/public/projects/roc/
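As a rough illustration of the policy described in the abstract (coerce any
observed error to a micro-crash, micro-recover, and escalate only if that
does not help), here is a minimal Python sketch. It is not taken from the
ROC systems: the Component class, its micro_reboot method, and
call_with_microrecovery are hypothetical names invented for this example,
and the fault model (a fixed number of transient failures) is an assumption.

```python
class Component:
    """Hypothetical crash-only component: a restart rebuilds all state."""

    def __init__(self, name, failures_before_healthy=0):
        self.name = name
        self.restarts = 0
        # Assumed fault model: each micro-reboot clears one transient fault.
        self._failures_left = failures_before_healthy

    def handle(self, request):
        if self._failures_left > 0:
            raise RuntimeError(f"{self.name}: transient fault")
        return f"{self.name} handled {request}"

    def micro_reboot(self):
        # Cheap recovery: discard in-memory state and start from a clean slate.
        self.restarts += 1
        self._failures_left = max(0, self._failures_left - 1)


def call_with_microrecovery(component, request, max_micro_reboots=2):
    """Treat any error as a micro-crash, micro-recover, then retry."""
    for attempt in range(max_micro_reboots + 1):
        try:
            return component.handle(request)
        except Exception:
            if attempt == max_micro_reboots:
                # Micro-recovery did not help; let a coarser-grained
                # recovery mechanism take over.
                raise
            component.micro_reboot()


if __name__ == "__main__":
    cart = Component("cart", failures_before_healthy=1)
    print(call_with_microrecovery(cart, "GET /items"))
    print("micro-reboots used:", cart.restarts)
```

Because the micro-reboot is assumed to be cheap and to leave correctness
intact, the caller does not try to diagnose the error before acting, which
is also why a detector with false positives would be tolerable here.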
BIO:
Armando Fox (fox@cs.stanford.edu) has been an Assistant Professor at
Stanford since January 1999. He has focused on improving system
dependability through fast recovery, and was listed among the "Scientific American 50" of 2003 for his work in that area. Prof. Fox
has also received teaching awards from the Associated Students of
Stanford University Teaching, Tau Beta Pi, and the Society of Women
Engineers. His other degrees in EECS are from MIT and the University of
Illinois.
Host: David Garlan
SDI / LCS Seminar Questions?
Karen Lindenfelser, 86716, or visit www.pdl.cmu.edu/SDI/