DATE: Thursday, August 30, 2012
TIME: 4:30 - 5:30 pm
PLACE: ISTC Panther Hollow Room

SPEAKER: Chris Colohan, Google

TITLE: The "Scariest Outage Ever"

ABSTRACT:
On January 25, 2011, somewhere between 15% and 20% of Google's production serving machines lost their ability to exec() new binaries. All graphs showing the global health of Google vanished from internal dashboards, pagers fell silent, and ssh stopped working. Our SREs (site reliability engineers) were flying blind, and because the paging system was down, many had no idea a major outage was in progress. Amazingly, not a single user noticed, as our serving systems were largely unaffected.

This outage was caused by a remarkably complex chain of errors: bugs, process failures, communication failures, system failures, and design flaws. The damage was limited by a resilient system design. This talk will discuss the entire incident, how it happened, and what we learned from it.

BIO:
Chris graduated from Carnegie Mellon in 2005 and has worked for Google ever since. He has worked on the web search indexing system, MapReduce, and sort benchmarking (Chris believes he is the first person ever to sort 1 PB of data using MapReduce), and has helped develop the systems that manage all of Google's production computers.

SDI / ISTC SEMINAR QUESTIONS?
Karen Lindenfelser, 86716, or visit www.pdl.cmu.edu/SDI/