Carnegie Mellon University Parallel Data Lab Ph.D. Dissertation. CMU-PDL-13-105, May 2013.
Raja R. Sambasivan
Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Diagnosing performance problems in modern datacenters and distributed systems is challenging, as the root cause could be contained in any one of the system's numerous components or, worse, could be a result of interactions among them. As distributed systems continue to increase in complexity, diagnosis tasks will only become more challenging. There is a need for a new class of diagnosis techniques capable of helping developers address problems in these distributed environments.
As a step toward satisfying this need, this dissertation proposes a novel technique, called request-flow comparison, for automatically localizing the sources of performance changes from the myriad potential culprits in a distributed system to just a few potential ones. Request-flow comparison works by contrasting the workflow of how individual requests are serviced within and among every component of the distributed system between two periods: a non-problem period and a problem period. By identifying and ranking performance-affecting changes, request-flow comparison provides developers with promising starting points for their diagnosis efforts. Request workflows are obtained with less than 1% overhead via use of recently developed end-to-end tracing techniques.
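The core idea of contrasting request workflows across two periods can be illustrated with a minimal sketch. It is not Spectroscope's actual algorithm. It assumes each traced request has been reduced to a pair of (ordered list of components visited, end-to-end latency in milliseconds), groups requests by identical path, and ranks changes by an estimate of the total extra time they account for in the problem period. The function names, the data layout, the component names, and the simple mean-latency comparison are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean


def group_by_flow(requests):
    """Group request latencies by the exact sequence of components visited."""
    groups = defaultdict(list)
    for path, latency_ms in requests:
        groups[tuple(path)].append(latency_ms)
    return groups


def rank_changes(non_problem, problem):
    """Rank performance-affecting changes between two periods (illustrative only).

    A flow absent from the non-problem period is treated as a structural
    change whose cost is all the time spent in it. A flow present in both
    periods is treated as a timing change whose cost is the increase in
    mean latency multiplied by the number of problem-period requests.
    """
    before = group_by_flow(non_problem)
    after = group_by_flow(problem)
    ranked = []
    for flow, latencies in after.items():
        if flow not in before:
            ranked.append((sum(latencies), "new flow", flow))
            continue
        delta = mean(latencies) - mean(before[flow])
        if delta > 0:  # ignore flows that did not get slower
            ranked.append((delta * len(latencies), "slower", flow))
    # Largest estimated contribution to the slowdown first.
    ranked.sort(key=lambda entry: entry[0], reverse=True)
    return ranked


if __name__ == "__main__":
    # Hypothetical traces: (components visited, latency in ms).
    non_problem = [
        (["client", "nfs", "metadata"], 10.0),
        (["client", "nfs", "metadata"], 12.0),
        (["client", "nfs", "storage"], 20.0),
        (["client", "nfs", "storage"], 22.0),
    ]
    problem = [
        (["client", "nfs", "metadata"], 11.0),
        (["client", "nfs", "metadata"], 11.0),
        (["client", "nfs", "storage"], 50.0),
        (["client", "nfs", "storage"], 54.0),
        (["client", "nfs", "metadata", "storage"], 30.0),
    ]
    for cost, kind, flow in rank_changes(non_problem, problem):
        print(f"{cost:8.1f} ms  {kind:9s}  {' -> '.join(flow)}")
```

In this toy data the storage flow ranks first because its mean latency rose from 21 ms to 52 ms across two requests, followed by the flow that appears only in the problem period. A real tool would also need statistical tests to separate genuine changes from normal variance, which this sketch omits.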
To demonstrate the utility of request-flow comparison in various distributed systems, this dissertation describes its implementation in a tool called Spectroscope and describes how Spectroscope was used to diagnose real, previously unsolved problems in the Ursa Minor distributed storage service and in select Google services. It also explores request-flow comparison's applicability to the Hadoop File System. Via a 26-person user study, it identifies effective visualizations for presenting request-flow comparison's results and further demonstrates that request-flow comparison helps developers quickly identify starting points for diagnosis. This dissertation also distills design choices that will maximize an end-to-end tracing infrastructure's utility for diagnosis tasks and other use cases.
KEYWORDS: distributed systems, performance diagnosis, request-flow comparison
FULL DISSERTATION: pdf