Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-107. July 2010. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-103.
Raja R. Sambasivan, Alice X. Zheng†, Elie Krevat, Spencer Whitman, Michael Stroucken,
William Wang, Lianghong Xu, Gregory R. Ganger
Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
†Microsoft Research
The causes of performance changes in a distributed system often elude even its developers. This paper develops a new technique for gaining insight into such changes: comparing system behaviours from two executions (e.g., of two system versions or time periods). Building on end-to-end request flow tracing within and across components, algorithms are described for identifying and ranking changes in the flow and/or timing of request processing. The implementation of these algorithms in a tool called Spectroscope is described and evaluated. Five case studies are presented of using Spectroscope to diagnose performance changes in a distributed storage system caused by code changes and configuration modifications, demonstrating the value and efficacy of comparing system behaviours.
KEYWORDS: browsing & visualizing system behaviour, comparing system behaviours, end-to-end tracing,
performance debugging, performance problem diagnosis, response-time mutations, request-flow graphs,
statistical hypothesis testing, structural mutations, structural performance problems