Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102. April 2014.
Raja R. Sambasivan, Rodrigo Fonseca^, Ilari Shafer*, Gregory R. Ganger
Carnegie Mellon University
^Brown University
*Microsoft
End-to-end tracing captures the workflow of causally-related activity (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for management tasks like diagnosis and resource accounting. Drawing upon our experiences building and using end-to-end tracing infrastructures, this paper distills the key design axes that dictate trace utility for important use cases. Developing tracing infrastructures without explicitly understanding these axes and the choices available for each will likely result in infrastructures that are not useful for their intended purposes. In addition to identifying the design axes, this paper identifies good design choices for various tracing use cases, contrasts them with choices made by previous tracing implementations, and shows where prior implementations fall short. It also identifies remaining challenges on the path to making tracing an integral part of distributed system design.
KEYWORDS: Cloud computing, Distributed systems, Design, End-to-end tracing
FULL TR: pdf