To improve the quality of an implementation of a distributed and/or multi-threaded system, software engineers inspect code and run tests. However, the concurrent nature of such systems makes both tasks challenging. The effectiveness of code inspection depends on the engineer's ability to anticipate the scenarios in which threads could interact, and in testing, repeated executions of the same test can produce different outcomes. In practice, this nondeterminism is commonly addressed by stress testing, which repeatedly executes a test in the hope that all interesting outcomes of the test will eventually be exercised.
This project explores an alternative to stress testing called systematic testing, which controls the order in which important concurrent events occur. By doing so, the method can systematically enumerate the different orders in which those events may execute. Our systematic testing method is implemented as part of the dbug tool, which enables systematic testing of unmodified distributed and multi-threaded systems designed for POSIX-compliant operating systems.
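The idea of enumerating event orders can be illustrated with a toy model. The sketch below is not how the dbug tool is implemented; it assumes two hypothetical threads, `A` and `B`, that each perform an unsynchronized read followed by a write to increment a shared counter, and it lists every interleaving of those four events that preserves each thread's program order.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every schedule of n_a steps of thread A and n_b steps of thread B.

    A schedule is a tuple of thread ids; program order within each thread
    is preserved because only the positions of A's steps are chosen.
    """
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        slots = set(a_slots)
        yield tuple("A" if i in slots else "B" for i in range(total))


def run(schedule):
    """Replay one schedule of the hypothetical read-then-write increment."""
    shared = 0
    local = {"A": None, "B": None}
    for tid in schedule:
        if local[tid] is None:
            local[tid] = shared        # step 1: read the shared counter
        else:
            shared = local[tid] + 1    # step 2: write back the incremented value
    return shared


def explore():
    """Map every schedule to the final counter value it produces."""
    return {s: run(s) for s in interleavings(2, 2)}


if __name__ == "__main__":
    for schedule, value in explore().items():
        status = "ok" if value == 2 else "lost update"
        print("".join(schedule), value, status)
```

In this model a stress test would have to hit one of the faulty schedules by chance, whereas the enumeration visits each schedule exactly once.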
The dbug tool can be thought of as a lightweight model checker that uses both the implementation of a distributed and multi-threaded system and its test as an implicit description of the state space to be explored. In this state space, the dbug tool performs a reachability analysis that checks a number of safety properties, including the absence of 1) deadlocks, 2) conflicting non-reentrant function calls, and 3) system aborts and failures of runtime assertions inserted by the user.
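A reachability analysis for deadlocks can be sketched on a small explicit model. The code below is an illustration under stated assumptions, not the dbug tool's algorithm: each hypothetical thread is a list of `lock`/`unlock` steps, a state records every thread's program counter together with the current lock owners, and a state is reported as a deadlock when no thread is enabled although some thread has not finished. The names `OPPOSITE_ORDER`, `SAME_ORDER`, and `find_deadlocks` are invented for this example.

```python
# Hypothetical programs: each thread is a list of (operation, lock name) steps.
OPPOSITE_ORDER = {
    "T1": [("lock", "a"), ("lock", "b"), ("unlock", "b"), ("unlock", "a")],
    "T2": [("lock", "b"), ("lock", "a"), ("unlock", "a"), ("unlock", "b")],
}
SAME_ORDER = {
    "T1": [("lock", "a"), ("lock", "b"), ("unlock", "b"), ("unlock", "a")],
    "T2": [("lock", "a"), ("lock", "b"), ("unlock", "b"), ("unlock", "a")],
}


def enabled(state, programs):
    """Return the threads that can take their next step in this state."""
    pcs, owners = dict(state[0]), dict(state[1])
    runnable = []
    for tid, prog in programs.items():
        if pcs[tid] == len(prog):
            continue                      # thread has finished
        op, lock = prog[pcs[tid]]
        if op == "lock" and lock in owners:
            continue                      # blocked: the lock is held
        runnable.append(tid)
    return runnable


def step(state, tid, programs):
    """Execute the next step of thread tid and return the successor state."""
    pcs, owners = dict(state[0]), dict(state[1])
    op, lock = programs[tid][pcs[tid]]
    if op == "lock":
        owners[lock] = tid
    else:
        del owners[lock]
    pcs[tid] += 1
    return (frozenset(pcs.items()), frozenset(owners.items()))


def find_deadlocks(programs):
    """Depth-first search over reachable states; return one trace per deadlock state."""
    initial = (frozenset((tid, 0) for tid in programs), frozenset())
    seen = {initial}
    stack = [(initial, ())]
    deadlocks = []
    while stack:
        state, trace = stack.pop()
        runnable = enabled(state, programs)
        finished = all(pc == len(programs[tid]) for tid, pc in state[0])
        if not runnable and not finished:
            deadlocks.append(trace)       # nobody can move, yet work remains
            continue
        for tid in runnable:
            successor = step(state, tid, programs)
            if successor not in seen:
                seen.add(successor)
                stack.append((successor, trace + (tid,)))
    return deadlocks


if __name__ == "__main__":
    print("opposite lock order:", find_deadlocks(OPPOSITE_ORDER))
    print("same lock order:", find_deadlocks(SAME_ORDER))
```

Each returned trace is a sequence of scheduling decisions that leads to the deadlocked state, which is the kind of information that makes such an error reproducible.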
The systematic testing approach of the dbug tool has been successfully applied to a number of distributed and multi-threaded systems, including 1) the Parallel Virtual File System, 2) a distributed key-value store for FAWN, 3) student implementations of web proxy servers from the 15-213 course, and 4) the flexible transactional storage system Stasis.
dBug Approach: Design & Implementation
The work in this project is based on research supported in part by the DoE, under award number DE-FC02-06ER25767 (PDSI), by the NSF under grant CCF-1019104, and by the MSR-CMU Center for Computational Thinking.
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg, Datadog, Google, Honda, Intel Corporation, IBM, Jane Street, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.