PROBLEM ANALYSIS: Extended Overview

We are exploring methodologies and algorithms for automating the analysis of failures and performance degradations in large-scale systems. Problem analysis includes such crucial tasks as identifying which component(s) misbehaved and the likely root causes, diagnosing performance problems, and providing supporting evidence for any conclusions. Combining statistical tools with appropriate instrumentation, we hope to dramatically reduce the difficulty of analyzing performance and reliability problems in deployed storage systems. Such tools, integrated with automated reaction logic, also provide an essential building block for the longer-term goal of self-healing.

Automating problem analysis is crucial to achieving cost-effective systems at the scales needed for tomorrow’s high-end computing. The number of hardware and software components in such systems will make problems common rather than anomalous, so it must be possible to quickly move from problem to fix with little to no system downtime for analysis.

Further, the complexity of such distributed software systems makes by-hand analysis increasingly untenable. A subtler but perhaps more serious concern is that implementors of scalable applications (e.g., parallel storage) are increasingly unable to test in representative high-end computing environments, because they cannot afford to replicate the necessary system scale. As a result, scale-related problems must be analyzed in the field before improvements can be made, which introduces delays and reduces productivity for customers and users. Clearance requirements for systems deployed to support highly sensitive activities must also be taken into consideration. Current designs and tools fall far short of what is needed.

Currently, we are developing techniques for understanding the trade-offs associated with instrumentation and algorithms for hands-off problem analysis.

Two interrelated research challenges are evident. First, statistical tools will play a crucial role in accurate problem diagnosis and analysis schemes; the difficulty will be to understand which tools work most effectively in which situations. Second, the impact of instrumentation detail on the effectiveness of those tools must be well understood to justify the associated instrumentation costs. Both efforts will require extensive experimentation and a deep understanding of real case studies.
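
To make the first challenge concrete, the sketch below shows one simple statistical tool of the kind that might be evaluated: peer comparison, which flags a component whose typical latency deviates strongly from that of its peers. It is an illustration only and not a method described by this project. The function name, the input layout (component name mapped to latency samples in milliseconds), and the threshold of 3.5 are all assumptions; the score is the standard modified z-score based on the median absolute deviation.

```python
from statistics import median


def flag_outlier_components(latencies_ms, threshold=3.5):
    """Return the names of components whose median latency is a peer outlier.

    latencies_ms: hypothetical input, a dict mapping a component name to a
    list of latency samples in milliseconds.
    threshold: assumed cutoff on the modified z-score (illustrative value).
    """
    # Summarize each component by its median latency.
    per_component = {name: median(samples) for name, samples in latencies_ms.items()}

    # Robust center and spread across the peer group.
    center = median(per_component.values())
    mad = median(abs(value - center) for value in per_component.values())
    if mad == 0:
        # Half or more of the peers sit at the center; no usable spread.
        return []

    flagged = []
    for name, value in per_component.items():
        # Modified z-score: 0.6745 scales MAD to match a normal std. deviation.
        score = 0.6745 * (value - center) / mad
        if abs(score) > threshold:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    samples = {
        "server-a": [10, 11, 12],
        "server-b": [9, 10, 11],
        "server-c": [11, 12, 13],
        "server-d": [48, 50, 55],  # hypothetical misbehaving server
    }
    print(flag_outlier_components(samples))
```

The second challenge shows up even in this small sketch: the function needs per-component latency samples, so instrumentation that records only a system-wide aggregate could not support it, while finer-grained tracing would cost more to collect.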




© 2017. Last updated 12 March, 2012