5th Workshop on Extreme-Scale Storage and Analysis (ESSA 2024), May 2024.
M. Dorier†, P. Carns†, R. Ross†, S. Snyder†, R. Latham†, A. Gueroudji†, G. Amvrosiadis, C. Cranor, J. Soumagne‡
Carnegie Mellon University
† Argonne National Laboratory
‡ Intel Corporation
High-performance computing (HPC) applications and workflows are increasingly making use of custom data services to complement traditional parallel file systems with fast transient data management capabilities tailored to application-specific needs. In the Mochi project we provide methodologies and tools that enable rapid development of custom HPC data services, including a collection of composable software components that can be combined to build complex distributed data services. Our initial version of Mochi targeted data services deployed with static configurations with a fixed number of nodes and minimal fault tolerance. However, there is a growing need for dynamic services that can adapt while running in response to changing workloads and system conditions.
In this paper we present our work to extend the Mochi architecture to support the development of dynamic data services. We achieve this by providing new Mochi components that support unified bootstrapping and online reconfiguration, fault detection, monitoring, and consensus. We also provide a methodology for deriving service-wide resilience from the resilience of each of the service’s components.