Erasure codes have been widely adopted for imparting resource-efficient resilience to storage and communication systems. Coded-computation is a field of coding theory which aims to use erasure codes to impart resilience against slowdowns and failures that occur in distributed computing systems.
Figure 1 shows an example of using coded-computation to impart resilience over the distributed computation of a function F. As depicted in the figure, coded-computation (1) encodes inputs to the computation to generate “parity inputs,” (2) performs computation F over all original and parity inputs in parallel, and (3) decodes unavailable results of computation using the available results of computation from original and parity inputs.
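The three steps can be illustrated with a minimal sketch. It assumes a simple case that is not taken from this project: F is a linear function (a fixed matrix-vector product), and the code is a plain addition code with one parity input. The names `F`, `encode`, and `decode` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. Assumptions (not from the project's code):
# F is linear, and the erasure code is a single-parity addition code.

def F(x):
    """Stand-in computation: multiply by a fixed 2x2 matrix (a linear F)."""
    a = [[2.0, 1.0], [0.0, 3.0]]
    return [sum(a[i][j] * x[j] for j in range(2)) for i in range(2)]

def encode(x1, x2):
    """Step 1: build a parity input as the elementwise sum of the inputs."""
    return [u + v for u, v in zip(x1, x2)]

def decode(f_parity, f_available):
    """Step 3: recover the missing output by subtracting the available one."""
    return [p - a for p, a in zip(f_parity, f_available)]

x1, x2 = [1.0, 2.0], [3.0, 4.0]
parity = encode(x1, x2)

# Step 2: in a real deployment these calls would run on separate workers.
f_x1 = F(x1)
f_parity = F(parity)
# Suppose the worker computing F(x2) is slow or has failed,
# so its result is unavailable.

# Because F is linear, F(x1 + x2) - F(x1) equals F(x2) exactly.
reconstructed = decode(f_parity, f_x1)
```

In this sketch one extra worker protects two computations, whereas replication would need one extra worker per computation.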
Given the ubiquity of distributed execution in modern services, such as web servers, prediction serving systems, and data analytics systems, coded-computation offers exciting potential to enable resource-efficient resilience against slowdowns and failures. However, designing erasure codes for coded-computation is fundamentally more challenging than it is for traditional applications of erasure codes because coded-computation involves computing on encoded data. As a result, current approaches toward coded-computation are only able to support highly restricted classes of computations F. This precludes the use of coded-computation in modern distributed services that would benefit from the resource-efficient resilience of erasure codes.
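A small sketch shows why computing on encoded data is restrictive. It assumes a hypothetical nonlinear F (elementwise squaring) and a simple addition code with subtraction decoding, neither of which is taken from this project. Subtraction decoding is exact only when F is linear, so here it returns the wrong result.

```python
# Illustrative sketch only. Assumptions (not from the project's code):
# F is a nonlinear function, and the code is a single-parity addition code.

def F(x):
    """Stand-in nonlinear computation: square each element."""
    return [v * v for v in x]

x1, x2 = [1.0, 2.0], [3.0, 4.0]

# Encode: the parity input is the elementwise sum of the original inputs.
parity = [u + v for u, v in zip(x1, x2)]

# Compute F over an original input and the parity input.
f_x1 = F(x1)
f_parity = F(parity)

# Naive decode: subtract the available output from the parity output.
# This would equal F(x2) if F were linear, but F(x1 + x2) != F(x1) + F(x2) here.
naive_reconstruction = [p - a for p, a in zip(f_parity, f_x1)]

true_output = F(x2)
decoding_error = [n - t for n, t in zip(naive_reconstruction, true_output)]
```

The nonzero `decoding_error` is the cross term that squaring introduces. A code for a general F has to account for how F transforms the encoded input, which is what makes hand-designing such codes difficult.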
In this project, we study the potential for machine learning to alleviate the difficulty of designing new erasure codes for coded-computation. We propose to integrate machine learning into the coded-computation framework and learn to reconstruct slow or failed results of computation.
We have developed multiple techniques for integrating machine learning into the coded-computation framework. As a first driving application, we have shown the promise of learning-based coded-computation for systems that perform inference over neural networks. We have shown that learning-based coded-computation enables accurate reconstruction of unavailable predictions resulting from inference, and significantly reduces tail latency in the presence of resource contention. These benefits come with only a fraction of the resource overhead of replication-based techniques.
While we have showcased learning-based coded-computation for machine learning inference workloads, the core ideas behind our approach have the potential to expand the reach of coded-computation to a broader class of computations. This may enable erasure codes to be applied more broadly in distributed systems.
FACULTY
GRAD STUDENTS
COLLABORATORS
Shivaram Venkataraman, U. Wisconsin-Madison
The following links contain the source code associated with the research performed in this project.
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg, Datadog, Google, Honda, Intel Corporation, IBM, Jane Street, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.