ASPLOS '26: 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Pittsburgh, PA, USA, March 22-26, 2026
Yixuan Mei, Shreya Varshini*, Harish Dixit*, Sriram Sankar*, K. V. Rashmi
Carnegie Mellon University
*Meta Platforms, Inc.
Silent Data Corruption (SDC) poses a reliability threat in modern datacenters. These insidious errors evade detection and propagate incorrect results throughout the system. Companies including Google, Meta, and Alibaba have reported SDC incidents affecting their production systems. In this paper, we present the first comprehensive instruction- and application-level analysis of vector instruction SDCs in hyper-scale datacenters using a two-stage approach. We perform over 78 trillion test rounds in more than 14 billion CPU seconds. Our observations reveal undocumented SDC patterns that provide insights into possible underlying causes and inspire new mitigation strategies. Based on these findings, we propose a low-overhead SDC detection mechanism leveraging in-application algorithm-based fault tolerance. Our method achieves an 88% to 100% SDC machine detection rate with a time overhead of only 1.35%, even for modestly sized inputs.
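For readers unfamiliar with algorithm-based fault tolerance (ABFT), the following is a minimal sketch of the classic checksum test for matrix multiplication. It is an illustration only and is not the detection mechanism proposed in this paper, whose details the abstract does not give. The function names `matmul` and `abft_check` and the simulated bit flip are hypothetical, and integer arithmetic is used so that the comparison is exact; floating-point data would need a tolerance instead.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of a (n x k) and b (k x m)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


def abft_check(a, b, c):
    """Return True if c passes the column-checksum test for c == a @ b.

    Uses the identity (1^T A) B == 1^T (A B): the column sums of C must
    equal the column sums of A multiplied by B.
    """
    k, m = len(b), len(b[0])
    # Column sums of A, i.e. the row vector 1^T A (length k).
    col_sum_a = [sum(row[t] for row in a) for t in range(k)]
    # Expected column sums of C, computed without using C itself.
    expected = [sum(col_sum_a[t] * b[t][j] for t in range(k)) for j in range(m)]
    # Actual column sums of the (possibly corrupted) result.
    actual = [sum(row[j] for row in c) for j in range(m)]
    return expected == actual


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

c = matmul(a, b)
clean_ok = abft_check(a, b, c)          # correct result passes the check

# Simulate a silent corruption: flip one bit in one output element.
corrupted = [row[:] for row in c]
corrupted[0][1] ^= 1 << 3
corrupted_ok = abft_check(a, b, corrupted)  # corrupted result fails the check
```

In this sketch the check costs O(nk + km + nm) operations against O(nkm) for the multiplication itself, so its relative overhead shrinks as the matrices grow. A single column checksum can miss corruptions that cancel out within a column; practical schemes often add row checksums as well.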