54th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 2021, Oct. 2021.
Minh S. Q. Truong, Eric Chen, Deanyone Su, Alexander Glass, Liting Shen, L. Richard Carley, James A. Bain, Saugata Ghose*
Carnegie Mellon University
* University of Illinois
To combat the high energy costs of moving data between main memory and the CPU, recent works have proposed processing-using-memory (PUM), a type of processing-in-memory where operations are performed on data in situ (i.e., right at the memory cells holding the data). Several common and emerging memory technologies offer the ability to perform bitwise Boolean primitive functions by having interconnected cells interact with each other, eliminating the need to use discrete CMOS compute units for several common operations. Recent PUM architectures build on these Boolean primitives to perform bit-serial computation using memory. Unfortunately, several practical limitations of the underlying memory devices restrict how large emerging memory arrays can be, which hinders the ability of conventional bit-serial computation approaches to deliver high performance in addition to large energy savings.
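To make the idea of bit-serial computation from a single Boolean primitive concrete, the following is a minimal software sketch, not the paper's circuit or microcode. It assumes NOR is the only available primitive and builds a w-bit adder from it, processing one bit position per step. All function names here are hypothetical and chosen for illustration only.

```python
def nor(a: int, b: int) -> int:
    """NOR of two single bits (each 0 or 1)."""
    return 1 - (a | b)

# Every other gate below is composed purely from NOR.
def not_(a: int) -> int:
    return nor(a, a)

def or_(a: int, b: int) -> int:
    return not_(nor(a, b))

def and_(a: int, b: int) -> int:
    return nor(not_(a), not_(b))

def xor_(a: int, b: int) -> int:
    # (a OR b) AND NOT (a AND b), written as a NOR of the two complements
    return nor(nor(a, b), and_(a, b))

def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum bit, carry out) for one bit position."""
    p = xor_(a, b)
    return xor_(p, cin), or_(and_(a, b), and_(p, cin))

def bit_serial_add(x: int, y: int, w: int) -> int:
    """Add two w-bit unsigned integers one bit per step (result wraps mod 2**w)."""
    carry, result = 0, 0
    for i in range(w):  # least significant bit first
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result
```

In a PUM substrate, each such bit step would typically be applied to many data elements in parallel, which is why the per-element latency of w steps can still be acceptable; the sketch above models only a single element.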
In this paper, we propose RACER, a cost-effective PUM architecture that delivers high performance and large energy savings using small arrays of resistive memories. RACER makes use of a bit-pipelining execution model, which can pipeline bit-serial w-bit computation across w small tiles. We fully design efficient control and peripheral circuitry, whose area can be amortized over small memory tiles without sacrificing memory density, and we propose an ISA abstraction for RACER to allow for easy program/compiler integration. We evaluate an implementation of RACER using NOR-capable ReRAM cells across a range of microbenchmarks extracted from data-intensive applications, and find that RACER provides 107×, 12×, and 7× the performance of a 16-core CPU, a 2304-shader-core GPU, and a state-of-the-art in-SRAM compute substrate, respectively, with energy savings of 189×, 17×, and 1.3×.
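The benefit of pipelining bit-serial work across w tiles can be illustrated with a simplified step-count model. This is a sketch under stated assumptions, not the timing model evaluated in the paper: it assumes every bit step takes one uniform time unit, that tile i handles bit i, and that a new batch of operands can enter the first tile on every step. The function names are hypothetical.

```python
def serial_steps(n_batches: int, w: int) -> int:
    """Steps when one array processes all w bits of each batch before the next."""
    return n_batches * w

def pipelined_schedule(n_batches: int, w: int) -> list[dict[int, int]]:
    """For each step t, map tile index -> batch index being processed.

    Assumption: tile i works on bit i, so batch b occupies tile i at step b + i.
    """
    if n_batches == 0:
        return []
    total = w + n_batches - 1  # fill the pipeline once, then one batch per step
    return [
        {tile: t - tile for tile in range(w) if 0 <= t - tile < n_batches}
        for t in range(total)
    ]

def pipelined_steps(n_batches: int, w: int) -> int:
    return len(pipelined_schedule(n_batches, w))
```

Under these assumptions, 100 batches of 64-bit operations take 6400 steps bit-serially on one array but 163 steps when pipelined over 64 tiles, since after the pipeline fills, one batch completes per step.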