Georgia Tech Center for Experimental Research in Computer Systems Technical Report GIT-CERCS-12-08.
Hrishikesh Amur^, Wolfgang Richter, David G. Andersen, Michael Kaminsky^, Karsten Schwan^, Athula Balachandran, Erik Zawadzki
Carnegie Mellon University
Pittsburgh, PA 15213
*Intel Labs
^Georgia Institute of Technology
Memory is rapidly becoming a precious resource in many data processing environments. This paper introduces a new data structure called a Compressed Buffer Tree (CBT). Using a combination of buffering, compression, and lazy aggregation, CBTs can improve the memory efficiency of the GroupBy-Aggregate abstraction which forms the basis of many data processing models like MapReduce and databases. We evaluate CBTs in the context of MapReduce aggregation, and show that CBTs can provide significant advantages over existing hashbased aggregation techniques: up to 2X less memory and 1.5X the throughput, at the cost of 2.5X CPU.
FULL PAPER: pdf