Understanding and Maturing the Data-Intensive Scalable Computing Storage Substrate

Microsoft Research eScience Workshop 2009, Pittsburgh, PA, October 16-17, 2009.

Garth A. Gibson, Bin Fan, Swapnil Patil, Milo Polte, Wittawat Tantisiriroj, Lin Xiao

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

http://www.pdl.cmu.edu/

Modern science has access to, and is more productively pursued with, massive amounts of data, typically gathered from sensors or produced by simulation or other processing. The table below shows a sampling of data sets that a few scientists at Carnegie Mellon University have available to them or intend to construct soon. Data-Intensive Scalable Computing (DISC) couples computational resources with the data storage and access capabilities needed to handle massive-data science quickly and efficiently. This extended abstract examines the effectiveness of the data-intensive file systems embedded in a DISC system. We are interested in understanding the differences between data-intensive file system implementations and high-performance computing (HPC) parallel file system implementations; both are used at comparable scale and speed. Beyond differences in included features, which we expect to evolve as data-intensive file systems see wider use, we find that performance need not be vastly different. A major source of difference is their approach to protecting data against failures: replication in DISC file systems versus RAID in HPC parallel file systems. We address the inclusion of RAID in a DISC file system to dramatically increase the effective capacity available to users. This work is part of a larger effort to mature and optimize DISC infrastructure services.
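
The capacity argument for RAID over replication can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the replication factor of 3 and the 8 data + 2 parity layout are assumed values chosen only for illustration, and the function names are hypothetical. It computes the fraction of raw disk capacity left for user data under each scheme.

```python
from fractions import Fraction


def usable_fraction_replication(copies: int) -> Fraction:
    """Fraction of raw capacity holding unique user data when every
    block is stored `copies` times (copies=3 is an assumed example)."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return Fraction(1, copies)


def usable_fraction_parity(data_blocks: int, parity_blocks: int) -> Fraction:
    """Fraction of raw capacity holding user data when each stripe has
    `data_blocks` data blocks plus `parity_blocks` redundancy blocks
    (a RAID-style layout; 8+2 is an assumed example)."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need data_blocks >= 1 and parity_blocks >= 0")
    return Fraction(data_blocks, data_blocks + parity_blocks)


if __name__ == "__main__":
    replicated = usable_fraction_replication(3)   # assumed: three copies
    striped = usable_fraction_parity(8, 2)        # assumed: 8 data + 2 parity
    print(f"replication x3: {float(replicated):.1%} of raw capacity usable")
    print(f"parity 8+2:     {float(striped):.1%} of raw capacity usable")
    print(f"capacity gain:  {float(striped / replicated):.2f}x")
```

Under these assumed parameters, the parity layout leaves 4/5 of raw capacity for user data versus 1/3 for three-way replication, while both tolerate the loss of any two blocks in a stripe or replica set. The actual layouts and gains evaluated in the paper may differ.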
