Petascale Data Storage at CMU: Publications

Journals

Failure Tolerance in Petascale Computers. Garth Gibson, Bianca Schroeder, Joan Digney. CTWatch Quarterly, vol. 3 no. 4. Volume on Software Enabling Technologies for Petascale Science. November 2007. www.ctwatch.org
PDF

Understanding Failures in Petascale Computers. Bianca Schroeder, Garth A. Gibson. SciDAC 2007. Journal of Physics: Conference Series 78 (2007) 012022.
Abstract / PDF / Permanent JPCS Link
All 100 open access volumes of the Journal of Physics Conference Series (JPCS) are available via the journal home page: http://herald.iop.org/JPCS_home/m294/crk//link/1520

Understanding Disk Failure Rates: What does an MTTF of 1,000,000 hours mean to you? Bianca Schroeder, Garth A. Gibson. ACM Transactions on Storage (TOS), Volume 3 Issue 3, October 2007.

Early Experiences on the Journey Towards Self-* Storage. Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, Jay J. Wylie. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, September 2006.
Abstract / PDF

Conferences

Scale and Concurrency of GIGA+: File System Directories with Millions of Files. Swapnil Patil, Garth Gibson. Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST '11), San Jose, CA, February 2011. Supersedes Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-10-110, Sept. 2010.
Abstract / PDF [508K]

...And eat it too: High read performance in write-optimized HPC I/O middleware file formats. Milo Polte, Jay Lofstead, John Bent, Garth Gibson, Scott A. Klasky, Qing Liu, Manish Parashar, Norbert Podhorszki, Karsten Schwan, Meghan Wingate, Matthew Wolf. 4th Petascale Data Storage Workshop held in conjunction with Supercomputing '09, November 15, 2009. Portland, Oregon. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-09-111, November 2009.
Abstract / PDF [388K]

PLFS: A Checkpoint Filesystem for Parallel Applications. John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, Meghan Wingate. Supercomputing '09, November 15, 2009. Portland, Oregon.
Abstract / PDF [388K]

DiskReduce: RAID for Data-Intensive Scalable Computing. Bin Fan, Wittawat Tantisiriroj, Lin Xiao, Garth Gibson. 4th Petascale Data Storage Workshop held in conjunction with Supercomputing '09, November 15, 2009. Portland, Oregon. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-09-112, November 2009.
Abstract / PDF [304K]

Understanding and Maturing the Data-Intensive Scalable Computing Storage Substrate. Garth Gibson, Bin Fan, Swapnil Patil, Milo Polte, Wittawat Tantisiriroj, Lin Xiao. Microsoft Research eScience Workshop 2009, Pittsburgh, PA, October 16-17, 2009.
Abstract / PDF [520K]

In Search of an API for Scalable File Systems: Under the table or above it? Swapnil Patil, Garth A. Gibson, Gregory R. Ganger, Julio Lopez, Milo Polte, Wittawat Tantisiriroj, and Lin Xiao. USENIX HotCloud Workshop 2009. June 2009, San Diego, CA.
Abstract / PDF [260K]

Enabling Enterprise Solid State Disks Performance. Milo Polte, Jiri Simsa, Garth Gibson. 1st Workshop on Integrating Solid-state Memory into the Storage Hierarchy, March 7, 2009, Washington DC.
Abstract / PDF [302K]

Fast Log-based Concurrent Writing of Checkpoints. Milo Polte, Jiri Simsa, Wittawat Tantisiriroj, Garth Gibson, Shobhit Dayal, Mikhail Chainani, Dilip Kumar Uppugandla. Proceedings of the 3rd Petascale Data Storage Workshop held in conjunction with Supercomputing '08, November 17, 2008, Austin, TX.
Abstract / PDF [262K]

Comparing Performance of Solid State Devices and Mechanical Disks. Milo Polte, Jiri Simsa, Garth Gibson. Proceedings of the 3rd Petascale Data Storage Workshop held in conjunction with Supercomputing '08, November 17, 2008, Austin, TX.
Abstract / PDF [99K]

On Application-level Approaches to Avoiding TCP Throughput Collapse in Cluster-Based Storage Systems. E. Krevat, V. Vasudevan, A. Phanishayee, D. Andersen, G. Ganger, G. Gibson, S. Seshan. Proceedings of the 2nd International Petascale Data Storage Workshop (PDSW '07) held in conjunction with Supercomputing '07. November 11, 2007, Reno, NV.
Abstract / PDF

GIGA+: Scalable Directories for Shared File Systems. Swapnil V. Patil, Garth A. Gibson, Sam Lang, Milo Polte. Proceedings of the 2nd International Petascale Data Storage Workshop (PDSW '07) held in conjunction with Supercomputing '07. November 11, 2007, Reno, NV.
Abstract / PDF

Modeling the Relative Fitness of Storage. Michael P. Mesnier, Matthew Wachs, Raja R. Sambasivan, Alice X. Zheng, Gregory R. Ganger. SIGMETRICS'07, June 12-16, 2007, San Diego, California, USA. ACM. Awarded Best Paper.
Abstract / PDF

Fingerpointing Correlated Failures in Replicated Systems. Soila Pertet, Rajeev Gandhi and Priya Narasimhan. USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), Cambridge, MA. April 2007.
Abstract / PDF

MultiMap: Preserving Disk Locality for Multidimensional Datasets. Minglong Shao, Steven W. Schlosser, Stratos Papadomanolakis, Jiri Schindler, Anastassia Ailamaki, Gregory R. Ganger. IEEE 23rd International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, April 2007. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-05-102, March 2005.
Abstract / PDF

The Computer Failure Data Repository. Bianca Schroeder, Garth Gibson. Invited contribution to the Workshop on Reliability Analysis of System Failure Data (RAF'07) MSR Cambridge, UK, March 2007.
Abstract / PDF

//TRACE: Parallel Trace Replay with Approximate Causal Events. Michael Mesnier, Matthew Wachs, Raja R. Sambasivan, Julio Lopez, James Hendricks, Gregory R. Ganger. Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07), February 13-16, 2007, San Jose, CA. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-06-108, September 2006.
Abstract / PDF

Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? Bianca Schroeder, Garth A. Gibson. Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07), February 13-16, 2007, San Jose, CA. Best Paper Award.
Abstract / PDF

A Large Scale Study of Failures in High-performance-computing Systems. Bianca Schroeder, Garth Gibson. International Symposium on Dependable Systems and Networks (DSN 2006). IEEE Transactions on Dependable and Secure Computing (TDSC).
Abstract / PDF

Argon: Performance Insulation for Shared Storage Servers. Matthew Wachs, Michael Abd-El-Malek, Eno Thereska, Gregory R. Ganger. Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07), February 13-16, 2007, San Jose, CA.
Abstract / PDF

Towards Fingerpointing in the Emulab Dynamic Distributed System. Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, Jay Lepreau. Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), Seattle, WA. Nov. 5, 2006.
Abstract / PDF

Technical Reports

Data-intensive file systems for Internet services: A rose by any other name ... Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-114, October 2008.
Abstract / PDF [350K]

GIGA+: Scalable Directories for Shared File Systems. Swapnil Patil, Garth Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-110, October 2008.
Abstract / PDF [400K]

Characterizing HEC Storage Systems at Rest. Shobhit Dayal. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-109, July 2008.
Abstract / PDF [603K]

User Level Implementation of Scalable Directories (GIGA+). Sanket Hase, Aditya Jayaraman, Vinay K. Perneti, Sundararaman Sridharan, Swapnil V. Patil, Milo Polte, Garth A. Gibson. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-107, May 2008.
Abstract / PDF [1.67M]

File System Virtual Appliances: Third-party File System Implementations without the Pain. Michael Abd-El-Malek, Matthew Wachs, James Cipar, Gregory R. Ganger, Garth A. Gibson, Michael K. Reiter. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-106, May 2008.
Abstract / PDF [508K]

Posters

Petascale Data Management: Guided by Measurement. Garth Gibson, PDSI PIs. June 2008, Washington, D.C.
PDF

PDSI Shared Information Resources for HEC Storage. PDSI PIs. ASCR PI meeting, March 31, 2008, Denver, CO.
PDF

PDSI Data Releases and Repositories. PDSI PIs. 6th USENIX Conference on File and Storage Technologies (FAST '08). Feb. 26-29, 2008. San Jose, CA.
PDF

Talks

GIGA+: Scalable Directories for Shared File Systems. Garth Gibson, Carnegie Mellon University. HEC FSIO R&D Conference/HECURA FSIO PI Meeting '08, Arlington, VA. Aug 3 - Aug 6, 2008.
PDF

Performance Insulation and Predictability for Shared Cluster Storage. Greg Ganger, Carnegie Mellon University. HEC FSIO R&D Conference/HECURA FSIO PI Meeting '08, Arlington, VA. Aug 3 - Aug 6, 2008.
PDF

Towards Automated Problem Analysis of Large-Scale Storage Systems. Priya Narasimhan, Carnegie Mellon University. HEC FSIO R&D Conference/HECURA FSIO PI Meeting '08, Arlington, VA. Aug 3 - Aug 6, 2008.
PDF

SciDAC PDSI Update. Garth Gibson, Carnegie Mellon University. HEC FSIO R&D Conference/HECURA FSIO PI Meeting '08, Arlington, VA. Aug 3 - Aug 6, 2008.
PDF

Failure in Supercomputers and Supercomputer Storage. Garth Gibson, Carnegie Mellon University. NSF/DOE Expedition Workshop/Toward Scalable Data Management. June 10, 2008. Washington, D.C.
PDF [3.1M] / MP3 [2.7M]
Abstract: The largest computer systems have entered the era of Peta operations per second and will climb to Exa operations per second over the next decade, largely on the strength of more cores per chip and more chips per system. The inevitable consequence of increasing component counts is more parts that can fail, higher failure rates, more concurrent failures and more effort devoted to coping with and recovering from failures -- a key role for storage systems. In this talk I will review historical data on failure rates in supercomputers to project future failure rates, review growing limitations on traditional fault tolerance strategies for supercomputers based on high-speed checkpointing to parallel storage systems, and address the increasing failure issues in storage components.

Petascale Data Storage Institute - Access Methods. Garth Gibson, Carnegie Mellon University. SDM-PDSI Mini Workshop. Nov 30, 2007. Seattle, WA.
PDF [495K]

Understanding Failure in Petascale Computers. Garth Gibson (Joint work with Bianca Schroeder). 2007 SciDAC Conference, June 25, Boston MA.
PDF [899K]