DISC: Data-Intensive Super Computing
The leading Internet search providers have created a new class of large-scale computer systems to support their businesses. We are formulating a plan for a research project that extends the type of computing systems used for Internet search to a larger range of applications. We refer to such systems as "Data-Intensive Super Computing" (DISC) systems. DISC systems differ from conventional supercomputers in their focus on data: in addition to performing large-scale computations over the data, they acquire and maintain continually changing data sets. With massive amounts of data arising from sources as diverse as telescope imagery, numerical simulations, medical records, online transaction records, and web pages, DISC systems have the potential to achieve major advances in science, health care, business efficiency, and information access. DISC opens up many important research topics in system design, resource management, programming models, parallel algorithms, and applications. By engaging the academic research community in these issues, we can explore fundamental aspects of a societally important style of computing more systematically and in a more open forum.
Applications
- Web search without language barriers
- Inferring biological function from genomic sequences
- Predicting and modeling the effects of earthquakes
- Discovering new astronomical phenomena from telescope imagery data
- Synthesizing realistic graphic animations
- Understanding the spatial and temporal patterns of brain behavior based on MRI data
Research Areas
- Programming models for DISC systems
- Methodologies and tools for supporting software development in DISC systems
- Runtime software support for DISC systems
- Resource management and sharing
- Hardware and processor design for DISC systems
Challenges
- How should the processors be designed for use in cluster machines?
- How can we effectively support different scientific communities in their data management and applications?
- Can we radically reduce the energy requirements for large-scale systems?
- How do we build large-scale computing systems with an appropriate balance of performance and cost?
- How can very large systems be constructed given the realities of component failures and repair times?
- Can we support a mix of long-running data-intensive jobs with ones requiring interactive response?
- How do we control access to the system while enabling sharing?
- Can we deal with bad or unavailable data in a systematic way?
- Can high-performance systems be built from heterogeneous components?
News
Yahoo! press releases:
Associated Projects
People
FACULTY
Randy Bryant
Greg Ganger
Garth Gibson
Julio López
David O'Hallaron
GRADUATE STUDENTS
Wittawat Tantisiriroj
EXTERNAL COLLABORATORS
Gary Grider (LANL)
James Nunez (LANL)
Jay Kistler (Yahoo!)
Chris Olston (Yahoo!)
Publications
- Applying Idealized Lower-bound Runtime Models to Understand Inefficiencies in Data-intensive Computing (Extended Abstract). Elie Krevat, Tomer Shiran, Eric Anderson, Joseph Tucek, Jay J. Wylie, Gregory R. Ganger: SIGMETRICS 2011: 125-126, San Jose, CA, June 7-11, 2011.
Abstract / PDF [297K]
- Applying Performance Models to Understand Data-intensive Computing Efficiency. Elie Krevat, Tomer Shiran, Eric Anderson†, Joseph Tucek†, Jay J. Wylie†, Gregory R. Ganger. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-108. May 2010.
Abstract / PDF [304K]
- Understanding and Maturing the Data-Intensive Scalable Computing Storage Substrate. Garth Gibson, Bin Fan, Swapnil Patil, Milo Polte, Wittawat Tantisiriroj, Lin Xiao. Microsoft Research eScience Workshop 2009, Pittsburgh, PA, October 16-17, 2009.
Abstract / PDF [520K]
- Data-Intensive Supercomputing: The Case for DISC. Randal E. Bryant. Carnegie Mellon University School of Computer Science Tech Report CMU-CS-07-128. May 10, 2007.
PDF
- Data-Intensive Supercomputing: Presentation to the 2007 Federated Computing Research Conference (FCRC)
V1 | Revised Version
Presentations
- Improving Storage Services in the Cloud. Garth Gibson. 3rd Open Cirrus Summit, Seoul, Korea, June 8-9, 2010.
Quicktime MOV [250MB, ~15 min]
Acknowledgements
We thank the members and companies of the PDL Consortium: Amazon, Bloomberg, Datadog, Google, Honda, Intel Corporation, IBM, Jane Street, Meta, Microsoft Research, Oracle Corporation, Pure Storage, Salesforce, Samsung Semiconductor Inc., Two Sigma, and Western Digital for their interest, insights, feedback, and support.