PARALLEL DATA LAB 

PDL Abstract

DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics

Proceedings of the ACM International Symposium on High Performance Distributed Computing (HPDC), Chicago, IL. June, 2010. Supersedes Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-10-104.

Bin Fu, Kai Ren, Julio López, Eugene Fink, Garth A. Gibson

Parallel Data Laboratory
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213

http://www.pdl.cmu.edu/

DiscFinder is a scalable, distributed, data-intensive group finder for analyzing observation and simulation astrophysics datasets. Group finding is a form of clustering used in astrophysics for identifying large-scale structures such as galaxies and clusters of galaxies. DiscFinder runs on commodity compute clusters and scales to large datasets with billions of particles. It is designed to operate on datasets that are much larger than the aggregate memory available in the computers where it executes. As a proof of concept, we have implemented DiscFinder as an application on top of the Hadoop framework. DiscFinder has been used to cluster the largest open-science cosmology simulation datasets, containing as many as 14.7 billion particles. We evaluate its performance and scaling properties and describe the optimizations performed.
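To make the notion of group finding concrete, the following is a minimal single-machine sketch. The abstract does not say which grouping criterion DiscFinder uses, so this sketch assumes friends-of-friends, a common group-finding technique in astrophysics in which two particles belong to the same group if they lie within a chosen linking length of each other, directly or through a chain of neighbors. The function name `find_groups`, the `linking_length` parameter, and the sample particles are hypothetical, and the all-pairs, in-memory loop is for illustration only; it is not the distributed Hadoop implementation described above.

```python
from itertools import combinations
from math import dist


def find_groups(points, linking_length):
    """Return one group label per point (hypothetical helper, not DiscFinder's API).

    Two points are linked when their distance is at most linking_length;
    groups are the connected components of those links (friends-of-friends).
    """
    parent = list(range(len(points)))  # union-find forest, one node per point

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: O(n^2), fine for a sketch, not for billions of particles.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Attach the larger root to the smaller so labels are stable.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(len(points))]


# Made-up sample data: three particles in a chain and one far away.
particles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
labels = find_groups(particles, linking_length=0.6)
```

In this sample the first three particles end up in one group even though the first and third are farther apart than the linking length, because the second particle links them. Any approach that scales past available memory would need to partition the particles and merge groups across partition boundaries, which this sketch does not attempt.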

FULL PAPER: pdf