PARALLEL DATA LAB

DISC-Distances:

ANALYZING THE DISTRIBUTION OF DISTANCES
BETWEEN GALAXIES

DistancesWe are working on algorithms for computing correlation functions, which is a cosmological application for analyzing the distribution of matter in the universe. We are also addressing the related problem of determining the fractal dimension of the universe. This project is part of the work on astronomy applications of data-intensive super computing (Astro-DISC).

PROBLEM

The computation of two-point correlation functions involves representing galaxies as points in three-dimensional space; counting the pairs of galaxies within different distance ranges from each other; and building a histogram that shows the dependency of the number of pairs on the distance. Cosmologists use such histograms as one of their tools for analyzing large-scale properties of the universe and the history of its evolution. A straightforward algorithm for counting galaxy pairs takes quadratic time, which is impractically slow for massive astronomical surveys and cosmological simulations. We are working on distributed approximation algorithms for fast accurate estimation of correlation functions.

RESULTS

We have developed and evaluated several alternative algorithms, including the exact computation with O(n5/3) average running time; an approximate linear-time computation; and another approximate algorithm with sub-linear time, based on sampling of a dataset and finding correlations functions for the sample. We have also designed a top-level control module, which selects an appropriate combination of exact and approximation algorithms for a given dataset size and accuracy requirements. The resulting system can process trillions of galaxies and compute correlation functions with 1% precision.

More details: Summary of the algorithms and empirical results

CHALLENGES

We are working on a distributed version, which will allow control of the tradeoff among the approximation accuracy, running time, and number of used cores. We are also developing algorithms for three-point correlation functions, which compute the distribution of galaxy triples by distances among them. A longer-term challenge is to integrate the developed techniques with other astronomical applictions and build a general-purpose architecture for the analysis of massive astronomical data.

PEOPLE

FACULTY
Eugene Fink
Garth Gibson
Julio López
Christos Faloutsos

GRADUATE STUDENTS
Bin Fu

EXTERNAL COLLABORATORS
Tiziana Di Matteo (Physics, Carnegie Mellon University)
Rupert Croft (Physics, Carnegie Mellon University)