A quasar is a galaxy with an unusually massive black hole in its center. This black hole causes compression and heating of matter around it under the impact of its gravity, which leads to emission of massive radiation, thus increasing the brightness of the galaxy. A quasar may be hundreds of times as bright as a regular galaxy, which makes it visible at very large distances. Modern telescopes allow detection of quasars billions of light years away, at the edge of the observable universe. Astronomers and cosmologists use distant quasars as "beacons," which allow charting remote regions of the universe and studying its expansion.
Accurate identification of quasars is a hard problem, since regular
telescope images do not provide explicit distance data, and quasars
look like "shiny dots," not much different from stars and regular
galaxies. To distinguish objects of different types, astronomers
compare images made through five different color filters, and study
the distribution of each object's brightness over these colors. While
researchers have developed algorithms for accurate identification of
nearby quasars, the problem of identifying distant quasars has turned
out to be more challenging. We are working on application of data
mining to this problem, specifically, finding quasars that are more
than twelve billion light years away.
We represent a celestial object by five numeric values, which show its
brightness in images made through five color filters, and apply
supervised learning to identify distant quasars based on these values.
We have experimented with a variety of learning techniques, including
decision trees, support vector machines, clustering, and nearest
neighbors.
The most effective among the techniques evaluated so far is the
majority-vote combination of the C4.5 decision trees, support vector
machines with RBF kernel, and 11 nearest neighbors. Its precision is
about 81%, which means that 81% of the objects identified by the
system are true quasars. It is about twice as accurate as the earlier
methods developed by astronomers. Its main drawback is low recall,
which is about 30%; that is, the system detects only 30% of distant
quasars and misses the other 70%. Intuitively, it achieves high
precision by being conservative and rejecting objects in case of
uncertainty.
More details: Summary of
the algorithms and empirical results
We are working on improvements to the developed technique, and
implementing its distributed version, which will be able to process
datasets with hundreds of millions of objects. We are also testing
applicability of other data-mining algorithms to this problem.
FACULTY
Eugene Fink
Garth Gibson
Julio López
GRADUATE STUDENTS
Bin Fu
EXTERNAL COLLABORATORS
Joel Welling (Pittsburgh Supercomputing Center)
Michael Wood-Vasey (Physics and Astronomy, University of Pittsburgh)