Contact:
Office: GHC 8019
Phone: (412) 268-1457
Fax: (412) 268-5576
Admin: Barbara Grandillo, (412) 268-7550
Mailing Address: Computer Science Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891
Position: Professor, SCS
Projects: Data Mining, Active Disks, Sensor Data Mining, Disk Workload Characterization with Fractals
My research focuses on databases, specifically on fast search methods for multimedia and medical-image databases, on data mining, and on performance issues for large datasets.
The first project examines fast methods for approximate matching in multimedia databases. Typical queries are as follows: "in a collection of product photographs, find products that look like tennis shoes;" "in a collection of medical X-rays, find ones that look like the X-ray of the current patient, and list the corresponding diagnoses." The main idea behind our approach is to extract n features from the objects of interest (typically, with the help of a domain expert), thus mapping each object into a point in n-dimensional feature space. Subsequently, we can use state-of-the-art database techniques (like the 'R-trees') to store and retrieve these n-dimensional points. The philosophy of our approach is to provide a fast filter that discards the vast majority of irrelevant objects. Allowing some 'false alarms' is acceptable, because they can be easily discarded by a more elaborate test, or even by the user. We have already found successful sets of features for 2-d color images, 2-d shapes, and 1-d time series (such as stock-price movements). Depending on the domain, we are experimenting with modern signal processing techniques, such as the discrete wavelet transform for sound and images, hidden Markov chains for digitized voice, and the discrete cosine transform for stock-price time series.
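The filter-and-refine idea above can be sketched for 1-d time series as follows. This is only an illustration under stated assumptions, not our actual system: the function names (dct_features, range_query), the toy series, and the choice of two features are made up for the example, and a plain linear scan over the feature points stands in for an R-tree. The features are the first few coefficients of an orthonormal discrete cosine transform; because that transform preserves Euclidean distance, the distance between truncated feature vectors never exceeds the true distance, so the filter can let through false alarms but should not miss a true match.

```python
import math


def dct_features(series, n_features):
    """Return the first n_features coefficients of the orthonormal DCT-II."""
    n = len(series)
    feats = []
    for k in range(n_features):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(series))
        feats.append(scale * total)
    return feats


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(collection, query, eps, n_features=2):
    """Find all series within distance eps of query.

    Step 1 (filter): compare only the low-dimensional feature points.
    Step 2 (refine): run the exact, more expensive test on the survivors.
    """
    q_feats = dct_features(query, n_features)
    candidates = [name for name, s in collection.items()
                  if euclidean(dct_features(s, n_features), q_feats) <= eps]
    matches = [name for name in candidates
               if euclidean(collection[name], query) <= eps]
    return candidates, matches


# Hypothetical toy data: four short series of equal length.
collection = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [1, 2, 3, 4, 5, 6, 7, 9],
    "C": [8, 7, 6, 5, 4, 3, 2, 1],
    "D": [2, 1, 4, 3, 6, 5, 8, 7],
}
query = [1, 2, 3, 4, 5, 6, 7, 8]

candidates, matches = range_query(collection, query, eps=1.5)
print("candidates after filter:", candidates)
print("matches after refinement:", matches)
```

In this toy run, series D has the same overall trend as the query, so it survives the filter as a false alarm and is only rejected by the exact test; series C is discarded by the filter alone. In a real setting the feature points would be stored in a spatial index rather than scanned one by one.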
A second project focuses on data mining. The goal is to discover correlations ('rules') in a collection of records. For example, in a set of patient records with demographic characteristics, symptoms and diagnoses, we would like to find all the 'interesting' rules (e.g., 'patients 50 years old with cholesterol 300 have a 10% probability of heart attack'). We are studying traditional clustering methods, as well as statistical methods like the 'Singular Value Decomposition' (SVD), for cluster analysis and rule discovery.
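As a rough illustration of how the SVD can expose a correlation, the sketch below approximates the first right singular vector of a small record matrix by power iteration. Everything specific here is an assumption made for the example: the function name top_right_singular_vector, the two made-up numeric attributes, and the toy values are hypothetical, and the data are left uncentered so that the vector describes a ratio between attributes. A production analysis would normally call a numerical library instead.

```python
import math


def top_right_singular_vector(rows, iterations=200):
    """Approximate the first right singular vector of the matrix `rows`.

    Uses power iteration on A^T A: repeatedly apply v <- A^T (A v) and
    renormalize. The result is a unit vector along the direction in which
    the records vary most (measured from the origin).
    """
    n_cols = len(rows[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        av = [sum(r[j] * v[j] for j in range(n_cols)) for r in rows]
        w = [sum(rows[i][j] * av[i] for i in range(len(rows)))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical records with two numeric attributes; the second attribute
# is roughly twice the first, plus a little noise.
records = [
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, 7.8],
]

v = top_right_singular_vector(records)
ratio = v[1] / v[0]
print("direction:", v)
print("attribute 2 is about %.2f times attribute 1" % ratio)
```

The recovered direction can be read as a rule of the form "attribute 2 is about twice attribute 1"; records that lie far from this direction are candidates for outliers or for a separate cluster.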
The last project examines algorithms and architectures for fast execution of expensive database operations. We are working on Active Disks for data mining, on striping and placement algorithms for video servers, and on buffering algorithms for database operations.