PARALLEL DATA LAB 

PDL Abstract

WindMine: Fast and Effective Mining of Web-click Sequences

2011 SIAM International Conference on Data Mining, April 28-30, 2011, Mesa, AZ.

Yasushi Sakurai*, Lei Li, Yasuko Matsubara**, Christos Faloutsos

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

*NTT Communication Science Labs
**Kyoto University

http://www.pdl.cmu.edu/

Given a large stream of users clicking on web sites, how can we find trends, patterns and anomalies? We have developed a novel method, WindMine, and its fine-tuning sibling, WindMine-part, to find patterns and anomalies in such datasets. Our approach has the following advantages: (a) it is effective in discovering meaningful "building blocks" and patterns, such as the lunch-break trend, as well as anomalies; (b) it automatically determines suitable window sizes; and (c) it is fast, with wall-clock time linear in the duration of the sequences. Moreover, it can be made sub-quadratic in the number of sequences (WindMine-part), with little loss of accuracy. We examine its effectiveness and scalability through experiments on 67 GB of real data (one billion clicks over 30 days). WindMine produces concise, informative and interesting patterns. We also show that WindMine-part can be easily implemented in a parallel or distributed setting and that, even in a single-machine setting, it can be an order of magnitude faster (up to 70 times) than the plain version.
