DATE: Friday, May 24, 2013
TIME: 12:00 pm - 1:00 pm
PLACE: CIC - 4th floor (ISTC Panther Hollow Room)
SPEAKER: Michael J. Carey, UC Irvine
TITLE: Introducing AsterixDB: A Next-Generation Big Data Management System
ABSTRACT:
For the last three and a half years, the ASTERIX project at UCI has been mixing ideas from three distinct areas: semistructured data management, parallel databases, and data-intensive computing (a.k.a. today's Big Data platforms) to create a next-generation, open-source software platform that scales by running on large, shared-nothing commodity computing clusters. The fruits of this labor have been captured in the AsterixDB system that is now being released for unrestricted public use. We like to think that the arrival of AsterixDB will mark the start of "Big Data Management 2.0", and we hope that the Big Data management community will find AsterixDB to be useful for a much broader class of problems than any one of today's Big Data platforms (e.g., Hadoop, Pig, Hive, HBase, Cassandra, MongoDB) can address. One of our primary project mottos has been "one size fits a bunch."
In a nutshell, AsterixDB is a full-function BDMS (Big Data Management System) with a rich feature set that makes it well-suited to modern needs such as web data warehousing or social data storage and analysis. AsterixDB has:
- A semistructured NoSQL-style data model (ADM) resulting from extending JSON with object database ideas
- A declarative query language (AQL) that supports a broad range of queries and analysis over semistructured data
- A parallel runtime query execution engine, Hyracks, that has been scale-tested on up to 1000+ cores and 500+ disks
- Partitioned LSM-based data storage and indexing to support ingestion and management of semistructured data
- Support for access to externally stored data (e.g., data in HDFS) as well as to data stored natively by AsterixDB
- A rich set of primitive types, including spatial and temporal data in addition to integer, floating point, and text data
- Secondary indexing options including B+ trees, R trees, and inverted keyword (exact and fuzzy) indexes
- Support for queries with fuzzy and spatial predicates as well as more traditional parametric queries
- Basic transactional (i.e., concurrency and recovery) capabilities akin to those of a NoSQL store
BIO:
Michael J. Carey is currently a Bren Professor of Information and Computer Sciences at UC Irvine. Immediately prior to joining UCI in 2008, Carey worked at BEA Systems for seven years and served as the chief architect of (and an engineering director for) BEA's AquaLogic Data Services Platform product. Carey also spent twelve years as a professor at the University of Wisconsin-Madison, five years at IBM Almaden as a database researcher/manager, and a year and a half as a Fellow (and briefly the VP of Software) at e-commerce software startup Propel Software during the 2000-2001 Internet bubble. Carey is an ACM Fellow, a member of the National Academy of Engineering, and a recipient of the ACM SIGMOD E. F. Codd Innovations Award. His current research interests are centered around data-intensive computing and scalable data management (a.k.a. Big Data).
HOST: Michael Kozuch
VISITOR COORDINATOR: Jennifer Gabig, jennifer4@cmu.edu
SDI / ISTC SEMINAR QUESTIONS?
Karen Lindenfelser, 86716, or visit www.pdl.cmu.edu/SDI/