DATE: Tuesday, December 4, 2012
TIME: 1:00 - 2:00 pm
PLACE: GHC 8102

SPEAKER: Milo Polte, WibiData

TITLE: Evolving, Complex Schema Management for Distributed, Column-Oriented NoSQL Databases

ABSTRACT:
Unlike traditional relational database management systems, tabular NoSQL databases such as Cassandra, HBase, and Accumulo store untyped byte array values across arbitrary columns. Along with their extreme scalability and fault tolerance, this flexibility has been touted as one of the advantages of these Big Data technologies. But this very freedom can become an application maintainability and data management nightmare when data sets grow older and larger and knowledge of table and data structures lives only as assumptions in code, rather than as annotations in storage.

Data naturally has consistent structure and schemas, and storing typed data can improve code readability, safety, and maintainability. Yet picking "one true schema" a priori for an evolving application is a difficult problem. Our challenge is to find a balanced approach that provides useful layout and schema management on top of NoSQL databases without sacrificing storage space, scalability, or developer flexibility. In this talk, I will discuss the experiences and goals that motivated WibiData's approach to this problem and describe our open source solution.

This talk will focus on our experiences with HBase (http://hbase.apache.org/) and our open source Kiji Project (http://www.kiji.org/), but the lessons covered can be generalized to a broader range of large-scale non-relational databases.

BIO:
Milo Polte is a member of the technical staff at WibiData, Inc., where he has worked since the beginning of 2012, building a platform for large-scale personalized applications. He earned his Master's degree from CMU in 2011, working on distributed storage systems with Garth Gibson in the PDL.

HOST:
Garth Gibson

VISITOR COORDINATOR:
Jenn Landefeld (jennsbl@cs.cmu.edu)

SDI / ISTC SEMINAR QUESTIONS?
Karen Lindenfelser, 86716, or visit www.pdl.cmu.edu/SDI/