Carnegie Mellon University School of Computer Science Ph.D. Dissertation CMU-CS-14-144. December 2014.
James Cipar
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
http://www.pdl.cmu.edu
Many data management systems are faced with a constant, high-throughput stream of updates. In some cases, these updates are generated externally: a data warehouse system must ingest a stream of external events and update its state. In other cases, they are generated by the application itself: large-scale machine learning frameworks maintain a global shared state, which is used to store the parameters of a statistical model. These parameters are constantly read and updated by the application.
In many cases, there is a trade-off between the freshness of the data returned by read operations and the efficiency of updating and querying the data. For instance, batching many updates together will significantly improve the update throughput for most systems. However, batching introduces a delay between when an update is submitted and when it is available to queries.
In this dissertation, I examine this trade-off in detail. I argue that systems should be designed so that the trade-off can be made by the application, not the data management system. Furthermore, this trade-off should be made at query time, on a per-query basis, not as a global configuration.
To demonstrate this, I describe two novel systems. LazyBase is a data warehouse system originally designed to store metadata extracted from enterprise computer files for the purposes of enterprise information management. It batches updates and processes them through a pipeline of transformations before applying them to the database, allowing it to achieve very high update throughput. The novel pipeline query mechanism in LazyBase allows applications to select their desired freshness at query time, potentially reading data that is still in the update pipeline and has not yet been applied to the final database.
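The idea of choosing freshness per query can be illustrated with a minimal sketch. It is not LazyBase's implementation: the names `Batch`, `PipelineStore`, `max_staleness`, and `batches_scanned` are hypothetical, and the model assumes batches are submitted in timestamp order and that a single in-memory dictionary stands in for the final database.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    upload_time: float   # when the batch was submitted (hypothetical field)
    updates: dict        # key -> new value carried by this batch


@dataclass
class PipelineStore:
    """Toy model: a final database plus batches still in the update pipeline."""
    applied: dict = field(default_factory=dict)   # data already applied
    pending: list = field(default_factory=list)   # unapplied batches, oldest first

    def submit(self, batch):
        # Assumption: batches arrive in increasing upload_time order.
        self.pending.append(batch)

    def apply_oldest(self):
        # Move the oldest pending batch into the final database.
        batch = self.pending.pop(0)
        self.applied.update(batch.updates)

    def query(self, key, now, max_staleness):
        """Return (value, batches_scanned) for key.

        The result reflects every update submitted at or before
        now - max_staleness. Fresher queries must scan more pending
        batches, which is the extra cost paid for freshness.
        """
        required = now - max_staleness
        value = self.applied.get(key)
        batches_scanned = 0
        for batch in self.pending:
            if batch.upload_time > required:
                break  # newer than this query needs; skip the rest
            batches_scanned += 1
            if key in batch.updates:
                value = batch.updates[key]
        return value, batches_scanned


store = PipelineStore()
store.submit(Batch(upload_time=1.0, updates={"a": 1}))
store.apply_oldest()
store.submit(Batch(upload_time=5.0, updates={"a": 2}))
store.submit(Batch(upload_time=9.0, updates={"a": 3}))

stale = store.query("a", now=10.0, max_staleness=8.0)   # applied data only
medium = store.query("a", now=10.0, max_staleness=3.0)  # reads one pending batch
fresh = store.query("a", now=10.0, max_staleness=0.0)   # reads both pending batches
```

In this sketch the same store answers all three queries; only the caller's `max_staleness` argument changes how much of the pipeline is read, which is the sense in which the trade-off is made per query rather than as a global configuration.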
LazyTables is a distributed machine learning parameter server: a shared storage system for the sparse vectors and matrices that make up the bulk of the data in many machine learning applications. To achieve high performance in the face of network delays and performance jitter, it makes extensive use of batching and caching, both in the client and server code. The Stale Synchronous Parallel consistency model, conceived for LazyTables, allows clients to specify how out-of-sync different threads of execution may be.
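A bounded-staleness rule of this kind can be sketched as follows. The abstract does not define the bound precisely, so this is one assumed interpretation rather than the LazyTables implementation: each worker advances an integer clock, and a worker may continue only while it is at most `staleness` clocks ahead of the slowest worker. The names `ClockTable`, `tick`, and `can_proceed` are hypothetical.

```python
class ClockTable:
    """Toy bounded-staleness check over per-worker iteration clocks."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers   # completed iterations per worker
        self.staleness = staleness        # allowed lead over the slowest worker

    def tick(self, worker):
        # The worker reports that it finished one more iteration.
        self.clocks[worker] += 1

    def can_proceed(self, worker):
        # Assumed rule: a worker may run ahead of the slowest worker
        # by at most `staleness` clocks; otherwise it must wait.
        return self.clocks[worker] - min(self.clocks) <= self.staleness


table = ClockTable(num_workers=2, staleness=1)
table.tick(0)
after_one_tick = table.can_proceed(0)    # worker 0 is 1 clock ahead
table.tick(0)
after_two_ticks = table.can_proceed(0)   # worker 0 is 2 clocks ahead
table.tick(1)
after_catch_up = table.can_proceed(0)    # slowest worker advanced
```

Setting `staleness` to 0 in this sketch forces workers to move in lockstep, while larger values let fast workers keep computing on slightly old data instead of waiting for stragglers.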
KEYWORDS: Distributed systems, databases, freshness, staleness, machine learning, parameter server, OLAP, data warehouse.
FULL PAPER: pdf