PARALLEL DATA LAB 

PDL Talk Series

July 31, 2024


TIME: 12:00 noon to approximately 1:00 pm EDT
PLACE: Virtual - a Zoom link will be emailed closer to the seminar date


SPEAKER: Jacob Baskin
Software Engineer, Jane Street

Superstore: What We Learned Building a Data Warehouse
In 2022, Jane Street decided to build an on-premises data warehouse, called Superstore, which launched in 2023 and stores about 2PB of data. While we used existing software for most of the heavy lifting, some of our design decisions were a bit more customized. In this talk, I will give a brief architecture overview of Superstore and discuss the choices we made, how they worked in practice, and what we could or should have done differently. Is Parquet the storage format of the future? Does data locality matter? How do you efficiently handle arbitrarily wide data sets with a fixed amount of RAM? Our opinions on all these questions have changed significantly in the past year.

BIO: Jacob Baskin is a software engineer at Jane Street. His previous jobs have included CTO and co-founder of Coord, an urban transportation startup, and software engineer at Google. His focus is on managing data effectively at the application level at scales ranging from "big" to "artisanal small-batch". He graduated from Brown University with a B.A. in Computer Science.


CONTACTS


Director, Parallel Data Lab
VOICE: (412) 268-1297


Executive Director, Parallel Data Lab
VOICE: (412) 268-5485


PDL Administrative Manager
VOICE: (412) 268-6716