DataHub: Collaborative Data Science & Dataset Version Management at Scale

Aaron Elmore (Massachusetts Institute of Technology / University of Chicago)

Abstract: Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like Git, we propose (a) a dataset version control system that lets users create, branch, merge, diff, and search large, divergent collections of datasets, and (b) a platform, DataHub, that lets users perform collaborative data analysis on top of this version control system. In this talk, I outline the challenges in providing dataset version control at scale. DataHub is a collaboration between the University of Chicago, the University of Illinois Urbana-Champaign, the University of Maryland, and MIT.

Bio: Aaron J. Elmore is an Assistant Professor in the Department of Computer Science and the College of the University of Chicago, starting June 2015. Aaron is currently a Postdoctoral Associate at MIT, working with Mike Stonebraker on elastic and multitenant database systems and with Sam Madden on the collaborative data analytics platform DataHub. Aaron's thesis, Elasticity Primitives for Database-as-a-Service, was completed at the University of California, Santa Barbara under the supervision of Divy Agrawal and Amr El Abbadi.