Hephaestus: Accelerating Scientific Discovery with Data Reuse

Jennie Duggan (Northwestern University)

Abstract: Data-intensive science, wherein domain experts use big data analytics in the course of their research, is becoming increasingly common in the physical and social sciences. Moreover, data reuse is becoming the new normal, owing to the open data movement and the arrival of big science experiments such as the Large Hadron Collider. In these settings, a small group of researchers, often using exotic equipment, produces a dataset that is shared by thousands. Unfortunately, weak and spurious correlations are also on the rise in research. For example, Google Flu Trends published its algorithms in 2008 for use in public health, and its accuracy has since plummeted. In the 2011-2012 flu season, the system produced estimates more than 50% higher than the number of cases reported by the U.S. Centers for Disease Control and Prevention.
This work first examines common pitfalls associated with data-intensive science and how they contribute to irreproducible results. We then propose a system for conducting virtual experiments over existing data, which simulates randomized controlled trials by reframing the principles of empirical research. These virtual experiments underpin a larger platform we call Hephaestus. This framework accumulates virtual experiments in a visualization to help scientists identify consistencies and anomalies in an area of research. Finally, we highlight a set of research challenges associated with this platform. We argue that by using this approach, data-intensive science may come to achieve accuracy on par with its causality-driven predecessors.

Bio: Jennie Duggan is an assistant professor in the EECS department at Northwestern University. Previously, she was a postdoctoral associate in the Database Group at MIT CSAIL, working with Michael Stonebraker and Sam Madden. She received her Ph.D. from Brown University in 2012, where she was advised by Ugur Cetintemel. Before her doctoral studies, she worked as a scientist at the U.S. Navy research base in Newport, RI. Her research interests include scientific data management, database workload modeling, personal data management, and cloud computing. She is especially focused on making data-driven science applications fast and scalable.