Capturing and Using Scientific Data Provenance

Barbara Lerner
Elizabeth Fong
Mount Holyoke College
Emery Boose
Aaron Ellison
Harvard Forest
Margo Seltzer
University of British Columbia
Thomas Pasquier
University of Bristol
Joe Wonsil
Carthage College
Orenna Brand
Columbia University

Project Overview

Bringing Provenance Collection to Scientists

[Image: Harvard Forest sensor processing]

In this project, computer scientists from Mount Holyoke College and the University of British Columbia are collaborating with environmental scientists from Harvard Forest to explore how to capture and productively use scientific data provenance.

Scientists routinely collect "factual" data provenance, such as who collected the data, where, when, and with what instruments. Increasingly, scientists are using workflow tools such as Kepler, Taverna, and VisTrails to collect workflow provenance describing how data moves between websites and a variety of tools. However, little work has been done on collecting provenance about the analysis steps performed within tools that carry out potentially complex data analyses and transformations. We are addressing this gap by collecting provenance from the execution of R, a language widely used for statistical analysis and plotting.

The provenance data that we collect are stored in files, along with copies of input and output data and plots, snapshots of intermediate computations, and the source of the R scripts that are executed. In addition to archiving the provenance, we are developing tools that use it to help scientists clean and debug their code and to create a virtual machine in which the exact computation can be reproduced.