Capturing and Using Scientific Data Provenance

Barbara Lerner
Elizabeth Fong
Mount Holyoke College

Emery Boose
Aaron Ellison
Harvard Forest

Margo Seltzer
University of British Columbia

Thomas Pasquier
University of Bristol

Joe Wonsil
Carthage College

Orenna Brand
Columbia University

Tools that Use Provenance

Fruits of provenance

In this project, we have focused on the collection of provenance from R scripts using a tool called RDataTracker. The collected provenance consists of an execution history of the script, including intermediate data values, copies of scripts used, as well as input and output files and plots created. This information is stored in the file system for later use.

The main focus of our current research is thinking beyond provenance as purely archival data. How can scientists leverage provenance to improve the science?

Understanding and debugging scripts

Provenance can help a scientist understand how a script is behaving. Using this knowledge, the scientist can determine what is causing a script to produce incorrect output, or gain confidence that its behavior is correct.

One tool to help understand script behavior is provViz, a visualizer that allows the scientist to see which lines of code were executed, variable values before and after those lines are executed, and saved copies of input and output files and graphs.

Another tool that allows a scientist to gain a better understanding of their script's behavior is provDebugR. provDebugR is a time-travelling debugger. It allows the user to examine the state of the variables in their script at different points of execution, after the script has completed execution. It provides access to the same information as provViz, but with a user interface that is familiar to anyone used to using a debugger. With this tool a script can be debugged by running it once and examining the collected provenance, rather than setting breakpoints, inserting print statements, and running the script multiple times.
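The idea of inspecting variable state after the fact can be illustrated with a minimal sketch. It is written in Python for illustration only and is not how provDebugR is implemented; the class `VariableHistory`, its methods, and the notion of a numbered execution "step" are all hypothetical names introduced here. The sketch assumes that provenance records each assignment together with the step at which it happened, in execution order.

```python
from bisect import bisect_right


class VariableHistory:
    """Hypothetical store of variable values recorded during one script run."""

    def __init__(self):
        self._steps = {}   # variable name -> steps at which it was assigned
        self._values = {}  # variable name -> values assigned at those steps

    def record(self, step, name, value):
        # Assumption: record() is called in increasing step order.
        self._steps.setdefault(name, []).append(step)
        self._values.setdefault(name, []).append(value)

    def value_at(self, name, step):
        """Return the value `name` held just after `step` finished."""
        steps = self._steps.get(name, [])
        index = bisect_right(steps, step)
        if index == 0:
            raise KeyError(f"{name!r} was not yet defined at step {step}")
        return self._values[name][index - 1]

    def state_at(self, step):
        """Return every variable that was defined at `step`, with its value."""
        state = {}
        for name in self._steps:
            try:
                state[name] = self.value_at(name, step)
            except KeyError:
                pass  # variable is assigned only later in the run
        return state


# Made-up history of a three-assignment run.
history = VariableHistory()
history.record(1, "x", 10)
history.record(2, "y", 3)
history.record(4, "x", 25)

state_after_step_2 = history.state_at(2)
state_after_step_5 = history.state_at(5)
```

Because the whole history is kept, any step can be queried in any order after a single run, which is the property that lets a provenance-based debugger avoid repeated executions.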

Reproducing script execution

An important problem in science is the need for reproducibility. While this is clearly evident in experiments that require the collection and measurement of physical artifacts, it is also a problem for reproducing results from data analysis. Having the data and script is not necessarily sufficient to be able to reproduce the results. Libraries that a script depends on may change, the language implementation may change, and the platform on which the script is executed may change. Any of these changes can result in the script no longer producing the same result. Encapsulator is a tool that builds a virtual machine encapsulating the script, its data, and the libraries it uses, so that the script can be re-run to produce the exact result. An encapsulated script could be associated with a publication, giving scientists the opportunity to verify and build on each other's analytic work in a very concrete way.
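The three sources of drift named above (libraries, language implementation, platform) can be made concrete with a small sketch. It is a Python analogy under stated assumptions, not a description of Encapsulator, which packages the environment into a virtual machine rather than merely describing it. The functions `snapshot_environment` and `changed_items` are hypothetical names; the calls into `platform` and `importlib.metadata` are from the Python standard library.

```python
import platform
from importlib import metadata


def snapshot_environment(package_names):
    """Record the language version, platform, and versions of named packages."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # package is not installed here
    return {
        "language_version": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }


def changed_items(old, new):
    """List what differs between two snapshots as {item: (before, after)}."""
    changes = {}
    for key in ("language_version", "platform"):
        if old[key] != new[key]:
            changes[key] = (old[key], new[key])
    names = set(old["packages"]) | set(new["packages"])
    for name in sorted(names):
        before = old["packages"].get(name)
        after = new["packages"].get(name)
        if before != after:
            changes["package:" + name] = (before, after)
    return changes


# Two made-up snapshots, as if taken when a paper was written and a year later.
at_publication = {
    "language_version": "3.10.4",
    "platform": "Linux-x86_64",
    "packages": {"numpy": "1.22.0"},
}
one_year_later = {
    "language_version": "3.11.2",
    "platform": "Linux-x86_64",
    "packages": {"numpy": "1.24.1"},
}
drift = changed_items(at_publication, one_year_later)
```

A description like this only reveals that the environment has changed. Restoring the original environment is the harder problem that encapsulation in a virtual machine addresses.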

Cleaning script code

Scientists often build scripts in an exploratory way. They may write some code to read data, do some analysis and produce a plot. They will then study the plot and develop more questions which they answer by writing more code, producing more plots, etc. In the end, they may want to write a paper or share code that is based on only some of the output plots. It can be a tedious and error-prone process to manually disentangle the code and be left with just the code that produced the most valuable results. Rclean uses provenance to identify which lines of code were used to produce the output of interest. It then extracts those lines of code from the original script, producing a cleaned version of the code.
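Selecting only the lines that contributed to one output can be sketched as a backward traversal of line-to-line dependencies. The sketch below is in Python and is illustrative only: it is not Rclean's implementation, and the function names, the `depends_on` mapping, and the five-line R script (including the file name `input.csv`) are hypothetical. It assumes provenance can tell us, for each executed line, which earlier lines produced the data it used.

```python
from collections import deque


def backward_slice(depends_on, target_line):
    """Return, in order, every line that target_line transitively depends on."""
    needed = {target_line}
    queue = deque([target_line])
    while queue:
        line = queue.popleft()
        for dependency in depends_on.get(line, ()):
            if dependency not in needed:
                needed.add(dependency)
                queue.append(dependency)
    return sorted(needed)


def clean_script(script_lines, depends_on, target_line):
    """Keep only the lines needed to produce the output of target_line."""
    keep = backward_slice(depends_on, target_line)
    return [script_lines[number - 1] for number in keep]  # lines are 1-indexed


# Hypothetical exploratory R script, held as strings (line numbers 1 to 5).
script = [
    'data <- read.csv("input.csv")',
    'summary(data)',
    'model <- lm(y ~ x, data = data)',
    'hist(data$x)',
    'plot(model)',
]

# Assumed provenance: each line maps to the earlier lines whose data it used.
depends_on = {2: [1], 3: [1], 4: [1], 5: [3]}

# Keep only what is needed for the plot produced on line 5.
cleaned = clean_script(script, depends_on, 5)
```

Here the exploratory `summary` and `hist` calls are dropped because nothing on the path to the final plot uses their results.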