Capturing and Using Scientific Data Provenance

Barbara Lerner
Elizabeth Fong
Mount Holyoke College
Emery Boose
Aaron Ellison
Harvard Forest
Margo Seltzer
University of British Columbia
Thomas Pasquier
University of Bristol
Joe Wonsil
Carthage College
Orenna Brand
Columbia University

Collecting Provenance from R Scripts with RDataTracker

RDataTracker is an R package whose functions collect data provenance during an R console session or while an R script executes. To use it, the user either records a console session or runs a script stored in a file. When RDataTracker executes a script, the script runs as it normally would, and RDataTracker also creates a JSON file containing the provenance of that execution. It also stores the intermediate values calculated during the execution and saves copies of the script, its input and output files, and any plots created.

Here is an example of how a scientist would collect provenance from an interactive console session:

library(RDataTracker)

# Initializes provenance collection.
prov.init()

...

# Then, the scientist enters normal R code.
calibrated.data <- data * calibration.factor

...

plot.data(calibrated.data, "calibrated-plot.jpeg")

...

# Finally, prov.quit saves the provenance.
prov.quit()

Alternatively, if the script resides in a file named calibrate.R, the user can execute prov.run("calibrate.R") to run the script, collecting provenance as it does so.
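The script-file workflow might look like the minimal sketch below. The contents written to calibrate.R, including the readings and the calibration constant, are hypothetical stand-ins and not the authors' script. The sketch assumes RDataTracker is installed and falls back to base R's source() when it is not, in which case the script still runs but no provenance is collected. Where the provenance is written is left to the package's defaults.

```r
# Sketch only: the script body below is a hypothetical stand-in for calibrate.R.
script.path <- file.path(tempdir(), "calibrate.R")

writeLines(c(
  "data <- c(10.2, 11.5, 9.8)     # hypothetical raw readings",
  "calibration.factor <- 1.05     # hypothetical calibration constant",
  "calibrated.data <- data * calibration.factor"
), script.path)

if (requireNamespace("RDataTracker", quietly = TRUE)) {
  # Assumption: RDataTracker is installed and provides prov.run as described above.
  library(RDataTracker)
  prov.run(script.path)   # runs the script, collecting provenance as it does so
} else {
  # Fallback: plain execution with no provenance collected.
  source(script.path)
}
```

Either way the script itself needs no changes: the same file can be run with or without provenance collection.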