Capturing and Using Scientific Data Provenance

Barbara Lerner
Elizabeth Fong
Mount Holyoke College

Emery Boose
Aaron Ellison
Matt Lau
Harvard Forest

Margo Seltzer
University of British Columbia

Thomas Pasquier
University of Bristol

Joe Wonsil
Carthage College

Orenna Brand
Columbia University

Introduction to Data Provenance

Technology continues to change the way that scientists work. Nearly all scientific data are analyzed with computers, and increasingly data are collected directly in electronic form. Sensor networks are a good example: they may use electronic sensors and wireless networks to collect vast quantities of data at a very fast rate. Scientific programs, ranging from Excel spreadsheets to supercomputer applications, manipulate the collected data to produce scientific results. Scientists can then disseminate both the raw and processed data quickly to a broad, unknown audience by publishing them on their websites.

Data processing workflow

Good science requires more than results. It requires reproducibility, verifiability and authentication. Reproducibility is necessary to ensure that the results are not an accidental outcome but the product of genuine, carefully performed experimentation and analysis. Verifiability is necessary to assure that the results really did derive from the data, even if reproducing the experiment is not a viable option. Finally, authentication is necessary to establish that the raw data used in the scientific work are themselves valid. Without confidence in these three properties, data posted on the Internet are no more credible than a typical Wikipedia article.

For example, data may be collected by sensors, downloaded to a computer, perhaps run through some scripts that perform calibration and cleaning, and then posted for public use on a website, all without a scientist checking their validity. What can go wrong? An anemometer might freeze in an ice storm and incorrectly report a wind speed of 0. A sensor might slip out of calibration over time, but the amount of slippage will remain unknown until the sensor is shipped back to the manufacturer for calibration tests, most likely long after the data have been made publicly available. Bugs in the software that processes the data may lead to incorrect conclusions. And so on. Given the pace at which sensors produce data and programs manipulate them, documentation of the data's provenance must itself be automated if there is to be any hope of understanding the data and correcting for errors that arise in their collection or handling.
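
To make the idea of automated provenance concrete, here is a minimal Python sketch, not the tooling used in this project. It assumes a hypothetical cleaning step, flag_stuck_zeros, that treats a long run of zero wind speeds as a possibly frozen anemometer, and a hypothetical helper, run_step, that records which step ran, with which parameters, on which input and output. The run-length threshold and the record fields are illustrative assumptions.

```python
import hashlib
import json


def fingerprint(values):
    """Return a short SHA-256 digest identifying a list of readings."""
    encoded = json.dumps(values).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def flag_stuck_zeros(speeds, min_run=3):
    """Replace runs of at least min_run consecutive zeros with None.

    A long run of exact zeros is treated as a possibly frozen anemometer.
    The threshold is an illustrative assumption, not a field-tested rule.
    """
    cleaned = list(speeds)
    start = None
    # The trailing None is a sentinel that closes a run ending at the last reading.
    for i, value in enumerate(list(speeds) + [None]):
        if value == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                for j in range(start, i):
                    cleaned[j] = None
            start = None
    return cleaned


def run_step(name, func, data, log, **params):
    """Apply func to data and append a provenance record to log."""
    result = func(data, **params)
    log.append({
        "step": name,
        "params": params,
        "input": fingerprint(data),
        "output": fingerprint(result),
    })
    return result


provenance = []
raw = [2.1, 0, 0, 0, 0, 3.4, 0, 1.8]  # made-up wind speeds for illustration
cleaned = run_step("flag_stuck_zeros", flag_stuck_zeros, raw, provenance, min_run=3)
print(cleaned)
print(json.dumps(provenance, indent=2))
```

Because the record is written by the same call that transforms the data, the log cannot drift out of step with the processing. Anyone who later questions a published value can see which cleaning rule touched it and with what settings.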

If you are an undergraduate interested in an interdisciplinary project involving computer science and ecology, join us for the Summer Research Program (REU) at Harvard Forest!

We have been fortunate to have had many terrific REU students work on this and related projects:

Cory Teshera-Sterne, Mount Holyoke College
Morgan Vigil, Westmont College
Sofiya Taskova, Mount Holyoke College
Andy Kaldunski, Ripon College
Garrett Rosenblatt, University of Rochester
Miruna Oprescu, Harvard University
Yujia Zhou, Dickinson College
Shay Adams, Mount Holyoke College
Vasco Carinhas, Universidad de Puerto Rico en Arecibo
Luis Perez, Harvard University
Nikki Hoffler, Mount Holyoke College
Lia Poulos, Mount Holyoke College
Marios Dardas, Holy Cross College
Alex Liu, Amherst College
Moe Pwint Phyu, Mount Holyoke College
Connor Gregorich-Trevor, Grinnell College
Jen Johnson, Middlebury College
Orenna Brand, Columbia University
Joe Wonsil, Carthage College

This material is based upon work supported by the National Science Foundation under Awards No. CCR-0205575, CCR-0427071, and IIS-0705772, the National Science Foundation REU grants DBI-0452254 and DBI-1003938, the Mount Holyoke Center for the Environment, and the Charles Bullard Fellowship at Harvard University. It is a contribution from the Harvard Forest Long-Term Ecological Research (LTER) program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, Harvard University or Mount Holyoke College.

blerner@mtholyoke.edu
October 9, 2018