Capturing, Visualizing and Querying Scientific Data Provenance

Leon Osterweil
Lori Clarke
University of Massachusetts, Amherst
Barbara Lerner
Mount Holyoke College
Emery Boose
Aaron Ellison
Harvard Forest
Harvard Forest REU

Technology continues to change the way that scientists work. Ubiquitous sensors and wireless networks enable the collection of vast quantities of data at a very fast rate. Scientific programs, ranging from Excel spreadsheets to supercomputer applications, manipulate the collected data to produce scientific results. Scientists can then disseminate both the raw and processed data quickly and to a broad, unknown audience by publishing it on their websites.

Good science requires more than results. It requires reproducibility, verifiability and authentication. Reproducibility is necessary to ensure that the results are not an accidental outcome but the product of genuine, carefully performed experimentation and analysis. Verifiability is necessary to assure that the results really did derive from the data, even when reproducing the experiment is not a viable option. Finally, authentication is necessary to believe that the raw data used in the scientific work is itself valid. Without confidence on these points, data posted on the Internet has no more credibility than a typical Wikipedia article.

Increasingly, data is collected by sensors, downloaded automatically, perhaps run through scripts that perform calibration and cleaning, and then posted for public use on a website without a scientist checking its validity. What can go wrong? An anemometer might freeze in an ice storm and incorrectly report a wind speed of 0. A sensor might slip out of calibration over time, with the amount of slippage unknown until the sensor is shipped back to the manufacturer for calibration tests, most likely long after the data has been made publicly available. And so on. Given the pace at which sensors produce data and programs manipulate it, documentation of the data's provenance must itself be automated, so that there can be some hope of understanding the data and correcting for errors that arise in its collection or handling.
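
As a concrete illustration, a quality-control check for the frozen-anemometer case might look like the following R sketch. The column names (wind.speed, air.temp), the flag text, and the freezing threshold are hypothetical, chosen only to show the idea.

    # Flag suspect wind speed readings: a value of exactly 0 while the air
    # temperature is below freezing may indicate a frozen anemometer rather
    # than calm conditions.  Column names and thresholds are illustrative.
    flag.frozen.anemometer <- function(met.data) {
      suspect <- met.data$wind.speed == 0 & met.data$air.temp < 0
      met.data$wind.speed[suspect] <- NA                 # withhold suspect values
      met.data$qc.flag <- ifelse(suspect, "suspect: possible icing", "ok")
      met.data
    }

    # Example use with a small hand-made data frame:
    met <- data.frame(air.temp = c(5, -3, -4), wind.speed = c(2.1, 0, 3.0))
    flag.frozen.anemometer(met)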

The Project

We are working with researchers at Harvard Forest in Petersham, Massachusetts to explore how to capture, visualize and query the provenance of the data they collect from a variety of sensors. As one example, we have worked with hydrologists to measure the movement of water through an ecosystem, accounting for precipitation, evaporation and stream flow.

We are looking at data provenance at two levels: the low-level computations typically done in R, and the higher-level coordination among scripts, which we capture in processes written in Little-JIL. Information is collected during execution of the Little-JIL processes and the R scripts to document the data's provenance: where the data came from, how it was manipulated prior to dissemination, who was involved, and when.

A wireless sensor network is currently under development for measuring real-time ecosystem water flux at the Harvard Forest Long-Term Ecological Research (LTER) site in Petersham, Massachusetts, USA. This system will integrate ongoing meteorological, hydrological, eddy flux, and tree physiological measurements. Simultaneous measurements in adjoining small watersheds will enable researchers to study variations in water flux caused by differences in topography, soils, vegetation, land use, and natural disturbance history. Frequent sampling will enable study of water flux dynamics at a wide range of temporal scales, from minutes to observe the response of evapotranspiration to light, to days to observe the response of ground water to precipitation and snow melt, to years to observe the response of an ecosystem to climate, reforestation, land use, and natural disturbance.

Little-JIL, a process programming language developed in the LASER research lab at the University of Massachusetts, Amherst, is being used to provide the coordination of the various people and software tools involved in the collection, processing and dissemination of the sensor data. Little-JIL is a graphical language designed to integrate tools and people working in a distributed computing environment, with strong support for abstraction, exception handling and resource management.

Little-JIL is a coordination language, not a computation language. To understand the details of how data is manipulated, we collect provenance from the execution of R scripts. R is a popular scripting language tailored for statistical data analysis.

The provenance data that we collect is stored in a database, along with snapshots of intermediate computations and the source of the R scripts that were executed. The user can then load the provenance data, query it, and view the exact data and scripts used to produce a particular result. We continue to enhance the query capabilities and are developing more sophisticated data-relationship and comparison facilities to further help the scientist.

Data Derivation Graphs

The provenance data that we collect is represented as a Data Derivation Graph (DDG). The nodes of the graph represent either data or processing steps. The edges connect the data to the steps that use it as input or produce it as output. Edges also connect processing steps directly to show control flow relationships.

The DDG shown to the right was generated by running a simple R script demonstrating some basic steps in the use of sensor data (a sketch of such a script appears after the list below). First, reading just the yellow nodes from top to bottom, we can see the outline of these processing steps:

  1. Read the sensor data
  2. Plot the raw data
  3. Adjust the data to account for sensor calibration
  4. Plot the calibrated data
  5. Perform quality control on the data, such as removing outliers
  6. Plot the data after quality control
  7. Fill any gaps in the data stream caused by missing data
  8. Plot the gap filled data
  9. Write the gap filled data to a file
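
The following R sketch shows the general shape such a script might take. The step names mirror those visible in the DDG (read.data, plot.data, and so on), but the function bodies, the file names met-station.csv and gap-filled-data.csv, and the variable name airt are placeholders, not the actual Harvard Forest code.

    # Skeleton of a sensor-processing script with the structure listed above.
    # Function bodies are simplified placeholders, for illustration only.

    read.data <- function(file, start.date, end.date, variable) {
      full <- read.csv(file)
      full[full$date >= start.date & full$date <= end.date, c("date", variable)]
    }

    calibrate <- function(d, variable, offset = 0, gain = 1) {
      d[[variable]] <- gain * d[[variable]] + offset       # linear recalibration
      d
    }

    quality.control <- function(d, variable, min.ok, max.ok) {
      bad <- !is.na(d[[variable]]) &
             (d[[variable]] < min.ok | d[[variable]] > max.ok)
      d[[variable]][bad] <- NA                             # remove outliers
      d
    }

    gap.fill <- function(d, variable) {
      idx <- seq_along(d[[variable]])
      d[[variable]] <- approx(idx, d[[variable]], idx)$y   # linear interpolation
      d
    }

    plot.data <- function(d, variable, title) {
      plot(d[[variable]], type = "l", main = title, ylab = variable)
    }

    raw        <- read.data("met-station.csv", "2012-06-01", "2012-06-30", "airt")
    plot.data(raw, "airt", "Raw data")
    calibrated <- calibrate(raw, "airt", offset = -0.3)
    plot.data(calibrated, "airt", "Calibrated data")
    qc         <- quality.control(calibrated, "airt", min.ok = -50, max.ok = 50)
    plot.data(qc, "airt", "Quality-controlled data")
    filled     <- gap.fill(qc, "airt")
    plot.data(filled, "airt", "Gap-filled data")
    write.csv(filled, "gap-filled-data.csv", row.names = FALSE)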

The purple nodes represent data values that are being manipulated by the script. A data node with an arrow pointing to a processing node indicates that the data was input to the processing step. For example, the read.data step uses a start date, end date, and variable identifying which type of meteorological information to extract from a larger data set. An edge from a processing step to a data node indicates that the data was calculated by that step. For example, we see that the read.data step outputs the data object raw-data.csv, which holds the data read in by read.data that fell within the start and end dates for the variable of interest. This raw data becomes input to the first plot.data step as well as the calibration step.
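
To make the graph structure concrete, a fragment of this DDG could be written down as a pair of node and edge tables. The R sketch below uses the node names just described; the tabular layout is purely illustrative and is not the file format used by DDG Explorer.

    # A tiny DDG fragment as node and edge tables.
    # type is "data" for data values and "operation" for processing steps.
    nodes <- data.frame(
      id   = c("n1", "n2", "n3", "n4"),
      name = c("start.date", "end.date", "read.data", "raw-data.csv"),
      type = c("data", "data", "operation", "data")
    )

    # Edges run from data nodes into the steps that consume them, and from
    # steps out to the data objects they produce.
    edges <- data.frame(
      from = c("n1", "n2", "n3"),
      to   = c("n3", "n3", "n4")
    )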

The brown nodes identify files that either serve as input or are written as output during processing.

Green nodes identify parts of the graph that can be collapsed into a single node to create a more abstract, easier to browse version of the same graph. The DDG shown to the left is the same as the DDG shown to the right, but with the steps involved in analyze.data collapsed into a single node.
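
Collapsing can be thought of as rewriting the edge table: every edge that touches a member of the collapsed group is redirected to a single abstract node, and edges that become internal to the group disappear. A rough R sketch of the idea, using node names directly as identifiers for brevity (again purely illustrative):

    # Collapse a group of nodes into one abstract node by rewriting edges.
    collapse.nodes <- function(edges, group, abstract.id) {
      edges$from[edges$from %in% group] <- abstract.id
      edges$to[edges$to %in% group]     <- abstract.id
      unique(edges[edges$from != edges$to, ])            # drop internal edges
    }

    # Example: fold two analysis steps into a single analyze.data node.
    edges <- data.frame(from = c("raw-data.csv", "calibrate", "quality.control"),
                        to   = c("calibrate", "quality.control", "qc-data.csv"),
                        stringsAsFactors = FALSE)
    collapse.nodes(edges, c("calibrate", "quality.control"), "analyze.data")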

In addition to showing the flow of computation and how the output data is derived from the input data, the nodes are linked to the data and R scripts themselves and can thus be used to navigate to and view them.

Available Software

DDG Explorer is a tool that allows the user to view and query Data Derivation Graphs (DDGs). The DDG notation is general enough to support many languages, but currently we can only create DDGs through the execution of Little-JIL processes or instrumented R scripts.

DDG Explorer has the following functionality:

  1. Viewing DDGs, including expanding and collapsing the abstraction nodes described above
  2. Querying the provenance data, for example to find out how a particular data value was derived
  3. Navigating from nodes in the DDG to the underlying data values and R scripts

To create and work with DDGs created from R scripts, you should download the RDataTracker library and DDG Explorer and their user guides.
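
A typical workflow, sketched below, is to wrap the script's code in the library's initialization and save calls and then open the resulting DDG in DDG Explorer. The function names ddg.init and ddg.save and their arguments are our assumption based on the papers listed below; the RDataTracker user guide is the authority on the actual API.

    # Hedged sketch of a minimally annotated script.  The ddg.init()/ddg.save()
    # calls and their arguments are assumptions; see the user guide.
    library(RDataTracker)
    ddg.init("calibrate-example.R", "ddg")     # begin provenance collection (assumed arguments)

    raw        <- data.frame(airt = c(20.1, 20.4, 19.8))
    calibrated <- raw$airt * 1.02 - 0.3        # a trivial computation to be tracked

    ddg.save()                                 # write the DDG so DDG Explorer can load it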

More Information

API for working with DDGs

Barbara Lerner and Emery Boose, "RDataTracker: Collecting Provenance in an Interactive Scripting Environment", 6th USENIX Workshop on the Theory and Practice of Provenance, Cologne, Germany, June 2014. (Abstract, Paper)

Barbara Lerner and Emery Boose, "RDataTracker and DDG Explorer Capture, Visualization and Querying of Provenance from R Scripts", 5th International Provenance and Annotation Workshop (IPAW '14), Cologne, Germany, June 2014. (Paper)

Xiang Zhao, Emery R. Boose, Yuriy Brun, Barbara Staudt Lerner and Leon J. Osterweil, "Supporting Undo and Redo in Scientific Data Analysis", Workshop on the Theory and Practice of Provenance, 2013. (Abstract)

Xiang Zhao, Barbara Lerner, Leon Osterweil, Emery Boose and Aaron Ellison, "Provenance Support for Rework", 4th USENIX Workshop on the Theory and Practice of Provenance (TaPP '12), Cambridge, Massachusetts, June 2012. (Abstract) (Paper (pdf))

Barbara Lerner, Emery Boose, Leon Osterweil, Aaron Ellison and Lori Clarke, "Provenance and Quality Control in Sensor Networks", Environmental Information Management 2011 Conference, Santa Barbara, California, September 2011. (Abstract) (Paper (pdf))

This project was presented at the New England Undergraduate Computing Symposium (NEUCS'10).

Corietta L. Teshera-Sterne, A Software Engineering Approach to Scientific Data Management, May 2010.

Sofiya Taskova, Capturing, Persisting and Querying the Provenance of Scientific Data, Honors Thesis, May 2012.

Miruna Oprescu, Visualization Tools for Digital Dataset Derivation Graphs, Summer 2012 REU Student

Yujia Zhou, Trees and Bugs in Computers, Summer 2012 REU Student

Snickers, The Blog of an Ecologist Dog, Summer 2012 REU Mascot

Shay Adams, "Capturing Data Provenance from R Script Execution", Summer 2013 REU

Vasco Carinhas, "Quality Control Enforcer: Making data's storyline user friendly", Summer 2013 REU

If you are an undergraduate interested in an interdisciplinary project involving computer science and ecology, join us for the REU at Harvard Forest!

This material is based upon work supported by the National Science Foundation under Award Nos. CCR-0205575, CCR-0427071, and IIS-0705772, the National Science Foundation REU grants DBI-0452254 and DBI-1003938, the Mount Holyoke Center for the Environment, and the Charles Bullard Fellowship in Forest Research at Harvard Forest, a department of Harvard University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, Harvard University or Mount Holyoke College.


blerner@mtholyoke.edu
December 3, 2013