Provenance and Quality Control in Sensor Networks

Barbara Staudt Lerner
Computer Science Department
Mt. Holyoke College

Emery Boose, Aaron Ellison
Harvard Forest, Harvard University

Leon Osterweil, Lori Clarke
Computer Science Department
University of Massachusetts, Amherst


Scientists and society increasingly rely on streaming data from electronic sensors to assess, model, and forecast environmental changes. Because analyses of time-series data require uninterrupted data streams or datasets, scientists regularly fill gaps in the data by substituting modeled values. As modeling increases in complexity, the provenance metadata needed to describe and define the processes used to model data and create derived datasets quickly exceeds what individual flags or groups of flags attached to individual data values can express. In theory, the necessary provenance metadata could be captured in narrative form, but the time and effort required to do so are prohibitive. Scientists instead need a system that captures provenance metadata automatically and lets them query it for useful details. In this paper we describe a system that uses Little-JIL, a process programming language, to rigorously define modeling and data-derivation processes, and a mathematical graph structure, a Data Derivation Graph (DDG), that precisely describes execution histories. Our system and approach support understanding the (potentially) different processes used to create data values, support reasoning about the soundness of these processes, and help ensure that the data processing in sensor networks is reliable and reproducible.
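To make the idea of a derivation graph concrete, the following is a minimal sketch in Python, not the system described in this paper and not Little-JIL. It assumes a graph with two kinds of nodes, data values and process steps, and shows how a gap-filled value could carry a record of the step and the inputs that produced it. All names here (`Node`, `derive`, `lineage`, the timestamps, and the midpoint interpolation) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One vertex of the illustrative graph: a data value or a process step."""
    name: str
    kind: str                          # "data" or "process"
    value: Optional[float] = None      # only meaningful for data nodes
    inputs: List["Node"] = field(default_factory=list)


def derive(step_name, out_name, func, *sources):
    """Apply func to the source values and record the step in the graph."""
    step = Node(step_name, "process", inputs=list(sources))
    result = func(*(s.value for s in sources))
    return Node(out_name, "data", value=result, inputs=[step])


def lineage(node):
    """Return the names of everything a node was derived from, nearest first."""
    seen, order, queue = set(), [], list(node.inputs)
    while queue:
        current = queue.pop(0)
        if id(current) in seen:
            continue
        seen.add(id(current))
        order.append(current.name)
        queue.extend(current.inputs)
    return order


# Two measured values surround a gap at 10:15 (hypothetical readings).
t1 = Node("temp@10:00", "data", value=12.0)
t3 = Node("temp@10:30", "data", value=14.0)

# Fill the gap with a modeled value; the midpoint stands in for a real model.
t2 = derive("linear_interpolation", "temp@10:15",
            lambda a, b: (a + b) / 2, t1, t3)

print(t2.value)
print(lineage(t2))
```

Querying `lineage(t2)` answers the kind of question a single quality flag cannot: which process produced the modeled value and which measurements it depended on. A measured value such as `t1` has an empty lineage in this sketch.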