
Ensure Data Quality


This page describes the tasks for the working groups responsible for ensuring the BONSAI data quality.

Contents
Global Impact Assessment
Validation
Data Review
Uncertainty

Global Impact Assessment (IA)

The impact assessment method describes the impact pathways in the form of mathematical relations between contributors (identified by the inventory data) and final impacts caused.

Import of IA models

  • Priority: High;
  • Estimated person-hours: 50
  • Volunteer(s)/Candidate(s): Marie de Saxcé (2.-0 team)

Functional specifications: Convert an impact model (e.g. STEPWISE) into a programming language that allows the model to be used on the database. We could already convert impact assessment factors to activity datasets for inclusion in the database matrix: if the STEPWISE formulas are available, the factors could be calculated as extensions. At least one impact assessment method will be imported into the triple store.
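
As an illustration of calculating impact assessment factors as extensions, the following is a minimal sketch of the standard matrix-based calculation in Python with NumPy. All matrices and numbers are invented for illustration and do not come from STEPWISE or the BONSAI database.

```python
import numpy as np

# A: technosphere matrix (products x activities)
# B: extension matrix (elementary flows x activities)
# C: characterisation matrix (impact categories x elementary flows)
A = np.array([[1.0,  0.0],
              [-0.2, 1.0]])   # illustrative: the product uses 0.2 kWh electricity per kg
B = np.array([[0.5, 1.1]])    # illustrative: kg CO2 emitted per unit of each activity
C = np.array([[1.0]])         # illustrative GWP factor for CO2 (kg CO2-eq per kg)

f = np.array([1.0, 0.0])      # final demand: 1 kg of the product

s = np.linalg.solve(A, f)     # scaling vector: how much each activity runs
g = B @ s                     # life cycle inventory of elementary flows
h = C @ g                     # characterised impacts (h = C B A^-1 f)
print(h)                      # [0.72]
```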

  • Technical specifications: Make the STEPWISE formulas available to allow conversion into a programming language.

**Done in Python for Stepwise 1.6; see the GitHub BONSAI repository.**
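
Since at least one impact assessment method is to be imported into the triple store, here is a minimal sketch with rdflib of how a single characterisation factor might be expressed as triples. The namespace, URIs, and property names are hypothetical placeholders, not the actual BONSAI ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

BONSAI = Namespace("http://example.org/bonsai/")  # hypothetical namespace

g = Graph()
cf = BONSAI["cf/stepwise-1.6/co2-gwp100"]         # hypothetical URI
g.add((cf, RDF.type, BONSAI.CharacterisationFactor))
g.add((cf, BONSAI.method, Literal("STEPWISE 1.6")))
g.add((cf, BONSAI.flow, BONSAI["flow/carbon-dioxide"]))
g.add((cf, BONSAI.impactCategory, BONSAI["impact/gwp100"]))
g.add((cf, BONSAI.value, Literal(1.0, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```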

Long-term requirements:

  • Regionalization of Impact Assessment
  • Inclusion of temporal aspects of Impact Assessment
  • Allow for flexible system boundary between nature and culture

Validation

  • Priority: Intermediate;
  • Estimated person-hours: ...
  • Volunteer(s)/Candidate(s): Stefan Pauliuk

Description of task: Investigate the potential for automatic correction routines. Design software that scans for anomalies in the datasets.

Beta-version:

  • Specify which validations are done during parsing of the data and implement these rules as part of the parsing (see the sketch after this list).
  • Validation results need to be reported to the data provider in a comprehensive and useful manner.
  • Manage provenance and traceability of edits for large datasets.
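
A minimal sketch of how such parsing-time validation rules could be implemented and their messages collected for reporting back to the data provider. The exchange structure, the allowed-unit list, and the rule set are illustrative assumptions.

```python
def check_unit(exchange):
    """Formal check: the unit must come from a controlled vocabulary."""
    allowed = {"kg", "MJ", "kWh", "m3", "unit"}   # illustrative list
    if exchange.get("unit") not in allowed:
        return f"unknown unit {exchange.get('unit')!r}"

def check_amount(exchange):
    """Formal check: the amount must parse as a number."""
    try:
        float(exchange["amount"])
    except (KeyError, TypeError, ValueError):
        return f"amount is not a number: {exchange.get('amount')!r}"

RULES = [check_unit, check_amount]

def validate(exchange):
    """Run all rules during parsing; collect messages for the provider."""
    return [msg for rule in RULES if (msg := rule(exchange)) is not None]

print(validate({"amount": "1.2e3", "unit": "kg"}))   # [] -> passes
print(validate({"amount": "n/a", "unit": "lbs"}))    # two messages
```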

Long-term requirements:

  • Introduce plausibility checks (cross-checks with typical or expected data)
  • Derive new validation rules based on previously identified errors
  • Translate the current GAMS validation routines into Python

Technical specifications:

  • ecoinvent has registered more than 100 validation rules, including formatting rules, internal consistency checks, consistency checks against the database, and mass balance checks. These procedures do not require any additional knowledge, e.g. the procedure to indicate and validate the natural unit for each flow.

  • The number of checks should be reduced to lower the threshold for uploading. Define minimum validation requirements. Implement ontological validation, validation of the formal structure (string format, number format, units, ...).

  • First step: ontological validation, formal structure (string format, number format, units, ...)

  • Second step: plausibility of the dataset. Cross-check with typical or expected data: production volume should not exceed the production volume from the IO background, electricity generation from fossil fuels should come with GHG emission figures, etc. (see the sketch after this list).

  • Not fully validated datasets can be 'parked' for further review.
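
A minimal sketch of the second-step plausibility checks and the 'parking' of datasets that do not fully validate. The dataset structure, flow names, and the IO background figure are illustrative assumptions.

```python
def check_fossil_ghg(dataset):
    """Electricity generation from fossil fuels should report GHG emissions."""
    ghg = {"carbon dioxide", "methane", "dinitrogen monoxide"}
    if dataset["activity"] == "electricity, fossil" and not any(
        e["flow"] in ghg for e in dataset["emissions"]
    ):
        return "fossil electricity without GHG emission figures"

def check_production_volume(dataset, io_background_volume):
    """Production volume should not exceed the IO background total."""
    if dataset["production_volume"] > io_background_volume:
        return "production volume exceeds IO background"

def plausibility(dataset, io_background_volume):
    issues = [m for m in (check_fossil_ghg(dataset),
                          check_production_volume(dataset, io_background_volume))
              if m is not None]
    # Not fully validated datasets are 'parked' for further review.
    dataset["status"] = "parked" if issues else "validated"
    return issues

ds = {"activity": "electricity, fossil", "production_volume": 90.0,
      "emissions": []}
print(plausibility(ds, io_background_volume=100.0), ds["status"])
```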


Data Review

  • Priority: Intermediate
  • Estimated person-hours: ...
  • Volunteer(s)/Candidate(s): Bo Weidema, BONSAI team

Description of task: A crucial feature for supplying new data is to allow editing of user contributions. User contributions should be tagged in order to distinguish them from data scraped from external databases.

  • Reviews should be suggested by the software that screens the datasets for anomalies (see the long-term requirements of the Validation package).
  • Once corrections are made, they are recorded in a log entry.
  • To provide an incentive to review data, a reviewer's log should also be present to record users' feedback, i.e. whether the user acknowledges or dis-acknowledges the dataset (a sketch of both log records follows this list).
  • The threshold and review requirements should be defined:
    • For large datasets the reviews should be performed by expert reviewers.
    • Anonymous reviewing could be allowed if the impact of the dataset on the region total is below a certain threshold.
    • For validation at the data point/field level, review could be a simple yes/no decision based on a gaming approach.
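
A minimal sketch of the edit log and the reviewer's log described above. The field names are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EditLogEntry:
    """Records a correction so edits stay traceable (provenance)."""
    dataset_id: str
    editor: str
    field_name: str
    old_value: str
    new_value: str
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class ReviewLogEntry:
    """Records reviewer feedback; at the data point level this can be
    a simple yes/no decision."""
    dataset_id: str
    reviewer: str        # could be "anonymous" below the impact threshold
    acknowledged: bool   # True = acknowledges, False = dis-acknowledges
    comment: str = ""

edit = EditLogEntry("ds-001", "user42", "unit", "lbs", "kg")
review = ReviewLogEntry("ds-001", "anonymous", True)
```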

Ideas to incentivize editing:

  • Create a list of top contributors, e.g. companies that want to promote their products on BONSAI.
  • Relate to publications, add DOI that refers to your contribution.
  • Relate to ResearchGate, Github etc. where you can publish your contributions.
  • Create a Game With A Purpose (GWAP) to make it something ordinary people would like to work on, see Crowd science user contribution patterns and their implications.

Technical specifications:

  • Register multiple conflicting values for one data point with version control; any data could be mapped to a root classification (the user then defines filters; we can also define an algorithm to arrive at a preferred ideal version). See the sketch after this list.
  • Batch editing (across activities, across flows) needs to be supported. Wikis work well for editing one page at a time; other platforms/techniques will be needed for batch editing. The R Semantic MediaWiki Bot by Chris Davis offers one solution.
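
A minimal sketch of registering conflicting values for one data point together with their source tags, plus a simple algorithm for selecting a preferred version (here: an optional source filter first, then the number of positive reviews). The data layout and the preference rule are illustrative assumptions.

```python
from collections import defaultdict

# Every contributed value is kept, tagged with its source, so user
# contributions stay distinguishable from scraped data.
values = defaultdict(list)
values[("ds-001", "amount")] += [
    {"value": 1.2, "source": "user:alice",       "reviews": 3},
    {"value": 1.5, "source": "scraped:exiobase", "reviews": 0},
]

def preferred(candidates, source_filter=None):
    """Apply the user's filter, then prefer the most-reviewed value."""
    pool = [c for c in candidates
            if source_filter is None or c["source"].startswith(source_filter)]
    return max(pool or candidates, key=lambda c: c["reviews"])

print(preferred(values[("ds-001", "amount")]))              # user value, 3 reviews
print(preferred(values[("ds-001", "amount")], "scraped:"))  # filtered to scraped
```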

Uncertainty

  • Priority: Low/Intermediate
  • Estimated person-hours: 150
  • Volunteer(s)/Candidate(s): Andreas Ciroth

Description of task: Discuss and document available and assumed model uncertainty. Refer to the UNEP/SETAC working group for input.
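
One way the assumed model uncertainty could be explored and documented is Monte Carlo propagation of parameter uncertainty through the inventory calculation; a minimal sketch follows. The lognormal distribution, its parameters, and the matrices are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
samples = np.empty(n)
for i in range(n):
    # Illustrative: lognormal uncertainty on one technosphere coefficient.
    a = rng.lognormal(mean=np.log(0.2), sigma=0.1)
    A = np.array([[1.0, 0.0], [-a, 1.0]])
    B = np.array([[0.5, 1.1]])
    g = B @ np.linalg.solve(A, np.array([1.0, 0.0]))
    samples[i] = g[0]

print(samples.mean(), samples.std())  # propagated inventory uncertainty
```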

Technical specifications:

Other forms of uncertainty:

  • http://www.tandfonline.com/doi/full/10.1080/09535314.2014.935299#abstract
  • http://www.tandfonline.com/doi/full/10.1080/09535314.2014.934325#abstract
  • https://www.iioa.org/conferences/intermediate-2008/pdf/5d4_Weber.pdf
  • http://dx.doi.org/10.1162%2F10881980052541981