Make Data Usable
This page describes the tasks of the working groups responsible for making BONSAI data usable.
Contents
Product System Algorithms
- Datasets interaction algorithms
- Uncertainty propagation
- Generation of custom geographic data
- Markets and marginal production mixes routines
Product footprints are calculated by multiplying the impacts from each activity in the database by a vector of scaling factors. These scaling vectors (collectively known as the Leontief inverse) can be calculated by matrix inversion of the Direct Requirements Table. A Direct Requirements Table is constructed from a Supply Use Table by applying product system algorithms (collectively described as a system model) for the filling of data gaps, the handling of constraints, the elimination of by-products, and the activity-specific and system-wide balancing of all balanceable properties. The product system algorithms shall guarantee product systems that are balanced physically and economically and provide comparable “product footprints”. Imbalances are initially stored as work-items, minimising the use of re-balancing algorithms and postponing them to the end of the workflow. All applied algorithms are documented and stored under an open-source license.
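As an illustration of this calculation, here is a minimal sketch; the matrices, impact values, and demand vector below are hypothetical placeholders, not BONSAI data structures:

```python
import numpy as np

# Toy direct requirements table A: product inputs per unit of product output.
A = np.array([[0.1, 0.2],
              [0.3, 0.1]])
# Toy environmental extension B: impact (e.g. kg CO2-eq) per unit of activity.
B = np.array([[5.0, 2.0]])
# Final demand y: one functional unit of the first product.
y = np.array([1.0, 0.0])

# Scaling factors: solve (I - A) x = y instead of forming the full
# Leontief inverse (I - A)^-1 explicitly, which is cheaper and more stable.
x = np.linalg.solve(np.eye(A.shape[0]) - A, y)
footprint = B @ x   # product footprint for the demanded product
print(footprint)
```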
Priority: Intermediate
Estimated person-hours: ...
Volunteer(s)/Candidate(s): Stefano Merciai, Jannick Schmidt
Description of task: Algorithms handling the interaction between:
- global and local datasets, e.g. production volume conflicts
- datasets for different time periods, for now-casting and scenario-forecasting purposes, and the temporal linking of data for stock modelling (see "Storing stock data" in Big Data Harvesting)
- the temporal linking of waste treatment datasets, e.g. handles flows between activities with different time periods
- handling of temporal delays in impact pathways e.g. stocks
- propagation of flow properties (propagating e.g. wood density or heating values from the supplying activity to the receiving activity, or vice versa, and handling conflicts between such property information), and handling the interaction between flow-property layers (see the sketch after this list)
- matching geographically specified data (handling how flows are specified between source and sink activities at different locations, "transforming activity - market - transforming activity" and "environmental mechanism - source/sink mixes - environmental mechanism")
- procedures for generating data for future macroeconomic scenarios
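A minimal sketch of the property propagation mentioned above, assuming a single scalar property (e.g. a heating value in MJ/kg); the function name, tolerance, and work-item structure are illustrative assumptions:

```python
def propagate_property(supplier_value, receiver_value, rel_tolerance=0.05):
    """Propagate a flow property between activities; return (value, work_item)."""
    if receiver_value is None:
        return supplier_value, None        # propagate supplier -> receiver
    if supplier_value is None:
        return receiver_value, None        # propagate receiver -> supplier
    if abs(supplier_value - receiver_value) <= rel_tolerance * abs(supplier_value):
        return (supplier_value + receiver_value) / 2, None  # values agree
    # Conflict: store as a work-item instead of rebalancing immediately,
    # in line with postponing re-balancing to the end of the workflow.
    return None, {"supplier": supplier_value, "receiver": receiver_value}

print(propagate_property(18.5, None))    # (18.5, None): simple propagation
print(propagate_property(18.5, 25.0))    # (None, {...}): conflict -> work-item
```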
Technical specifications: The diagram below shows the current algorithm handling the dataset interactions for the conversion of the EXIOBASE monetary tables into hybrid units.
Questions/Discussions:
- If we want to base ourselves on supply-use data prior to any constructs, what procedures can we use for back-calculating supply-use data for countries where only direct requirement matrices are available?
- How can we create more efficient algorithms? Much of machine learning relies on gradient descent algorithms that run on GPUs, and there are many packages that allow for working with matrix data directly; see the Torch packages maths.md.
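As one possible direction for the back-calculation and GPU questions above, here is a hedged sketch of gradient-descent reconciliation in PyTorch; the tables, loss function, and constraint handling are illustrative assumptions (the toy problem is trivially solvable; the point is the machinery, which also runs on GPUs):

```python
import torch

A_known = torch.tensor([[0.1, 0.2],     # known direct requirements matrix
                        [0.3, 0.1]])
output = torch.tensor([100.0, 50.0])    # known total output per industry

U = torch.rand(2, 2, requires_grad=True)          # unknown use table
opt = torch.optim.Adam([U], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    A_implied = U / output                        # column-wise coefficients
    loss = ((A_implied - A_known) ** 2).sum()     # fit the known matrix
    loss = loss + torch.relu(-U).sum()            # penalise negative entries
    loss.backward()
    opt.step()

print(U.detach())   # back-calculated use table consistent with A_known
```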
Priority: Intermediate
Estimated person-hours: ...
Volunteer(s)/Candidate(s):
Description of task: Algorithms handling the propagation of uncertainty from raw data to calculated/reconciled data, and handling conflicts between flow properties arising when data propagate (e.g. wood density, heating values, etc., from the supplying activity to the receiving activity or vice versa).
Technical specifications:
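A minimal Monte Carlo sketch of such propagation; the distributions and parameter values are illustrative assumptions, not BONSAI data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical raw-data uncertainties for a wood flow.
density = rng.lognormal(mean=np.log(450.0), sigma=0.1, size=n)  # kg/m3
heating_value = rng.normal(loc=18.5, scale=0.5, size=n)         # MJ/kg

# Propagated property: energy content per cubic metre.
energy_per_m3 = density * heating_value                         # MJ/m3
print(f"median: {np.median(energy_per_m3):.0f} MJ/m3")
print(f"95% interval: {np.percentile(energy_per_m3, [2.5, 97.5]).round()}")
```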
Priority: Low
Estimated person-hours: ...
Volunteer(s)/Candidate(s): Chris Mutel
Description of task: Implement a tool for handling geographical locations (e.g. how to define the Rest of the World: global, or Europe without Switzerland?)
Matching geographically specified data (handling how flows are specified between source and sink activities at different locations, "transforming activity - market - transforming activity" and "environmental mechanism - source/sink mixes - environmental mechanism")
Technical specifications:
An example repository contains the scripts and data needed to build a consistent topology of the world (provinces, countries, and states). It also includes the ability to define recipes for generating custom locations. The repository is a mix of SQL, bash scripts, and Python; see the file topology-journal.rst for instructions and a journal of what was done and why.
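A hedged sketch of the recipe idea; the location codes and recipe format below are illustrative, not the repository's actual data model:

```python
# Toy universe of location codes.
WORLD = {"CH", "DE", "FR", "IT", "CN", "US"}
DEFINED = {"GLO": WORLD, "EU": {"DE", "FR", "IT"}}

def resolve(recipe):
    """Resolve a custom location defined as a base set minus exclusions."""
    base = DEFINED.get(recipe["base"], WORLD)
    return base - set(recipe["exclude"])

# "Rest of the World" relative to Switzerland, and Europe without Switzerland.
print(resolve({"base": "GLO", "exclude": ["CH"]}))
print(resolve({"base": "EU", "exclude": ["CH"]}))
```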
Priority: Low
Estimated person-hours: ...
Volunteer(s)/Candidate(s): Bo, Stefan, Konstantin, Stefano
Description of task: Routines for markets and marginal production mixes. Identify marginal suppliers as the weighted average of the suppliers that can change their capacity.
Technical specifications: Implement the validations required to deal with all possible co-production situations. Represent marginal suppliers in matrix form to match the SUT.
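A minimal sketch of the weighted-average rule; the supplier records and field names are hypothetical placeholders:

```python
# Hypothetical suppliers of one product; "can_change_capacity" marks
# unconstrained suppliers that can respond to a change in demand.
suppliers = [
    {"name": "A", "production": 60.0, "can_change_capacity": True},
    {"name": "B", "production": 30.0, "can_change_capacity": True},
    {"name": "C", "production": 10.0, "can_change_capacity": False},
]

unconstrained = [s for s in suppliers if s["can_change_capacity"]]
total = sum(s["production"] for s in unconstrained)

# Marginal mix: weighted average over unconstrained suppliers only.
marginal_mix = {s["name"]: s["production"] / total for s in unconstrained}
print(marginal_mix)   # {'A': 0.667, 'B': 0.333}; constrained C is excluded
```

Represented in matrix form, such a mix could become one market column matching the SUT layout.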
The query parser on the triplestore will allow users to query the database to extract the data that will be used for the Direct Requirement Matrix. The matrix is a linked version of the Supply Use Table, representing a linear, homogeneous, steady-state model of the economy, where each activity has only one product output and all product inputs to an activity are the products of other activities, thus providing a linked product system for each product of each activity. The SPARQL Protocol and RDF Query Language will be used as the query language of the Semantic Web.
Priority: High
Estimated person-hours: 150
Volunteer(s)/Candidate(s): Chris Davis
Functional specifications: The query parser should allow users to query the database to extract the data that will be used for the Direct Requirement Matrix.
Technical specifications: Make a JSON API available for querying and accessing the RDF store (see CKAN). As SQL queries relational databases, SPARQL queries RDF data. The result of a SPARQL query can be a result set (as in SQL), but it can also be an RDF graph, i.e. a subset of the original graph queried. A sketch of such a JSON query follows the list below.
For practicality and speed, it may be preferable to place a Solr instance on top of the triple store.
- Extract data from the SUT, using a graphical user interface with pre-defined queries.
- It should be possible to filter data by license type, in order to look only for open-source information, or only for data by a particular user (including reprocessing of data after filtering)
- BONSAI data should be as open as possible, but allowing for more restrictive datapoints as well, so that the license applicable to any query result will depend on the datapoint with the most restrictive license. A query can then filter out those data points that have too restrictive a license for the intended application. The practical implementation of this requires a definite hierarchy of license types.
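A sketch of the JSON query mentioned above, against a hypothetical Fuseki-style SPARQL endpoint (the endpoint URL and dataset name are placeholders); endpoints of this kind accept a query parameter and can return results as JSON:

```python
import requests

endpoint = "http://localhost:3030/bonsai/query"   # hypothetical dataset
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

resp = requests.get(
    endpoint,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding)   # one JSON object per variable binding
```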
Test: Using R, the WIOD database files were converted to RDF; with a Fuseki server placed in front, it is possible to query the RDF database.
Demonstrate SPARQL queries on the RDF store (see also the sketch after this list):
- SPARQL by example by Cambridge Semantics
- Using SPARQL with Enipedia
- SPARQL Queries for Statistics
- The speed could be a limitation, but there are possibilities to reduce the file size.
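For a self-contained demonstration, here is a hedged sketch using rdflib on a local RDF export; the file name, vocabulary, and property names are assumptions for illustration only:

```python
from rdflib import Graph

g = Graph()
g.parse("bonsai_sut.ttl", format="turtle")   # hypothetical local RDF export

# Extract input flows per activity, as raw material for the
# Direct Requirement Matrix; the b: vocabulary is made up.
query = """
PREFIX b: <http://example.org/bonsai/>
SELECT ?activity ?product ?amount
WHERE {
    ?flow b:inputOf ?activity ;
          b:objectType ?product ;
          b:amount ?amount .
}
"""
for activity, product, amount in g.query(query):
    print(activity, product, amount)
```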
The natural language interface, also called the footprint query interface, is an Application Programming Interface (API). It should allow users to obtain footprints from a free-text search. To allow for speedier access, the pre-calculated footprints will be stored in a separate database (which could be a relational database). The user should be able to ask a simple free-text question like "What is best: a match or a lighter?" The interface will then use fuzzy text matching to find the most likely corresponding pre-calculated footprints for a match and a lighter, using default (functional) units. This implies that semi-automatic matching routines will be developed to assist users in matching missing queried objects with the closest proxy available in the database. The result page will provide a visualisation of the product externalities as a percentage of the product’s price. The footprint query system will also be made available through a mobile phone application.
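A minimal sketch of the fuzzy matching step, using Python's standard difflib; the product names and footprint values are illustrative placeholders:

```python
import difflib

# Hypothetical pre-calculated footprints (g CO2-eq per default functional unit).
footprints = {
    "match (1 unit)": 0.8,
    "lighter (1 unit)": 45.0,
    "candle (1 unit)": 12.0,
}

def lookup(term):
    """Return the closest pre-calculated footprint for a free-text term."""
    hits = difflib.get_close_matches(term, footprints, n=1, cutoff=0.4)
    return (hits[0], footprints[hits[0]]) if hits else (None, None)

for term in ("match", "ligther"):   # note the deliberate misspelling
    print(term, "->", lookup(term))
```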
Priority: High
Estimated person-hours: 200h according to Eaternity's "Minimal Viable Product" (MVP); this corresponds to 25 story points (SP), each corresponding to one day
Volunteer(s)/Candidate(s): (preliminary) Manuel Klarmann, Dominik Stefancik, Matthias Munder, Jens Hinkelmann
- Free text interface with query parser on pre-calculated footprints
- Match between user queries and the products available in the database.
- Allow users to query the database using natural language free-text search
It provides, for queries concerning a single product, externalities expressed as a percentage of the price of the product (price = 100%, externality = e.g. 43% on top of price), with uncertainty, presented graphically.
Functional specifications: Interface to show product comparison (e.g. "match vs lighter"). For graphics:
- visualizing the two products side by side.
- visualizing one product in comparison with the average of all products of the same category (e.g. human vs average of country)
- visualizing one product in comparison with the best performing 20% of all products of the same category
- Auto-generated suggestions for improving a specific search may be proposed to the user, such as specifying units, or specifying geography and time.
- Graphical tools for the Direct Requirement Matrix and Leontief inverse matrix, providing nice visualisations like Sankey diagrams (examples by Brandon Kuczenski, Google, or pymrio); initially use existing tools to the extent possible.
- An important challenge is to visualise the meaning of negative flows in an intuitive way, e.g. by using name changes instead of sign changes in the user interface.
- Allow users to extract data from the SUT, using a graphical user interface with pre-defined queries. Users should only see the download link.
- For product comparison, if uncertainty allows a clear statement, indicate which option has the smallest footprint (understood as social cost), possibly with an additional graphic as above. If uncertainty does not allow a clear statement about which option has the smallest footprint, the graph should show an indication of the largest contributor to the uncertainty (see the sketch after this list).
- Graphical access to the Direct Requirement Matrix and Leontief inverse matrix, using existing user-friendly visualisations like Sankey diagrams or pymrio; visualise the meaning of negative flows in an intuitive way.
- See Research Data Alliance recommendations on Data citation of evolving data.
- Introduce BONSAI core data and product footprint data into Wikipedia
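A hedged sketch of the comparison rule from the product-comparison bullet above; the footprint samples and the 95% threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(10.0, 1.0, size=5000)   # Monte Carlo footprint samples, product A
b = rng.normal(12.0, 1.0, size=5000)   # Monte Carlo footprint samples, product B

share_a_smaller = np.mean(a < b)       # probability that A has the smaller footprint
if share_a_smaller > 0.95:
    print("A has the smaller footprint")
elif share_a_smaller < 0.05:
    print("B has the smaller footprint")
else:
    print("Uncertainty is too large for a clear statement; "
          "show the largest contributor to the uncertainty instead.")
```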
Specifications: BONSAI could be linked to other general data providers like Wolfram Alpha (open to any additional data source).
Description of task: Both the raw data and the calculated product footprints shall be available to the general public in an easily accessible way (e.g. CSV). The interface should also allow users to download the entire database. Users should only see the download link.
Technical specifications:
The purpose is to support specific user communities, both within the existing LCA/footprint community and outside it (for example in the material flow and economic modelling communities), in designing web-tools for interacting with the basic data of the BONSAI database. A higher number of users increases the interest in maintaining and improving the database.
The aim is to allow users to specify additional data filters and system algorithms for specific needs (for example, using only data that have been additionally reviewed, algorithms that support specific legal requirements, and commercial add-ons to the otherwise Open Source database).
- Address CGE modellers: cross-elasticities can be interesting, as well as dynamic aspects and scenario vs. short-term perturbation.
- Address detailed IAM modellers: MARKAL, IMAGE.