
Version control strategies: only the essential minimum is needed

Tim L edited this page Jun 21, 2014 · 91 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

See https://help.github.com/articles/working-with-large-files

If you are consuming RDF produced by csv2rdf4lod-automation, you should be able to reproduce that exact RDF on your own, for your own purposes, without relying on those who provided you the data.

By using declarative, RDF-encoded conversion parameters, we can share a concise description of how to interpret tabular data provided by another source (without using custom software). By following the Directory Conventions that organize others' data by source, dataset, and version, anyone can share the skeleton of their collection using version control systems such as svn, git, or mercurial.

When the data skeleton is available in a version control system, anyone can reconstruct the RDF data by checking out the skeleton and invoking csv2rdf4lod.

The Linked Open Biomedical Data project was the first to apply version control to its [csv2rdf4lod data root](csv2rdf4lod-automation data root) and is discussed at http://code.google.com/p/twc-lobd/. The Linking Open Government Data project is also using the same version control strategy for its data. See csv2rdf4lod in use for a list of other repositories (SVN, git, etc) that contain [data roots](csv2rdf4lod-automation data root).

This page shows how to consume a csv2rdf4lod data root that is under version control. Although the examples use svn, similar operations can be applied using other version control systems. We'll cover:

  • Checking out only the "tip" of the data skeleton (to list the sources)
  • Checking out a single dataset
  • Reconstructing the RDF conversion from source
  • New considerations for the CSV2RDF4LOD environment variables
  • Considerations for what to put under version control
  • Suggestions for how to organize your data project on your deployment server
  • Transitioning data downloads and conversions to a new server

Checking out only the "tip" of the data skeleton

The following command will create your working copy of LOGD's full data skeleton without obtaining any dataset-specific data, enhancement parameters, documentation, or custom code. This is a low-bandwidth way to effectively "list the source organizations". The directories appearing in the source/ directory correspond to conversion:source_identifiers of the datasets available in this data root.

$ mkdir escience-logd-svn
$ cd escience-logd-svn
$ svn checkout --depth=immediates https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source
A    source/opengov-se
A    source/env-gov-bc-ca
A    source/toronto-ca
A    source/data-octo-dc-gov
...
A    source/london-ca
A    source/readme.txt
A    source/portalu-de
Checked out revision 3549.
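
Once the tip is checked out, the source identifiers can be read straight from the working copy. The following is a minimal sketch, assuming the working copy layout shown above; list_sources is a hypothetical helper name and is not part of csv2rdf4lod-automation.

```shell
# Hypothetical helper (not part of csv2rdf4lod-automation):
# print one source identifier per line from a checked-out source/ directory.
# Plain files such as readme.txt and hidden directories such as .svn are skipped,
# because the trailing slash in the glob matches only non-hidden directories.
list_sources() {
  local source_dir="${1:-source}"
  local d
  for d in "$source_dir"/*/; do
    # Guard against the unexpanded glob when there are no subdirectories.
    if [ -d "$d" ]; then
      basename "$d"
    fi
  done
}
```

Run from escience-logd-svn, `list_sources source` would print the directory names that svn reported during the checkout.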

Then, when you want just a certain dataset:

$ cd escience-logd-svn/source
$ svn update --depth=infinity data-gov/1554

Or, if you want just a certain version of a certain dataset:

$ cd escience-logd-svn/source
$ svn update --depth=infinity data-gov/1554/version/2011-Sep-14
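
Both commands address a path of the form SOURCE/DATASET or SOURCE/DATASET/version/VERSION below source/. If you fetch many datasets, a small path builder may help; this is only a sketch, dataset_path is a hypothetical name, and the commands are printed rather than executed so you can review them first.

```shell
# Hypothetical helper: build the path of a dataset, or of one version of it,
# relative to source/ (SOURCE/DATASET or SOURCE/DATASET/version/VERSION).
dataset_path() {
  local source_id="$1" dataset_id="$2" version_id="${3:-}"
  if [ -n "$version_id" ]; then
    printf '%s/%s/version/%s\n' "$source_id" "$dataset_id" "$version_id"
  else
    printf '%s/%s\n' "$source_id" "$dataset_id"
  fi
}

# Print (rather than run) the update commands used in the examples above.
echo svn update --depth=infinity "$(dataset_path data-gov 1554)"
echo svn update --depth=infinity "$(dataset_path data-gov 1554 2011-Sep-14)"
```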

Checking out a single dataset

Although you can check out a single dataset, it should still be situated within the [directory structure](Directory Conventions) that csv2rdf4lod-automation expects. The following pair of commands can be used to fulfill this expectation.

$ cd escience-logd-svn
$ svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov/4383 \
                                                                             source/data-gov/4383
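
The second argument mirrors the tail of the URL, which is what keeps the checkout inside the expected directory structure. As a sketch, the local path can be derived from the URL instead of typed twice; local_path_for_url is a hypothetical helper, and it assumes the repository URL contains a /source/ segment as LOGD's does.

```shell
# Hypothetical helper: derive the local directory that a dataset URL should be
# checked out into, so that it sits at source/SOURCE/DATASET.
local_path_for_url() {
  local url="${1%/}"   # tolerate a trailing slash
  case "$url" in
    */source/*) printf 'source/%s\n' "${url##*/source/}" ;;
    *) echo "not a data root URL: $1" >&2; return 1 ;;
  esac
}

url=https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov/4383
# Print the checkout command instead of running it.
echo svn checkout "$url" "$(local_path_for_url "$url")"
```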

Going in to take a look, we see the retrieval script that will fetch the original data from data.gov, some dataset-specific environment variables (that avoid making a raw conversion and keep the result uncompressed), the enhancement parameters that describe how to convert the table to good RDF, and a cached version from October that keeps a copy from data-gov (in case they remove or change it).

bash-3.2$ cd source/data-gov/4383/version/
bash-3.2$ ls
retrieve.sh
csv2rdf4lod-source-me.sh
e1.params.ttl
2010-Oct-22/

The retrieve.sh follows the convention for Automated creation of a new Versioned Dataset.

After retrieving and converting it, you can use the version-controlled unit tests to verify that your results suit the curators' original intent. Unit testing follows the convention described by cr-test-conversion.sh. The output for testing dataset 4383 is shown below (add --verbose for more output).

bash-3.2$ cr-test-conversion.sh --setup
...
Load: publish/data-gov-4383-2011-Dec-05.ttl
Add: 50,000 triples  (Batch: 11,579 / Run: 11,579)
Add: 100,000 triples  (Batch: 23,651 / Run: 15,547)
110,412 triples: loaded in 7.3 seconds [15,189.4 triples/s]
                  ../../rq/test/ask/present/alabama-lod-linked-directly-referenced.rq Ask => Yes
                  ../../rq/test/ask/present/alabama-lod-linked-indirectly-referenced.rq Ask => Yes
--------------------------------------------------------------------------------
2 of 2 passed
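
If you want to act on the result in a script, the summary line can be turned into an exit status. This is a sketch that assumes the final "N of M passed" line keeps the format shown above; all_tests_passed is a hypothetical name, and we have not checked whether cr-test-conversion.sh already reports failures through its own exit status.

```shell
# Hypothetical helper: read test output on stdin and succeed only if the last
# "N of M passed" summary line reports that every test passed.
# Returns 0 if all passed, 1 if some failed, 2 if no summary line was found.
all_tests_passed() {
  local summary passed total
  summary=$(sed -n -E '/^[0-9]+ of [0-9]+ passed$/p' | tail -n 1)
  if [ -z "$summary" ]; then
    return 2
  fi
  passed=${summary%% *}      # text before the first space
  total=${summary#* of }     # drop "N of "
  total=${total%% *}         # keep the number before " passed"
  [ "$passed" -eq "$total" ]
}
```

For example, `cr-test-conversion.sh --setup | all_tests_passed && echo "safe to publish"`.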

Version controlling CSV2RDF4LOD environment variables

When installing csv2rdf4lod-automation, a single csv2rdf4lod-source-me.sh is created for you to establish the CSV2RDF4LOD environment variables each time you start a shell. When working on your own, it's nice to have a "one-stop shop" where your environment variables are set. But as you start using csv2rdf4lod for more projects, on more machines, in coordination with other developers, different variables will begin to apply in different circumstances. For these more complex environments, see CSV2RDF4LOD environment variables (considerations for a distributed workflow).
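
One way to picture layered settings is to collect every csv2rdf4lod-source-me.sh between the data root and the directory you are working in, outermost first, so that dataset-specific values override project-level ones when sourced in order. The sketch below is our own illustration; list_source_me_files is a hypothetical helper, and csv2rdf4lod-automation may resolve these files differently.

```shell
# Hypothetical helper: walk from directory $1 up to directory $2 and print every
# csv2rdf4lod-source-me.sh found on the way, outermost directory first.
list_source_me_files() {
  local dir="$1" stop="$2" found=""
  while :; do
    if [ -f "$dir/csv2rdf4lod-source-me.sh" ]; then
      # Prepend, so files nearer the data root come first.
      found="$dir/csv2rdf4lod-source-me.sh${found:+
$found}"
    fi
    if [ "$dir" = "$stop" ] || [ "$dir" = "/" ] || [ "$dir" = "." ]; then
      break
    fi
    dir=$(dirname "$dir")
  done
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
  fi
}
```

Sourcing the printed files in order applies the most general settings first and the most specific last.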

To add source data or not to add source data

  • Q1: How many person-hours has your organization spent with this dataset?
    • A lot? lean to add.
  • Q2: How many applications are using the resulting data?
    • A lot? lean to add.
  • Q3: How likely is it that anyone will be able to re-obtain the same source data in the future? How ephemeral is it?
    • Very likely? lean to NOT add.
  • Q4: How big is the data?
    • Small? lean to add.
  • Q5: Will your data consumers believe you?
    • Totally? lean to NOT add. (the RDF you're serving up in an isolated named graph is Good 'Nuff)
    • Sorta? lean to add.
    • Impossible! lean to NOT add. (you need to use Automated creation of a new Versioned Dataset to let them recreate it from scratch for themselves).
  • Q6: Are you giving your consumers exactly what they need and want?
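
For Q4, it can help to see which files are large before deciding. The following sketch lists files above a size threshold; list_large_files is a hypothetical helper, and the 10240 KiB default is an arbitrary assumption rather than a csv2rdf4lod recommendation.

```shell
# Hypothetical helper: list files under a directory that are larger than a
# threshold given in KiB, skipping Subversion metadata directories.
list_large_files() {
  local dir="${1:-source}" min_kib="${2:-10240}"
  find "$dir" -name .svn -prune -o -type f -size +"${min_kib}"k -print
}
```

For example, `list_large_files source 51200` would show the files over roughly 50 MiB in your data root.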

Project-level directory organization

We recommend organizing your data and dependencies using the following directory structure. Replace logd with your project name, and lebot, difranzo, and wangp with the usernames of your development team.

# This is where virtuoso gets installed.
/opt/virtuoso 

# This is from a git clone; updates via git pull
/opt/csv2rdf4lod-automation 

# what software DOES logd offer?
/opt/logd/ 

# contains port/user/pass of virtuoso endpoint.
/srv/logd/data/csv2rdf4lod-source-me-for-virtuoso-credentials.sh (SOFT link to config/...r-virtuoso-credentials.sh)

# Working copy of the PUBLIC version controlled data root
/srv/logd/data/source/ 

# Here to be in your face, maximize reproducibility while minimizing "steps to setup"
/srv/logd/data/source/*source-me*.sh 

# data/source is the production data root, from which conversions get deployed
/srv/logd/data/source/data-gov/92

# data/dev contains development working copies of the same data root in the Public SVN
/srv/logd/data/dev/lebot/source/data-gov/92 

# data/dev contains development working copies of the same data root in the Public SVN
/srv/logd/data/dev/difranzo/source/census-gov/nutrition 

# data/dev contains development working copies of the same data root in the Public SVN
/srv/logd/data/dev/wangp/source/epa-gov/nwis 

# under PRIVATE version control (b/c it has user/pass)
/srv/logd/config/triple-store/virtuoso/csv2rdf4lod-source-me-for-virtuoso-credentials.sh 

# a SOFT link for self-contained documentation
/srv/logd/config/triple-store/virtuoso/virtuoso.ini 
/srv/logd/config/triple-store/virtuoso/isql 
/srv/logd/config/triple-store/virtuoso/virtuoso.log
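
The project-specific part of this layout can be created in one step. This is a minimal sketch that covers only the directories under the project root (not /opt, the working copies, or the soft links); make_project_layout is a hypothetical helper name.

```shell
# Hypothetical helper: create the empty project directories described above.
# Usage: make_project_layout BASE PROJECT [USER...]
make_project_layout() {
  local base="$1" project="$2" user
  shift 2
  # Production data root and the privately versioned configuration directory.
  mkdir -p "$base/$project/data/source" \
           "$base/$project/config/triple-store/virtuoso"
  # One development working-copy location per team member.
  for user in "$@"; do
    mkdir -p "$base/$project/data/dev/$user/source"
  done
}
```

For the example above, that would be `make_project_layout /srv logd lebot difranzo wangp`.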

Transitioning data downloads and conversions to a new server

If you've created a [csv2rdf4lod data root](csv2rdf4lod-automation data root) on a local machine without version control and want to deploy it to a server by means other than version control, you can follow these suggestions. However, we encourage you to use version control to avoid headaches as your project grows and more people come to expect consistent results from it. The last thing you want to wrestle with as a deadline approaches is duplicate directory structures and which-is-the-latest syndrome. Use version control. Start using version control, yesterday.

First, create a place on the server for your data root. The recommended location is /srv/your-project-name/data/:

server:$ mkdir -p /srv/tim-test-project/data/
server:$ chown -R lebot:tw /srv/tim-test-project

Go to the directory containing your local data root. Here, we look into the output directory of one of the conversion cockpits to make sure the source/ directory is a data root following the directory conventions:

local:$ ls source/data-gov/4383/version/2011-Dec-05/automatic/
4383.xls.csv.e1.sample.ttl
4383.xls.csv.e1.ttl
4383.xls.csv.e1.void.ttl
4383.xls.csv.raw.params.ttl
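
To check the whole data root rather than one cockpit, you can list every directory that sits at source/SOURCE/DATASET/version/VERSION. This sketch assumes only that layout; list_conversion_cockpits is a hypothetical helper name.

```shell
# Hypothetical helper: print every conversion cockpit under a source/ directory,
# i.e. each directory matching SOURCE/DATASET/version/VERSION.
list_conversion_cockpits() {
  local source_dir="${1:-source}" d
  for d in "$source_dir"/*/*/version/*/; do
    # Guard against the unexpanded glob when nothing matches.
    if [ -d "$d" ]; then
      printf '%s\n' "${d%/}"
    fi
  done
}
```

If `list_conversion_cockpits source` prints nothing, the directory is probably not laid out as a data root.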

Consider using $CSV2RDF4LOD_HOME/bin/util/cr-trim-reproducible-output.sh to reduce the data transfer size. If you do, you'll need to reproduce the RDF conversion on the server using the source data and enhancement parameters ($CSV2RDF4LOD_HOME/bin/cr-pull-conversion-triggers.sh can help).

Send your local data root to the server at server.tw.rpi.edu, into the directory you just created:

local:$ rsync -auvz source server.tw.rpi.edu:/srv/tim-test-project/data --exclude .DS_Store

Next, you can go into a conversion cockpit and publish the dataset that you already converted. First, we publish the dump files on the web. Then, we load into virtuoso.

server:$ cd /srv/tim-test-project/data/source/data-gov/4383/version/2011-Dec-05/

server:$ publish/bin/ln-to-www-root-data-gov-4383-2011-Dec-05.sh

server:$ publish/bin/virtuoso-load-data-gov-4383-2011-Dec-05.sh

Note that publishing to virtuoso requires some configuration, which is described [here](Publishing conversion results with a Virtuoso triplestore). As a summary, make sure to:

  • export CSV2RDF4LOD_CONVERT_DATA_ROOT=/srv/YOUR-PROJECT-NAME/data/source/ in your project-level csv2rdf4lod-source-me.sh as discussed here.
  • Add the path /srv/YOUR-PROJECT-NAME/data/source to the DirsAllowed parameter in your virtuoso.ini, which should have been soft linked from /srv/logd/config/triple-store/virtuoso/virtuoso.ini when virtuoso was set up for your project.
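
Before running the load script, you may want to confirm the second item. The sketch below checks whether a directory is listed in DirsAllowed; dirs_allowed_contains is a hypothetical helper, and it assumes the parameter is a single comma-separated line of the form `DirsAllowed = dir1, dir2`.

```shell
# Hypothetical helper: succeed if directory $2 appears in the DirsAllowed
# parameter of the virtuoso.ini file $1 (trailing slashes are ignored).
dirs_allowed_contains() {
  local ini="$1" wanted="${2%/}" line entry
  # Keep only the value of the first DirsAllowed line.
  line=$(sed -n 's/^[[:space:]]*DirsAllowed[[:space:]]*=//p' "$ini" | head -n 1)
  local IFS=','
  for entry in $line; do
    # Trim surrounding whitespace from each comma-separated entry.
    entry=$(printf '%s' "$entry" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
    if [ "${entry%/}" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, `dirs_allowed_contains /srv/logd/config/triple-store/virtuoso/virtuoso.ini "$CSV2RDF4LOD_CONVERT_DATA_ROOT" || echo "add the data root to DirsAllowed"`.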