Automated creation of a new Versioned Dataset
- For the distinction among Abstract Dataset, Versioned Dataset, and Layer Dataset, see the Springer LOGD book chapter (in press).
- Directory structure as described in Conversion process phase: retrieve.
- Triggers are used to automatically create new datasets.
- Aggregating subsets of converted datasets uses the automation described here.
- Secondary Derivative Datasets use the automation described here.
- The construction of a new dataset should conform to the Directory Conventions.
This page describes how to set up datasets so that others can recreate them from their original sources.
Going into a dataset's version/ directory:
$ cd /source/twc-rpi-edu/instance-hub-us-states-and-territories/version/
$ ls
we see a directory for each version of abstract dataset http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-us-states-and-territories:
2011-Apr-01/
2011-Apr-09/
2011-Mar-31/
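The version identifiers listed above follow a `YYYY-Mon-DD` date convention. As a sketch of that convention (the automation scripts generate the identifier for you; this is only an illustration), a matching version directory can be produced with `date`:

```shell
# Generate a version identifier in the YYYY-Mon-DD style seen above
# (e.g. 2011-Apr-09), then create the corresponding version directory.
version=$(date +%Y-%b-%d)
mkdir -p "$version"
echo "$version"
```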
We can set up future versions by creating a script retrieve.sh with the following contents:
#!/bin/bash
#
#3> <>
#3> rdfs:comment
#3> "Script to retrieve and convert a new version of the dataset.";
#3>
#3> rdfs:seeAlso
#3> </~https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>,
#3> </~https://github.com/timrdf/csv2rdf4lod-automation/wiki/tic-turtle-in-comments>;
#3> .
export CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true"
$CSV2RDF4LOD_HOME/bin/util/google2source.sh -w t9QH44S-_D6-4FQPOCM81BA auto
Running google2source.sh will describe its usage. The -w flag indicates that the version directory should actually be created (instead of a dry run), t9QH44S-_D6-4FQPOCM81BA is the Google spreadsheet key (which can be copied from the URL when viewing the spreadsheet), and auto says to use a default name for the local file created when retrieving the spreadsheet.
Remember to chmod +x retrieve.sh the first time, then run:
./retrieve.sh
whenever you want to create a new versioned dataset by retrieving another copy of the Google spreadsheet. When doing so, the initial raw conversion will be run automatically, and any enhancement conversions will be run if the [global enhancement parameters are in place](Reusing enhancement parameters for multiple versions or datasets) (e.g., /source/twc-rpi-edu/instance-hub-us-states-and-territories/version/).
If global enhancement parameters are established and the raw layer is useless to you, include CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER (as in the export line of retrieve.sh above).
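A minimal fragment showing the variable in isolation (in practice it is exported inside retrieve.sh before the conversion is invoked, as shown above):

```shell
# Skip generating the raw conversion layer; only enhancement layers will
# be produced (assumes global enhancement parameters are in place).
export CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true"
```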
If Google returns an empty row before the header, use conversion:HeaderRow.
If we have a data download URL and have determined the source identifier and the dataset identifier with which we want to organize its RDF conversion, we can describe it in DCAT and let cr-retrieve.sh act upon the access metadata to set up the directory structure and convert. This is done with cr-dcat-retrieval-url.sh
> cr-pwd.sh
source/
> mkdir -p cms-gov/hha-documentation
> cd cms-gov/hha-documentation
Make a file dcat.ttl containing something similar to the following (change the Distribution name and the download URL):
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .
@prefix : <http://purl.org/twc/health/id/> .
<http://purl.org/twc/health/source/cms-gov/dataset/hha-documentation>
a void:Dataset, dcat:Dataset;
conversion:source_identifier "cms-gov";
conversion:dataset_identifier "hha-documentation";
prov:wasDerivedFrom :as_a_csv_2012-09-29cms-gov-hha-documentation;
.
:as_a_csv_2012-09-29cms-gov-hha-documentation
a dcat:Distribution;
dcat:downloadURL <http://www.cms.gov/Research-Statistics-Data-and-Systems/Files-for-Order/CostReports/DOCS/HHA-DOCUMENTATION.zip>;
.
#3> <> prov:wasAssociatedWith <http://tw.rpi.edu/instances/TimLebo> .
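cr-retrieve.sh parses the access metadata and downloads the distribution for you. Purely to illustrate what "acting upon the access metadata" means, here is a hypothetical sketch that pulls a dcat:downloadURL out of a dcat.ttl with grep/sed (the real scripts parse Turtle robustly, e.g. with rapper; the file contents and URL below are made up):

```shell
# Hypothetical illustration: extract the first dcat:downloadURL from dcat.ttl.
# (cr-retrieve.sh does this properly; this is not how it is implemented.)
cat > dcat.ttl <<'EOF'
@prefix dcat: <http://www.w3.org/ns/dcat#> .
<#dist> dcat:downloadURL <http://example.org/HHA-DOCUMENTATION.zip> .
EOF
url=$(grep -o 'dcat:downloadURL <[^>]*>' dcat.ttl | head -1 | sed 's/.*<\(.*\)>/\1/')
echo "$url"   # the URL a retrieval script would fetch
```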
In the twc-lobd svn, going to a dataset's version/ directory:
$ cd /source/ncbi-nlm-nih-gov/gene2ensembl/version/
$ ls
we see a directory for each version of abstract dataset http://health.tw.rpi.edu/source/ncbi-nlm-nih-gov/dataset/gene2ensembl:
2011-Apr-16/
We can set up future versions by creating a script retrieve.sh with the following contents (now version controlled):
#!/bin/bash
#
#3> @prefix doap: <http://usefulinc.com/ns/doap#> .
#3> @prefix dcterms: <http://purl.org/dc/terms/> .
#3> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
#3>
#3> <#>
#3> a doap:Project;
#3> dcterms:description
#3> "Script to retrieve and convert a new version of the dataset.";
#3> rdfs:seeAlso
#3> </~https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>;
#3> .
export CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true"
$CSV2RDF4LOD_HOME/bin/cr-create-versioned-dataset-dir.sh cr:auto \
'ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz' \
--comment-character '#' \
--header-line 0 \
--delimiter '\t'
Running cr-create-versioned-dataset-dir.sh with no arguments will describe its usage:
$ cr-create-versioned-dataset-dir.sh
usage: cr-create-versioned-dataset-dir.sh version-identifier URL [--comment-character char]
[--header-line row]
[--delimiter char]
version-identifier: conversion:version_identifier for the VersionedDataset to create (use cr:auto for default)
URL : URL to retrieve the data file.
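The --comment-character, --header-line, and --delimiter flags describe the shape of the tabular file so the converter can interpret it: for gene2ensembl, lines starting with '#' are comments, there is no data header row (--header-line 0), and fields are tab-separated. A rough illustration of that interpretation (not the converter itself; the sample row is fabricated):

```shell
# Fabricated two-line sample in the gene2ensembl shape: a '#' comment line
# holding the column names, then tab-delimited data.
printf '#tax_id\tGeneID\tEnsembl_gene_identifier\n9606\t1\tENSG_EXAMPLE\n' > sample.tsv
# Interpreting it as retrieve.sh declares: drop comment lines and treat
# the remainder as headerless tab-delimited data (here, take column 3).
grep -v '^#' sample.tsv | cut -f 3
```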
Remember to chmod +x retrieve.sh the first time, then run:
./retrieve.sh
whenever you want to create a new versioned dataset by retrieving the data file URL. When doing so, the initial raw conversion will be run automatically and any enhancement conversions will be run if the [global enhancement parameters are in place](Reusing enhancement parameters for multiple versions or datasets) (e.g., /source/ncbi-nlm-nih-gov/gene2ensembl/version/gene2ensembl.e1.params.ttl).
Use the retrieve.sh template.
cd source/contactingthecongress/directory-for-the-112th-congress/version
cp $CSV2RDF4LOD_HOME/bin/cr-create-versioned-dataset-dir.sh retrieve.sh
These instructions repeat the instructions above, but use some automation driven by parameters available in an RDFa file on the web or on local disk. If you followed the steps above, you do not need to do the steps below. Note that this example requires rapper, which is discussed in Installing csv2rdf4lod automation - complete.
If you'd like to get more serious and set up a data skeleton so that anybody can set up their own version of the dataset, check out Automated creation of a new Versioned Dataset.
bash-3.2$ cd ~/Desktop/source
bash-3.2$ cr-create-dataset-dir.sh \
/~https://github.com/timrdf/csv2rdf4lod-automation/raw/master/bin/dup/scraperwiki-com-uk-offshore-oil-wells-2011-Jan-24.xhtml
bash-3.2$ ls -lt scraperwiki-com/uk-offshore-oil-wells/2011-Jan-24/source/
total 6856
-rw-r--r-- 1 lebot staff 3489521 Jan 24 20:54 uk-offshore-oil-wells.csv
-rw-r--r-- 1 lebot staff 3441 Jan 24 20:54 uk-offshore-oil-wells.csv.pml.ttl
-rw-r--r-- 1 lebot staff 1928 Jan 24 20:54 scraperwiki-com-uk-offshore-oil-wells-2011-Jan-24.xhtml
-rw-r--r-- 1 lebot staff 4278 Jan 24 20:54 scraperwiki-com-uk-offshore-oil-wells-2011-Jan-24.xhtml.pml.ttl
bash-3.2$ cd scraperwiki-com/uk-offshore-oil-wells/2011-Jan-24/
bash-3.2$ cr-create-convert-sh.sh -w source/uk-offshore-oil-wells.csv
bash-3.2$ ./convert-uk-offshore-oil-wells.sh
bash-3.2$ vi automatic/uk-offshore-oil-wells.csv.raw.ttl
cr-retrieve.sh uses cr-idempotent.sh to determine if it is safe to rerun the retrieval.
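The general idea of such an idempotency check can be sketched as a guard that only retrieves when doing so creates something new. This is a hypothetical simplification, not cr-idempotent.sh's actual logic:

```shell
# Hypothetical idempotency guard: skip retrieval if today's version
# directory already exists (the real check in cr-idempotent.sh is
# more involved).
version=$(date +%Y-%b-%d)
if [ -d "version/$version" ]; then
  echo "version/$version already exists; skipping retrieval"
else
  mkdir -p "version/$version"
  echo "created version/$version"
fi
```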
- Triggers is the more general pattern for reproducibility using csv2rdf4lod-automation.
- Reusing enhancement parameters for multiple versions or datasets
- Script: google2source.sh
- Short descriptions for csv2rdf4lod's scripts
- CSV2RDF4LOD_CONVERT_EXAMPLE_SUBSET_ONLY to reduce conversion processing time and output file size.
- Script: cr-test-conversion.sh to setup and invoke automated testing to verify the conversion output.