Script: cr test conversion.sh
Since csv2rdf4lod is being continually developed, it is good to use the latest and greatest version (by using git pull). But what if some new behavior of the converter changes how your data is produced? Well, that's a problem. And you need to know about it ASAP. Even better, I need to know about it ASAP. Ideally, I would know about the problem and fix it before I even release the next version of the converter. That way, you wouldn't have to worry about it. cr-test-conversion.sh helps you identify these problems so that you can handle them quickly. At the same time, it helps you share your explicit expectations for the converter so that I can verify that it works for you before I release another version.
Ultimately, verifying that the conversion meets your expectations makes your applications more stable.
Make sure tdbloader is installed and on your path:
$ which tdbloader
/opt/tdb/TDB-0.8.2/bin/tdbloader
From your conversion cockpit, run:
cr-test-conversion.sh --setup --verbose
This will use tdbloader to load the publish/* dump files into publish/tdb/ and run the unit tests at ../../rq or rq/.
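If you want to see what that amounts to by hand, the following is a minimal sketch that only prints the two TDB commands instead of running them, so you can inspect them first. It assumes the standard Jena TDB options --loc and --query; the function name manual_test_commands is hypothetical, and the actual script may invoke TDB differently.

```shell
# Print (do not run) the TDB commands for loading one dump file and
# running one test query against publish/tdb/.
# manual_test_commands is a hypothetical helper, not part of csv2rdf4lod.
manual_test_commands() {
  tdb_dir="publish/tdb"
  dump_file="$1"   # e.g. a file under publish/
  query_file="$2"  # e.g. a file under rq/test/ask/
  echo "tdbloader --loc=$tdb_dir $dump_file"
  echo "tdbquery --loc=$tdb_dir --query=$query_file"
}

manual_test_commands publish/medicare-gov-catalog-2011-Jul-18.nt \
                     ../../rq/test/ask/present/a-dataset-exists.rq
```

Copy the printed commands into your conversion cockpit if you want to try a single query without the test script.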
The script $CSV2RDF4LOD_HOME/bin/util/cr-test-conversion.sh is a start at tackling this challenge. Like virtually all other cr- scripts, it is invoked from any conversion cockpit. When invoked, it applies a variety of SPARQL queries to verify the converted data.
The testing infrastructure currently uses Jena's TDB because it lets us set up a triple store in a local directory of our choosing. See TWC's page for help installing Jena TDB. If you can successfully run tdbloader and tdbquery, then you're good to go. (If you have a burning desire to test using other triple stores, go vote for #150.)
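A quick way to check both tools at once is sketched below. It relies only on the shell built-in command -v; the function name check_tdb_tools is hypothetical and not part of csv2rdf4lod.

```shell
# Report whether each TDB command-line tool is on the PATH.
# Returns non-zero if at least one of them is missing.
check_tdb_tools() {
  missing=0
  for tool in tdbloader tdbquery; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tdb_tools || echo "Install Jena TDB and add its bin/ directory to your PATH."
```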
Get into a cr:dataset directory (running cr-pwd-type.sh says cr:dataset) and run:
/srv/logd/data/source/nycopendata-socrata-com/zip-code-breakdowns# cr-test-conversion.sh --rq
Get into a conversion cockpit and run:
/srv/logd/data/source/nycopendata-socrata-com/zip-code-breakdowns/version/2012-Apr-11# cr-test-conversion.sh
Version control strategies discusses how csv2rdf4lod-automation can be used within a version control system. When using one, it becomes incredibly easy to report a bug: all one needs to do is commit the .rq and point others to the URL of the test on the SVN web server. For example, someone could say:
Hey, this [1] doesn't work and I need it Real Soon!, it's for my demo.
With just this URL, I can run to my terminal:
$ mkdir hurry-and-fix; cd hurry-and-fix
$ svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog \
source/data-gov-au/catalog
$ cd source/data-gov-au/catalog/version/2011-Jun-27
$ export CSV2RDF4LOD_PUBLISH=true; export CSV2RDF4LOD_PUBLISH_TDB=true
$ ./convert-catalog.sh
bash-3.2$ cr-test-conversion.sh --verbose
................................................................................
rq/test/ask/absent/subject-uri-follows-sdv-naming.rq (Ask => No)
<http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/data.gov.au/version/2011-Jun-27/thing_2> ?p ?o .
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/ - - - FAIL - -
rq/test/ask/present/thing_2-keywords-parsed.rq (Ask => No)
:thing_2 dcterms:subject "Bicycles",
"Bike paths",
"Cycling",
"Transport" .
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/ - - - FAIL - -
rq/test/ask/present/thing_2-keywords-unparsed.rq (Ask => No)
#http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/version/2011-Jun-27/
:thing_2 dgtwc:keywords "Bicycles , Bike paths , Cycling , Transport" ;
e1:keywords_tags "Bicycles , Bike paths , Cycling , Transport" .
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/ - - - FAIL - -
rq/test/ask/present/thing_2.rq (Ask => No)
:thing_2
e1:data_gov_au_category "Community , Health , Transport" ;
dgtwc:categories "Community , Health , Transport" ;
# The following two should be parsed into the three triples below:
dgtwc:category "Community",
"Health",
"Transport";
# The following two should be parsed into the three triples below:
e1:keywords_tags "Bicycles , Bike paths , Cycling , Transport" ;
dgtwc:keywords "Bicycles , Bike paths , Cycling , Transport" ;
dcterms:subject "Bicycles",
"Bike paths",
"Cycling",
"Transport" .
--------------------------------------------------------------------------------
1 of 4 passed
And I can see your new concerns!
By extending Vocabulary of Interlinked Datasets (VoID) and reusing Description of a Project (DOAP), we can model an abstract dataset that is under version control and has unit tests:
<http://logd.tw.rpi.edu/source/worldbank-org/dataset/world-development-indicators>
a conversion:AbstractDataset, void:Dataset;
a conversion:VersionControlledDataset;
doap:repository [
a doap:SVNRepository;
doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/worldbank-org/world-development-indicators/>;
];
a conversion:UnitTestedDataset;
conversion:testable_by [
a doap:Project;
doap:developer <http://tw.rpi.edu/instances/MaryamFazel-Zarandi>;
doap:developer <http://tw.rpi.edu/instances/TimLebo>;
doap:repository [
a doap:SVNRepository;
doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/worldbank-org/world-development-indicators/rq/>
];
];
.
Sometimes tests can only apply to specific versions, since they have to assume specific values for a specific data element. Although they aren't as broadly applicable, they are still useful. The following RDF encoding states that a versioned dataset is under version control and has unit tests:
<http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/version/2011-Jun-27>
a conversion:VersionedDataset, void:Dataset;
a conversion:VersionControlledDataset;
doap:repository [
a doap:SVNRepository;
doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/>;
];
a conversion:UnitTestedDataset;
conversion:testable_by [
a doap:Project;
doap:developer <http://tw.rpi.edu/instances/YongmeiShi>;
doap:developer <http://tw.rpi.edu/instances/TimLebo>;
doap:repository [
a doap:SVNRepository;
doap:location
<https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/version/2011-Jun-27/rq/>
];
];
.
cr-test-conversion.sh --catalog -w will write a listing that types each SPARQL-based unit test as an earl:TestCase. For example, source/worldbank-org/world-development-indicators/rq/test/list.ttl:
@prefix earl: <http://www.w3.org/ns/earl#> .
<ask/absent/impossible_series.rq> a earl:TestCase .
<ask/absent/impossible.rq> a earl:TestCase .
<ask/present/has-a-triple.rq> a earl:TestCase .
<ask/present/has-impossible_series.rq> a earl:TestCase .
<ask/present/has-a-indicator.rq> a earl:TestCase .
<ask/present/has-a-entry.rq> a earl:TestCase .
<ask/present/has-a-country.rq> a earl:TestCase .
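To make the shape of that listing concrete, here is a minimal sketch that produces the same kind of output from a directory of .rq files. It is not the script's actual implementation; the function name list_test_cases is hypothetical, and it assumes the tests sit under an ask/ subdirectory as in the example above.

```shell
# Print a Turtle listing that types every .rq file under <test_dir>/ask
# as an earl:TestCase, with paths relative to <test_dir>.
# list_test_cases is a hypothetical helper, not part of csv2rdf4lod.
list_test_cases() {
  test_dir="$1"
  echo '@prefix earl: <http://www.w3.org/ns/earl#> .'
  (cd "$test_dir" && find ask -name '*.rq' | sort) | while read -r rq; do
    echo "<$rq> a earl:TestCase ."
  done
}

# Only list when run from a directory that actually has rq/test.
if [ -d rq/test ]; then
  list_test_cases rq/test
fi
```

Redirecting the output to rq/test/list.ttl would mimic what the -w flag is described as doing, though the real script may order or format the entries differently.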
Test results vocabularies (see RDF vocabularies used):
- http://www.w3.org/TR/EARL10/ (diagram)
- http://www.w3.org/2006/03/test-description (diagram)
- http://rdfa.digitalbazaar.com/test-suite/ RDFa tester
cr-test-conversion.sh --help:
usage: cr-test-conversion.sh
--rq : Create initial rq/test/ask/{present,absent}/*.rq directory structure.
--setup : Run tests, populate the tdb/ beforehand.
--setup {--verbose, -v}: Run tests, populate the tdb/ beforehand, and show query contents.
: Run tests. Needs rq/test or ../../rq/test and publish/tdb/.
{--verbose, -v} : Run tests. Needs same as above. Shows the query contents while testing.
--catalog -w : Find all rq/test and create rq/test/list.ttl rdf:typing them to earl:TestCase.
--catalog : Show dryrun of finding all rq/test; print hypothetical contents of rq/test/list.ttl.
--show-catalog : Show all rq/test/list.ttl
bash-3.2$ cd /source/medicare-gov/catalog
bash-3.2$ ls
version/
bash-3.2$ cr-test-conversion.sh --rq
Creating rq/test for dataset medicare-gov catalog
rq/test/ask/present
rq/test/ask/present/a-dataset-exists.rq
rq/test/ask/absent
rq/test/ask/absent/impossible.rq
bash-3.2$ ls
version/
rq/
The two sample queries (a-dataset-exists.rq and impossible.rq) take the following form. If you follow this capitalization and structure, the --verbose flag will be a little cleaner when executing the tests.
...
ASK
WHERE {
GRAPH ?g {
...
}
}
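As an illustration of that form, the sketch below writes a new "absent" test from the dataset directory. The file name example-impossible.rq is hypothetical, the triple pattern is borrowed from the verbose output of impossible.rq, and the twi: namespace is an assumption based on the developer URIs used elsewhere on this page.

```shell
# Create an "absent" test: it passes when the ASK query answers No.
# The file name and the twi: prefix URI are assumptions for illustration.
mkdir -p rq/test/ask/absent
cat > rq/test/ask/absent/example-impossible.rq <<'EOF'
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX twi: <http://tw.rpi.edu/instances/>

ASK
WHERE {
   GRAPH ?g {
      twi:TimLebo owl:sameAs twi:notTimLebo .
   }
}
EOF
```

Tests that should answer Yes go under rq/test/ask/present/ instead.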
(or on another machine, according to Version control strategies: only the essential minimum is needed)
Next, we can hop into a conversion cockpit and prepare to test:
bash-3.2$ cd version/2011-Jul-18/
bash-3.2$ ls
source/
doc/
manual/
convert-catalog.sh
automatic/
publish/
bash-3.2$ export CSV2RDF4LOD_PUBLISH_TDB=true
bash-3.2$ publish/bin/publish.sh
...
WARN [main] (FactoryGraphTDB.java:241) - No BGP optimizer
Load: publish/medicare-gov-catalog-2011-Jul-18.nt
34,552 triples: loaded in 2.3 seconds [15,254.7 triples/s]
Source the my-csv2rdf4lod-source-me.sh for the project that you are testing against (see my-csv2rdf4lod-source-me.sh), then reset the following to point to your copy of the converter:
- CSV2RDF4LOD_HOME
- CSV2RDF4LOD_CONVERT_MACHINE_URI
- CSV2RDF4LOD_CONVERT_PERSON_URI
bash-3.2$ cr-test-conversion.sh
../../rq/test/ask/absent/impossible.rq Ask => No
../../rq/test/ask/present/a-dataset-exists.rq Ask => Yes
--------------------------------------------------------------------------------
2 of 2 passed
If you'd like to see a bit more, use -v or --verbose:
bash-3.2$ cr-test-conversion.sh --verbose
................................................................................
../../rq/test/ask/absent/impossible.rq (Ask => No)
twi:TimLebo owl:sameAs twi:notTimLebo .
................................................................................
../../rq/test/ask/present/a-dataset-exists.rq (Ask => Yes)
?dataset a conversion:Dataset, void:Dataset .
--------------------------------------------------------------------------------
2 of 2 passed
From a conversion cockpit:
bash-3.2$ find rq
rq
rq/test
rq/test/ask
rq/test/ask/absent
rq/test/ask/absent/9-to-7.rq
rq/test/ask/present
rq/test/ask/present/0-to-2.rq
rq/test/ask/present/2-to-3.rq
rq/test/ask/present/3-to-5.rq
rq/test/ask/present/3-to-7.rq
rq/test/ask/present/5-to-1.rq
rq/test/ask/present/7-to-5.rq
Set export CSV2RDF4LOD_PUBLISH_TDB=true so that the conversion is loaded into a TDB directory to query.
bash-3.2$ cr-test-conversion.sh -v
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/
rq/test/ask/absent/9-to-7.rq (Ask => Yes) - - - FAIL - - -
typed_subdivision_order_3:r40040c9reference_199_VA_US geonames:parentFeature <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> .
................................................................................
rq/test/ask/present/0-to-2.rq (Ask => Yes)
zip-us-us:point_40040
a wgs:Point;
geonames:parentFeature <http://logd.tw.rpi.edu/id/usps-com/zip/23690>;
wgs:lat ?lat;
wgs:long ?long .
................................................................................
rq/test/ask/present/2-to-3.rq (Ask => Yes)
<http://logd.tw.rpi.edu/id/usps-com/zip/23690> geonames:parentFeature typed_place:Yorktown_VA_US .
................................................................................
rq/test/ask/present/3-to-5.rq (Ask => Yes)
typed_place:Yorktown_VA_US geonames:parentFeature typed_subdivision_order_1:VA_US .
................................................................................
rq/test/ask/present/3-to-7.rq (Ask => Yes)
typed_place:Yorktown_VA_US geonames:parentFeature <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> .
................................................................................
rq/test/ask/present/5-to-1.rq (Ask => Yes)
typed_subdivision_order_1:VA_US geonames:parentFeature typed_country:US .
................................................................................
rq/test/ask/present/7-to-5.rq (Ask => Yes)
<http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> geonames:parentFeature typed_subdivision_order_1:VA_US .
--------------------------------------------------------------------------------
6 of 7 passed
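The pass/fail rule visible in these runs is that a query under ask/present/ is expected to answer Yes and a query under ask/absent/ is expected to answer No. The sketch below reproduces that rule and the final tally from a list of "path answer" pairs. It is inferred from the output shown on this page rather than taken from the script itself, and the function names expected_answer and tally are hypothetical.

```shell
# Expected ASK answer, derived from the directory the test lives in.
expected_answer() {
  case "$1" in
    */ask/present/*) echo "Yes" ;;
    */ask/absent/*)  echo "No" ;;
    *)               echo "unknown" ;;
  esac
}

# Read "path answer" pairs from stdin, flag failures, and print a tally.
tally() {
  passed=0
  total=0
  while read -r path answer; do
    total=$((total + 1))
    if [ "$(expected_answer "$path")" = "$answer" ]; then
      passed=$((passed + 1))
    else
      echo "$path (Ask => $answer) - - - FAIL - - -"
    fi
  done
  echo "$passed of $total passed"
}

tally <<'EOF'
rq/test/ask/absent/9-to-7.rq Yes
rq/test/ask/present/0-to-2.rq Yes
rq/test/ask/present/2-to-3.rq Yes
EOF
# prints:
# rq/test/ask/absent/9-to-7.rq (Ask => Yes) - - - FAIL - - -
# 2 of 3 passed
```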