timrdf edited this page Jul 27, 2012 · 49 revisions

This project supports http://inference-web.org by providing version-controlled pointers to PML instances, along with shell-script automation so you can gather them yourself. See the instance analysis effort for details. See Are my PML instances listed? to check whether your PML gets slurped up in our crawl.

Ways to get at PML

PML is accessible in a variety of forms, and to gather it all we need to account for each.

  • Nested web directories
    • Web directory of documents (Li/Cynthia/eScience/linked proofs)
      • Document
  • SPARQL query for list of documents (Tim's LOGD)
    • Document
  • Subject and Predicate URIs
    • Document: SPARQL DESCRIBE to get PML (Jim's granite)
  • some input
    • Document: SPARQL (UTEP)

Nested web directories

http://inference-web.org/proofs/wino/ is a nested web directory. So is http://inference-web.org/proofs/tonys.moto.stanford.edu/

wget will do this with its --mirror option, but you have to tell it to ignore robots.txt, since inference-web.org explicitly prohibits web crawlers:

wget --mirror -e robots=off -A owl,rdf,ttl,nt --no-parent http://inference-web.org/proofs/tonys.moto.stanford.edu/

SPARQL query for list of documents

logd.rq contains the query; logd.ep names the endpoint to which the query is submitted.

(Some turmoil here. Is the URL encoded in PML 'as the file' or 'separate from the file'? RDF-style referencing via p:Information, PML-style referencing via p:hasURL, or BOTH?)
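The query submission can be sketched as follows. The inline values are stand-ins for the contents of logd.rq and logd.ep; the real files live in this repository and their exact contents are assumed here:

```shell
#!/bin/sh
# Sketch: build the curl call that submits the query in logd.rq to the
# endpoint named in logd.ep. The inline values below are placeholders
# standing in for the real file contents.
endpoint='http://example.org/sparql'              # stand-in for: $(cat logd.ep)
query='SELECT ?doc WHERE { ?doc ?p ?o } LIMIT 10' # stand-in for: $(cat logd.rq)
# Print the request rather than issuing it, matching the dry-run style
# of the scripts in this repository:
echo curl -H 'Accept: text/csv' --data-urlencode "query=$query" "$endpoint"
```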

Subject and Predicate URIs
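When all we have is a subject or predicate URI, a SPARQL DESCRIBE can pull back the PML around it. A minimal sketch, where the endpoint and URI are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch: retrieve PML for a known subject (or predicate) URI with a
# SPARQL DESCRIBE query. Endpoint and URI are hypothetical placeholders.
endpoint='http://example.org/sparql'    # hypothetical endpoint
uri='http://example.org/proof.owl#ns1'  # hypothetical subject URI
query="DESCRIBE <$uri>"
# Print the request rather than issuing it (dry run):
echo curl -H 'Accept: application/rdf+xml' \
  --data-urlencode "query=$query" "$endpoint"
```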

some input

Notes

Since PML is RDF, we can use VoID to describe them.

Use void:vocabulary http://inference-web.org/2.0/pml when it is unknown whether the instances reference PML-P, PML-J, etc.
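A minimal VoID description along those lines might look like the following; the dataset node name is a hypothetical placeholder:

```ttl
# Hypothetical VoID description of a gathered set of PML instances.
@prefix void: <http://rdfs.org/ns/void#> .

<#pml-instances> a void:Dataset ;
   void:vocabulary <http://inference-web.org/2.0/pml> .
```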

manual vs. automatic: directly asserted vs. programmatically generated (just like in csv2rdf4lod). Note that the direction of flow is twisted differently from the data aggregation use case: there, manual went to automatic because it was processing source into something; here, manual goes to source because the manual portion is used to determine what to retrieve.

distinction: stuff I produced with what I already had vs. stuff I got from somewhere else. Once you have it, you're just doing stuff with it. Getting it makes contact with the external world and should be considered more carefully than purely internal processing.

The implementation requires pcurl.sh, md5.sh, and cache-queries.sh from csv2rdf4lod, as well as wget.
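A quick preflight check for those dependencies might look like this; the tool names come from the text above, but the check itself is an assumption, not part of the repository:

```shell
#!/bin/sh
# Sketch: verify the helper scripts and wget are on the PATH before
# running. Reports each missing tool; this check is illustrative only.
for tool in pcurl.sh md5.sh cache-queries.sh wget; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```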

Workflow

Initial DRAFT of workflow diagram:

workflow diagram

bash-3.2$ cd plunk/instances/web-directories
bash-3.2$ ./2source.sh
wget --mirror -e robots=off -A owl,rdf,ttl,nt --no-parent http://inference-web.org/proofs/wino/
wget --mirror -e robots=off -A owl,rdf,ttl,nt --no-parent http://inference-web.org/proofs/tonys.moto.stanford.edu/
wget --mirror -e robots=off -A owl,rdf,ttl,nt --no-parent http://escience.rpi.edu/2010/mlso/PML/
wget --mirror -e robots=off -A owl,rdf,ttl,nt --no-parent http://www.rpi.edu/~michaj6/escience/pml/

(run with -w or --write to invoke mirroring)
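The -w/--write convention above might be implemented along these lines. This is a reconstruction of the dry-run pattern, not the actual 2source.sh:

```shell
#!/bin/sh
# Sketch of the dry-run convention: print each wget command by default,
# and only execute it when -w or --write is given. This is a
# reconstruction, not the actual 2source.sh.
write='no'
for arg in "$@"; do
  case "$arg" in
    -w|--write) write='yes' ;;
  esac
done

mirror() {
  echo "wget --mirror -e robots=off -A owl,rdf,ttl,nt --no-parent $1"
  if [ "$write" = 'yes' ]; then
    wget --mirror -e robots=off -A owl,rdf,ttl,nt --no-parent "$1"
  fi
}

mirror http://inference-web.org/proofs/wino/
mirror http://inference-web.org/proofs/tonys.moto.stanford.edu/
```

Without -w, the script only prints what it would do, which keeps a record of the mirror commands without touching the network.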