conversion:object_search

Tim L edited this page May 29, 2014 · 41 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

  • conversion:object_search is one of many conversion:Enhancements.
  • conversion:object_search is like conversion:SubjectAnnotation because it adds triples to describe the subject(s) of the table row, but instead of explicit predicate-object pairs, conversion:object_search searches the cell value to determine the predicate and/or object of the triple describing the row subject.

What we will cover

This page describes how to search a cell value to assert additional triples describing the table row subject(s).

Usage pattern 1: Reusing the cell's value

Take the entire value of the cell and construct a URL with it:

      conversion:enhance [
         ov:csvCol          3;
         ...
         conversion:equivalent_property dcterms:identifier;
         conversion:range               rdfs:Literal;
         ...
         conversion:object_search [
            conversion:regex     "^(.*)$";
            conversion:predicate foaf:homepage;
            conversion:object    "http://www.ncbi.nlm.nih.gov/pubmed/[\\1]";
         ];

Will produce:

<http://bio2rdf.org/pubmed:11587856>
   dcterms:identifier "11587856" ;
   foaf:homepage <http://www.ncbi.nlm.nih.gov/pubmed/11587856> ;

from the line in gene2pubmed:

205920   3927647  11587856

(If you want to affect the subject of the triple, see this)
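
As a rough illustration of what the pattern above does to a single cell, the following Python sketch applies the same regex and substitutes the first captured group into the conversion:object template. It is not the converter's implementation: `expand_template` and `GROUP_1` are hypothetical names introduced here, and Python's `re` module stands in for whatever regex engine the converter uses.

```python
import re

# The captured-group variable as this sketch spells it (an assumption of the sketch).
GROUP_1 = r"[\1]"

def expand_template(cell_value, regex, template):
    """Return one expanded template per regex match found in cell_value."""
    return [template.replace(GROUP_1, m.group(1))
            for m in re.finditer(regex, cell_value)]

homepages = expand_template(
    "11587856",                                   # third column of the gene2pubmed line
    r"^(.*)$",                                    # capture the entire cell value
    r"http://www.ncbi.nlm.nih.gov/pubmed/[\1]",   # conversion:object template
)
print(homepages)
```

Because the regex is anchored at both ends, it yields exactly one match per cell, so each row gets a single foaf:homepage value.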

Usage pattern 2: Reusing a comma-delimited cell value

NOTE: This should only be used in degenerate cases when you can't do it with conversion:delimits_object because, for some odd reason, you want to keep the unparsed value around in your enhancement. conversion:delimits_object is a much more elegant way to parse the cell value.

      conversion:enhance [ 
         ov:csvCol         14;
         ...
         conversion:object_search [
            conversion:regex     "([^,]+), ";
            conversion:predicate dcterms:subject;
            conversion:object    "[\\1]";
         ];
         conversion:object_search [
            conversion:regex     ", ([^,]+)$"; # If you have a single regex, feel free to email me.
            conversion:predicate dcterms:subject;
            conversion:object    "[\\1]";
         ];

If you're wrestling around with spacing, try the [>\\1<] [template variable](Using template variables to construct new values).
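
To see why two searches are needed, here is a small Python sketch that runs both regexes from the pattern above over a made-up cell value. The cell content is an assumption for illustration, and Python's `re.findall` only approximates how the converter iterates over matches.

```python
import re

cell = "Genetics, Cell Biology, Immunology"   # hypothetical comma-delimited cell

# First object_search: every item that is followed by ", ".
leading = re.findall(r"([^,]+), ", cell)

# Second object_search: the final item, anchored at the end of the cell.
trailing = re.findall(r", ([^,]+)$", cell)

subjects = leading + trailing
print(subjects)

# A cell holding one item and no comma matches neither regex.
single = re.findall(r"([^,]+), ", "Genetics") + re.findall(r", ([^,]+)$", "Genetics")
print(single)
```

In this sketch the first regex never sees the last item and the second never sees the earlier ones, which is why the pair is used together.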

Example 1: Searching Tweets for mentions of Stocks.

After some initial enhancements, twapperkeeper's CSV row (full input file here):

High Volume Stock: stock analysis website - $ABC - http://www.dojispace.com/stock-picks/amerisourcebergen-stock-price-ABC.aspx,,timlisten27,14522987982098432,130595362,en,&lt;a href=&quot;http://www.dojispace.com&quot; rel=&quot;nofollow&quot;&gt;Stock Screener&lt;/a&gt;,http://s.twimg.com/a/1291760612/images/default_profile_0_normal.png,,0,0,Tue 14 Dec 2010 03:32:04 +0000,1292297524

can become:

stocks:tweet_14522987982098432 
   dcterms:identifier "tweet_14522987982098432" ;
   dcterms:isReferencedBy 
   <http://logd.tw.rpi.edu/source/twapperkeeper-com/dataset/stocks/version/2011-Mar-26> ;
   a stocks_vocab:Tweet , sioctypes:MicroblogPost ;
   sioc:content 
"High Volume Stock: stock analysis website - $ABC - http://www.dojispace.com/stock-picks/amerisourcebergen-stock-price-ABC.aspx" ;

But we'd like to not have to regex a tweet to find which stocks it mentions; we'd like to precompute it so we can query it as triples. This can be done with conversion:object_search, which specifies a regex to search the object of a triple, and -- for each match -- the predicate and object to assert on the original subject. (full enhancements file here.)

      conversion:enhance [
         ov:csvCol          1;
         ov:csvHeader       "text";
         conversion:domain_name "Tweet";
         conversion:domain_template "tweet_[#4]";
         conversion:equivalent_property sioc:content;
         #conversion:label   "text";
         conversion:comment "";
         conversion:range   rdfs:Literal;
         conversion:object_search [
            conversion:eg        "is website - $ABC - http:";
            conversion:regex     "\\$([^\\s]*)";
            conversion:predicate foaf:topic;
            conversion:object    "$[\\1]";
         ];
         conversion:object_search [
            conversion:regex     "\\$([^\\s]*)";
            conversion:predicate sioc:topic;
            conversion:object    "http://dbpedia.org/resource/[\\1]";
         ];
         conversion:object_search [
            conversion:regex     "\\$([^\\s]*)";
            conversion:predicate foaf:homepage;
            conversion:object    "[/sd][\\1]";
         ];
      ];

adds the following triples to those shown above (full output file here):

@prefix stocks_global_value: <http://logd.tw.rpi.edu/source/twapperkeeper-com/dataset/stocks/> .

stocks:tweet_14522987982098432
   foaf:topic   "$ABC" ;
   foaf:homepage stocks_global_value:ABC ;
   sioc:topic    dbpedia:ABC ;

Note that the enhancements use template variables (see Using template variables to construct new values), with additional [\1] variables that result from captured groups in the regex.
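
The three searches in this example can be approximated in Python as follows. This is a sketch under stated assumptions rather than the converter's code: `object_search` is a hypothetical helper, and `SD` hard-codes the expansion of [/sd] to the namespace bound to the stocks_global_value prefix in the output above.

```python
import re

# Assumed expansion of [/sd], copied from the stocks_global_value prefix above.
SD = "http://logd.tw.rpi.edu/source/twapperkeeper-com/dataset/stocks/"

def object_search(text, regex, predicate, template):
    """Return a (predicate, object) pair for every regex match in text."""
    pairs = []
    for m in re.finditer(regex, text):
        obj = template.replace("[/sd]", SD).replace(r"[\1]", m.group(1))
        pairs.append((predicate, obj))
    return pairs

tweet = ("High Volume Stock: stock analysis website - $ABC - "
         "http://www.dojispace.com/stock-picks/amerisourcebergen-stock-price-ABC.aspx")
ticker = r"\$([^\s]*)"   # a literal "$" followed by a run of non-whitespace

pairs = (
    object_search(tweet, ticker, "foaf:topic", r"$[\1]")
    + object_search(tweet, ticker, "sioc:topic", r"http://dbpedia.org/resource/[\1]")
    + object_search(tweet, ticker, "foaf:homepage", r"[/sd][\1]")
)
for predicate, obj in pairs:
    print(predicate, obj)
```

A tweet that mentions several tickers would produce one pair per mention for each search, since every match is expanded separately.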

Processing an annotation.

Given an "annotated cell value" in which a clean, processable string is followed by a long, messy one:

[Tabels, D2R Server, Jena, Virtuoso] * Tabels (Conversion XLS to RDF) [http://idi.fu...

We want:

:row
   dcterms:description "[Tabels, D2R Server, Jena, Virtuoso] * Tabels (Conversion XLS to RDF) [http://idi.fu...";
   dcterms:references :Tabels, :D2R_Server, :Jena, :Virtuoso;
.

The following hairball of a regex doesn't handle it. Even if it is possible, it certainly requires too much expertise and time to work out.

      conversion:enhance [
         ov:csvCol          10;
         ov:csvHeader       "lod 2 (org type)";
         conversion:object_search [                      # [Tabels, D2R Server, Jena, Virtuoso]
            conversion:regex     "^\\[([^,\\]]+)[,\\]]",
                                 "[^\\]]+?, ([^,\\]]+),",
                                 "^[^\\]]+, ([^,\\]]+)\\]", "^(academic)$";
            conversion:predicate dcterms:yippie;
            conversion:object    "[/sd]org/[\\1]";
         ];
      ]; 

So we combine conversion:object_search and conversion:delimits_object to first select the cell value substring to process, then to parse that string with the given delimiter:

      conversion:enhance [
         ov:csvCol          11;
         ov:csvHeader       "lod 3 (tools)";
         conversion:object_search [
            conversion:regex           "^\\[([^\\]]+)\\]";   # [Tabels, D2R Server, Jena, Virtuoso]
            conversion:delimits_object ",\\s*";            # "Tabels", "D2R Server", "Jena", and "Virtuoso"
            conversion:predicate dcterms:references;
            conversion:object    "[/sd]org/[\\1]";         # <http://purl.org/twc/lodcloud/id/tool/Virtuoso> 
         ];
      ]; 

Unfortunately, this doesn't allow us to leverage the full Resource handling (such as conversion:links_via).
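
The select-then-split idea in this section can be sketched in Python as two steps. The selecting regex here is written so that the captured group may contain commas, which is an assumption of this sketch, and the code only mimics the described behaviour; it says nothing about how the converter mints URIs from the resulting strings.

```python
import re

cell = ("[Tabels, D2R Server, Jena, Virtuoso] * Tabels "
        "(Conversion XLS to RDF) [http://idi.fu...")

# Step 1 (object_search): select the leading bracketed list.
selected = re.search(r"^\[([^\]]+)\]", cell)

# Step 2 (delimits_object): split the selected substring on the delimiter.
tools = re.split(r",\s*", selected.group(1)) if selected else []
print(tools)
```

Splitting only the selected substring keeps the commas in the rest of the cell from producing spurious values.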

What is next