Add ckanext-dcat custom profiles

- Add profiles for DCAT-AP, GeoDCAT-AP and NTI-RISP/DCAT (Spanish context).
- Added new codelists generator/downloader to improve DCAT-AP mapping values.
mjanez · Aug 22, 2024 · d949571
1 parent 34aedfb

Showing 34 changed files with 4,160 additions and 12 deletions.
diff --git a/README.md b/README.md
@@ -7,22 +7,24 @@
     <a href="#configuration">Configuration</a> •
     <a href="#schemas">Schemas</a> •
     <a href="#harvesters">Harvesters</a> •
+    <a href="#dcat-profiles">DCAT Profiles</a> •
     <a href="#running-the-tests">Running the Tests</a>
 </p>
 
 ## Overview
-This CKAN extension provides functions and templates specifically designed to extend `ckanext-scheming` and includes DCAT and Harvest enhancements to adapt CKAN Schema to [GeoDCAT-AP](./ckanext/schemingdcat/schemas/geodcat_ap/es_geodcat_ap_2.yaml).
+This CKAN extension provides functions and templates designed to extend `ckanext-scheming` and `ckanext-dcat`. It includes RDF profiles and harvest enhancements that adapt the CKAN schema to multiple metadata profiles, such as [GeoDCAT-AP](./ckanext/schemingdcat/schemas/geodcat_ap/eu_geodcat_ap_2.yaml) or [DCAT-AP](./ckanext/schemingdcat/schemas/dcat_ap/eu_dcat_ap_2.1.yaml).
 
 > [!WARNING] 
-> Requires [mjanez/ckanext-dcat](https://github.com/mjanez/ckanext-dcat), [ckan/ckanext-scheming](https://github.com/ckan/ckanext-scheming) and [ckan/ckanext-spatial](https://github.com/ckan/ckanext-spatial) to work properly.
-
+> Requires [mjanez/ckanext-dcat](https://github.com/mjanez/ckanext-dcat) (newer releases) or [ckan/ckanext-dcat](https://github.com/ckan/ckanext-dcat) (stable releases), [ckan/ckanext-scheming](https://github.com/ckan/ckanext-scheming) and [ckan/ckanext-spatial](https://github.com/ckan/ckanext-spatial) to work properly. To use custom schemas with multilingual fields, ckanext-fluent is also required; a version with corrections is available at [mjanez/ckanext-fluent](https://github.com/mjanez/ckanext-fluent).
 > [!TIP]
 > It is **recommended to use with:** [`ckan-docker`](https://github.com/mjanez/ckan-docker) deployment or only use [`ckan-pycsw`](https://github.com/mjanez/ckan-pycsw) to deploy a CSW Catalog.
 
 ![image](https://github.com/mjanez/ckanext-schemingdcat/assets/96422458/6b3d6fd4-7119-4307-8be7-5e17d41292fe)
 
 Enhancements:
-- Could use schemas for `ckanext-scheming` in the plugin like [CKAN GeoDCAT-AP custom schemas](ckanext/schemingdcat/schemas#readme)
+- Custom schemas for `ckanext-scheming` in the plugin like [CKAN GeoDCAT-AP custom schemas](ckanext/schemingdcat/schemas#readme)
+- [`ckanext-dcat` profiles](#dcat-profiles) for RDF serialization according to profiles such as DCAT, DCAT-AP, GeoDCAT-AP and in the Spanish context, NTI-RISP.
+- Improve metadata management forms to include tabs that make it easier to search metadata categories and simplify metadata editing.
 - Improve the search functionality in CKAN for custom schemas. It uses the fields defined in a scheming file to provide a set of tools to use these fields for scheming, and a way to include icons in their labels when displaying them. More info: [`ckanext-schemingdcat`](https://github.com/mjanez/ckanext-schemingdcat)
 - Add improved harvesters for custom metadata schemas integrated with `ckanext-harvest` in CKAN using [`mjanez/ckan-ogc`](https://github.com/mjanez/ckan-ogc).
 - Add Metadata downloads for Linked Open Data formats ([`mjanez/ckanext-dcat`](https://github.com/mjanez/ckanext-dcat)) and Geospatial Metadata (ISO 19139, Dublin Core, etc. with [`mjanez/ckan-pycsw`](https://github.com/mjanez/ckan-pycsw))
@@ -40,17 +42,20 @@ This plugin is compatible with CKAN 2.9 or later and needs the following plugins
   ## ckan/ckanext-scheming: https://github.com/ckan/ckanext-scheming/tags (e.g. release-3.0.0)
   pip install -e git+https://github.com/ckan/ckanext-scheming.git@release-3.0.0#egg=ckanext-scheming
 
-  ## mjanez/ckanext-dcat: https://github.com/mjanez/ckanext-dcat/tags (e.g. 1.2.0-geodcatap)
-  pip install -e git+https://github.com/mjanez/ckanext-dcat.git@1.2.0-geodcatap#egg=ckanext-dcat
+  ## mjanez/ckanext-dcat: https://github.com/mjanez/ckanext-dcat/tags (e.g. 1.8.0)
+  pip install -e git+https://github.com/mjanez/ckanext-dcat.git@1.8.0#egg=ckanext-dcat
   pip install -r https://raw.githubusercontent.com/mjanez/ckanext-dcat/master/requirements.txt
 
-  ## ckan/ckckanext-spatial: https://github.com/ckan/ckanext-spatial/tags (e.g. v2.1.1)
+  ## ckan/ckanext-spatial: https://github.com/ckan/ckanext-spatial/tags (e.g. v2.1.1)
   pip install -e git+https://github.com/ckan/ckanext-spatial.git@v2.1.1#egg=ckanext-spatial
   pip install -r https://raw.githubusercontent.com/ckan/ckanext-spatial/v2.1.1/requirements.txt
 
-  ## ckan/ckckanext-harvest: https://github.com/ckan/ckanext-harvest/tags (e.g. v1.5.6)
+  ## ckan/ckanext-harvest: https://github.com/ckan/ckanext-harvest/tags (e.g. v1.5.6)
   pip install -e git+https://github.com/ckan/ckanext-harvest.git@v1.5.6#egg=ckanext-harvest
   pip install -r https://raw.githubusercontent.com/ckan/ckanext-harvest/v1.5.6/requirements.txt
+
+  ## mjanez/ckanext-fluent: https://github.com/mjanez/ckanext-fluent/tags (e.g. v1.0.1)
+  pip install -e git+https://github.com/mjanez/ckanext-fluent.git@v1.0.1#egg=ckanext-fluent
   ```
 
 ## Installation
@@ -119,13 +124,13 @@ Examples:
 
 * LOD endpoint: A Linked Open Data endpoint is a DCAT endpoint that provides access to RDF data. More information about the catalogue endpoint and how to use it (e.g. `https://{ckan-instance-host}/catalog.{format}?[page={page}]&[modified_since={date}]&[profiles={profile1},{profile2}]&[q={query}]&[fq={filter query}]`) is available at [`ckanext-dcat`](https://github.com/mjanez/ckanext-dcat?tab=readme-ov-file#catalog-endpoint).
     ```yaml
-      - name: euro_dcat_ap_2_rdf
+      - name: eu_dcat_ap_2_rdf
         display_name: RDF DCAT-AP
         type: lod
         format: rdf
-        image_display_url: /images/icons/endpoints/euro_dcat_ap_2.svg
+        image_display_url: /images/icons/endpoints/eu_dcat_ap_2.svg
         description: RDF DCAT-AP Endpoint for european data portals.
-        profile: euro_dcat_ap_2
+        profile: eu_dcat_ap_2
         profile_label: DCAT-AP
         version: null
     ```
@@ -138,7 +143,7 @@ Examples:
         format: xml
         image_display_url: /images/icons/endpoints/csw_inspire.svg
         description: OGC-INSPIRE Endpoint for spatial metadata.
-        profile: spain_dcat
+        profile: es_dcat
         profile_label: INSPIRE
         version: 2.0.2
     ```
@@ -769,6 +774,74 @@ The `ckan schemingdcat` command offers utilites:
 
     ckan schemingdcat download-rdf-eu-vocabs
 
+
+## DCAT Profiles
+This plugin also contains custom [`ckanext-dcat` profiles](./ckanext/schemingdcat/profiles) to serialize a CKAN dataset to:
+
+**European context**:
+* [DCAT-AP v2.1.1](https://semiceu.github.io/DCAT-AP/releases/2.1.1/) (default): `eu_dcat_ap_2`
+* [GeoDCAT-AP v2.0.0](https://semiceu.github.io/GeoDCAT-AP/releases/2.0.0/): `eu_geodcat_ap_2`
+* [GeoDCAT-AP v3.0.0](https://semiceu.github.io/GeoDCAT-AP/releases/3.0.0/): `eu_geodcat_ap_3`
+
+**Spanish context**:
+* Spain [NTI-RISP v1.0.0](https://datos.gob.es/es/documentacion/normativa-de-ambito-nacional): `es_dcat`
+* Spain [DCAT-AP v2.1.1](https://semiceu.github.io/DCAT-AP/releases/2.1.1/): `es_dcat_ap_2`
+* Spain [GeoDCAT-AP v2.0.0](https://semiceu.github.io/GeoDCAT-AP/releases/2.0.0/): `es_geodcat_ap_2`
+
+To define which profiles to use you can:
+
+1. Set the `ckanext.dcat.rdf.profiles` configuration option on your CKAN configuration file:
+
+    ckanext.dcat.rdf.profiles = eu_dcat_ap_2 es_dcat eu_geodcat_ap_2
+
+2. When initializing a parser or serializer class, pass the profiles to be used as a parameter, e.g.:
+
+```python
+from ckanext.dcat.processors import RDFParser, RDFSerializer
+
+parser = RDFParser(profiles=['eu_dcat_ap_2', 'es_dcat', 'eu_geodcat_ap_2'])
+serializer = RDFSerializer(profiles=['eu_dcat_ap_2', 'es_dcat', 'eu_geodcat_ap_2'])
+```
+
+Note that in both cases the order in which you define the profiles matters, because they are run in that order.
+
+### Multilingual RDF support
+To add multilingual values from CKAN to RDF, the [`SchemingDCATRDFProfile` method `_object_value`](./ckanext/schemingdcat/profiles/base.py) can be called with the optional parameter `multilang=true` (defaults to `false`).
+If `_object_value` is called with `multilang=true` but no language attribute is found, the value is added as a Literal with the default language (`en`).
+
+>[!TIP]
+> The custom `ckanext-dcat` profiles support multiple languages. See the ckanext-dcat documentation for more information on [writing custom profiles](https://github.com/ckan/ckanext-dcat?tab=readme-ov-file#writing-custom-profiles).
+
+Example RDF:
+```xml
+<dct:title xml:lang="en">Dataset Title (EN)</dct:title>
+<dct:title xml:lang="de">Dataset Title (DE)</dct:title>
+<dct:title xml:lang="fr">Dataset Title (FR)</dct:title>
+```
+```json
+{
+    "title":
+        {
+            "en": "Dataset Title (EN)",
+            "de": "Dataset Title (DE)",
+            "fr": "Dataset Title (FR)"
+        }
+}
+```
+
+Example with missing language in RDF:
+```xml
+<dct:title>Dataset Title</dct:title>
+```
+```json
+{
+    "title":
+        {
+            "en": "Dataset Title"
+        }
+}
+```
+
 ## Running the Tests
 To run the tests:
 

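The multilingual fallback rule described in the README changes above can be illustrated with a minimal sketch. It is not the code of `_object_value`: the helper `to_lang_pairs` and the constant `DEFAULT_LANG` are hypothetical names, and the sketch only models the rule that a translated field yields one language-tagged value per language while a plain string receives the default language.

```python
DEFAULT_LANG = "en"  # assumption: mirrors the default language named in the README


def to_lang_pairs(value, default_lang=DEFAULT_LANG):
    """Return (text, lang) pairs for a CKAN field value.

    Hypothetical helper for illustration; not part of ckanext-schemingdcat.
    """
    if isinstance(value, dict):
        # Multilingual field: one pair per non-empty translation.
        return [(text, lang) for lang, text in value.items() if text]
    if value:
        # Plain string without a language attribute: use the default language.
        return [(value, default_lang)]
    return []


pairs = to_lang_pairs({"en": "Dataset Title (EN)", "de": "Dataset Title (DE)"})
fallback = to_lang_pairs("Dataset Title")
```

In a real profile each pair would presumably become an rdflib `Literal(text, lang=lang)` attached to the dataset node.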
diff --git a/ckanext/schemingdcat/codelists.py b/ckanext/schemingdcat/codelists.py
@@ -0,0 +1,176 @@
+import csv
+import requests
+from datetime import datetime
+from pathlib import Path
+import os
+import logging
+
+# third-party libraries
+from rdflib import Graph, Namespace, RDF, URIRef, Literal
+from xml.etree import ElementTree as ET
+
+from ckanext.dcat.profiles.base import (
+    RDF,
+    SKOS
+)
+
+from ckanext.schemingdcat.profiles.dcat_config import (
+    EU_VOCABS_DIR,
+    INSPIRE_CODELISTS_DIR,
+    EUROVOC
+)
+
+log = logging.getLogger(__name__)
+
+
+def load_inspire_csv_codelists():
+    # Check if the codelists directory exists
+    csv_subdir = INSPIRE_CODELISTS_DIR.joinpath("csv")
+    if csv_subdir.exists() and csv_subdir.is_dir():
+        codelist_paths = list(csv_subdir.glob("*.csv"))
+    else:
+        codelist_paths = list(INSPIRE_CODELISTS_DIR.glob("*.csv"))
+
+    codelists_dfs = {}
+
+    log.debug('INSPIRE_CODELISTS_DIR: %s', INSPIRE_CODELISTS_DIR)
+
+    # Iterate over file paths and read in data
+    for file_path in codelist_paths:
+        with file_path.open("r", newline="", encoding="utf-8") as f:
+            reader = csv.DictReader(f)
+            df = list(reader)
+            file_name = file_path.stem.lower()
+            codelists_dfs[file_name] = df
+
+    # INSPIRE Codelists
+    MD_INSPIRE_REGISTER = [item for df in codelists_dfs.values() for item in df]
+
+    return {
+        'MD_INSPIRE_REGISTER': MD_INSPIRE_REGISTER,
+        'MD_FORMAT': codelists_dfs.get('file-type'),
+        'MD_ES_THEMES': codelists_dfs.get('theme_es'),
+        'MD_EU_THEMES': codelists_dfs.get('theme-dcat_ap'),
+        'MD_EU_LANGUAGES': codelists_dfs.get('languages'),
+        'MD_ES_FORMATS': codelists_dfs.get('format_es')
+    }
+
+class RdfFile:
+    def __init__(self, name, url, description, title):
+        self.name = name
+        self.url = url
+        self.title = title
+        self.description = description
+
+    def extract_description(self, rdf_content, rdf_url):
+        raise NotImplementedError
+
+    def parse_graph(self, rdf_content):
+        return Graph().parse(data=rdf_content, format='xml')
+
+    def get_label_from_uri(self, uri):
+        return uri.split('/')[-1]
+
+    def save_to_csv(self, data, filename):
+        file_path = EU_VOCABS_DIR / 'csv' / filename
+        # Remove any None elements from the data list
+        data = [d for d in data if d is not None]
+        sorted_data = sorted(data, key=lambda x: x[1])  # Sort by label (2nd column)
+        # Open the file in write mode, which will overwrite the file if it exists
+        with open(file_path, 'w', newline='', encoding='utf-8') as csvfile:
+            writer = csv.writer(csvfile)
+            writer.writerows(sorted_data)
+        log.info(f"Data extracted and saved to {file_path}")
+
+    def download_rdf(self, rdf_url):
+        try:
+            response = requests.get(rdf_url)
+            response.raise_for_status()
+            log.info(f"Successfully downloaded RDF from {rdf_url}")
+            return response.content
+        except requests.RequestException as e:
+            log.error(f"Failed to download RDF from {rdf_url}: {e}")
+            return None
+
+    def save_to_rdf(self, data, filename):
+        graph = Graph()
+        for item in data:
+            uri = URIRef(item[0])
+            label = Literal(item[1])
+            graph.add((uri, RDF.type, SKOS.Concept))
+            graph.add((uri, SKOS.prefLabel, label))
+            if len(item) > 2:
+                eu_uri = URIRef(item[2])
+                graph.add((uri, SKOS.exactMatch, eu_uri))
+
+        file_path = f"{filename}.rdf"
+        graph.serialize(destination=file_path, format='xml')
+        log.info(f"Data saved to RDF file at {file_path}")
+
+class BasicRdfFile(RdfFile):
+    def extract_description(self, rdf_content, rdf_url):
+        graph = self.parse_graph(rdf_content)
+        data = set()
+
+        for concept in graph.subjects():
+            uri = str(concept)
+            label = self.get_label_from_uri(uri)
+            if uri != rdf_url and label != self.get_label_from_uri(rdf_url):
+                data.add((uri, label))
+
+        return data
+
+class LicenseRdfFile(RdfFile):
+    def extract_description(self, rdf_content, rdf_url):
+        graph = self.parse_graph(rdf_content)
+        data = set()
+
+        for concept in graph.subjects(RDF.type, SKOS.Concept):
+            label = self.get_label_from_uri(concept)
+            eu_uri = concept
+            uri = str(graph.value(concept, SKOS.exactMatch, default=eu_uri))
+            if str(concept) != rdf_url and label != self.get_label_from_uri(rdf_url):
+                data.add((uri, label, eu_uri))
+
+        return data
+
+class FileTypesRdfFile(RdfFile):
+    def extract_description(self, rdf_content, rdf_url):
+        graph = self.parse_graph(rdf_content)
+        data = set()
+        non_proprietary_data = set()
+        machine_readable_data = set()
+
+        for concept in graph.subjects(RDF.type, EUROVOC.FileType):
+            uri = str(concept)
+            label = self.get_label_from_uri(uri)
+            non_prop_ext = str(graph.value(concept, EUROVOC.nonPropExt, default="false"))
+
+            if uri != rdf_url and label != self.get_label_from_uri(rdf_url):
+                data.add((uri, label, non_prop_ext))
+                machine_readable_data.add((uri, label))
+                if non_prop_ext == "true":
+                    non_proprietary_data.add((uri, label))
+
+        self.save_to_csv(non_proprietary_data, "non-propietary.csv")
+        self.save_to_csv(machine_readable_data, "machine-readable.csv")
+
+        return data
+
+class MediaTypesRdfFile(RdfFile):
+    def extract_description(self, xml_content, rdf_url):
+        data = set()
+        tree = ET.ElementTree(ET.fromstring(xml_content))
+        root = tree.getroot()
+
+        for record in root.findall(".//{http://www.iana.org/assignments}record"):
+            name_elem = record.find("{http://www.iana.org/assignments}file")
+            name = name_elem.text if name_elem is not None else ""
+            label_elem = record.find("{http://www.iana.org/assignments}file")
+            label = label_elem.text if label_elem is not None else ""
+
+            if name != self.get_label_from_uri(rdf_url):
+                uri = f"http://www.iana.org/assignments/media-types/{name}"
+                data.add((uri, label))
+
+        return data
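The namespace-qualified lookup used by `MediaTypesRdfFile.extract_description` can be tried in isolation with a small standalone sketch. The XML sample is a hypothetical, trimmed-down stand-in for the IANA registry file, and `extract_media_types` is an illustrative name. Unlike the method in the diff, this sketch skips records that have no `file` element instead of emitting an empty name.

```python
from xml.etree import ElementTree as ET

IANA_NS = "{http://www.iana.org/assignments}"

# Hypothetical sample; the real registry file is much larger.
SAMPLE_XML = """<registry xmlns="http://www.iana.org/assignments">
  <registry id="application">
    <record><name>json</name><file type="template">application/json</file></record>
    <record><name>example</name></record>
  </registry>
</registry>"""


def extract_media_types(xml_content):
    """Collect (uri, label) tuples from every record that has a file element."""
    root = ET.fromstring(xml_content)
    data = set()
    for record in root.findall(f".//{IANA_NS}record"):
        file_elem = record.find(f"{IANA_NS}file")
        if file_elem is None or not file_elem.text:
            continue  # skip records without a template path
        name = file_elem.text
        data.add((f"http://www.iana.org/assignments/media-types/{name}", name))
    return data


media_types = extract_media_types(SAMPLE_XML)
```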
diff --git a/ckanext/schemingdcat/codelists/inspire/csv/IACSData.es.csv b/ckanext/schemingdcat/codelists/inspire/csv/IACSData.es.csv
@@ -0,0 +1,7 @@
+id,label
+http://inspire.ec.europa.eu/metadata-codelist/IACSData/lpis,lpis
+http://inspire.ec.europa.eu/metadata-codelist/IACSData/gsaa,gsaa
+http://inspire.ec.europa.eu/metadata-codelist/IACSData/iacs,iacs
+http://inspire.ec.europa.eu/metadata-codelist/IACSData/referenceParcel,referenceParcel
+http://inspire.ec.europa.eu/metadata-codelist/IACSData/agriculturalArea,agriculturalArea
+http://inspire.ec.europa.eu/metadata-codelist/IACSData/ecologicalFocusArea,ecologicalFocusArea
diff --git a/ckanext/schemingdcat/codelists/inspire/csv/MaintenanceFrequency.es.csv b/ckanext/schemingdcat/codelists/inspire/csv/MaintenanceFrequency.es.csv
@@ -0,0 +1,13 @@
+id,label
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/continual,continual
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/daily,daily
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/weekly,weekly
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/fortnightly,fortnightly
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/monthly,monthly
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/quarterly,quarterly
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/biannually,biannually
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/annually,annually
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/asNeeded,asNeeded
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/irregular,irregular
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/notPlanned,notPlanned
+http://inspire.ec.europa.eu/metadata-codelist/MaintenanceFrequency/unknown,unknown
diff --git a/ckanext/schemingdcat/codelists/inspire/csv/OnLineDescriptionCode.es.csv b/ckanext/schemingdcat/codelists/inspire/csv/OnLineDescriptionCode.es.csv
@@ -0,0 +1,3 @@
+id,label
+http://inspire.ec.europa.eu/metadata-codelist/OnLineDescriptionCode/accessPoint,accessPoint
+http://inspire.ec.europa.eu/metadata-codelist/OnLineDescriptionCode/endPoint,endPoint
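Codelist files with this `id,label` layout are read by `load_inspire_csv_codelists` through `csv.DictReader`. As a rough sketch of how such rows could be turned into a lookup table, the following uses an inline copy of the `OnLineDescriptionCode` rows; `build_label_lookup` is a hypothetical helper, not something the commit defines.

```python
import csv
import io

# Inline copy of the OnLineDescriptionCode codelist rows shown in the diff.
SAMPLE_CSV = """id,label
http://inspire.ec.europa.eu/metadata-codelist/OnLineDescriptionCode/accessPoint,accessPoint
http://inspire.ec.europa.eu/metadata-codelist/OnLineDescriptionCode/endPoint,endPoint
"""


def build_label_lookup(csv_text):
    """Map each codelist URI (the id column) to its label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: row["label"] for row in reader}


lookup = build_label_lookup(SAMPLE_CSV)
```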