Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure that landing-page information is present in MMD and that it is a valid url #126

Closed
johtoblan opened this issue Feb 6, 2023 · 26 comments · Fixed by #146
Closed
Assignees

Comments

@johtoblan
Copy link
Collaborator

johtoblan commented Feb 6, 2023

No description provided.

@mortenwh mortenwh changed the title Check if landing-page is present and a valid url Check that landing-page information is present in MMD and that it is a valid url Feb 6, 2023
@charlienegri
Copy link
Contributor

charlienegri commented Feb 16, 2023

is something like adding a check for the landing page in full_check (/~https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/tools/check_mmd.py#L281) enough? and what is the name of the attribute? I cannot find anything like landing_page or similar in either CF or ACDD so the name is probably something else.. and also it must not be a resource otherwise there would be no need to check the url, right?

@mortenwh
Copy link
Collaborator

is something like adding a check for the landing page in full_check (/~https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/tools/check_mmd.py#L281) enough?

Yes, I think so. Landing page is an MMD "related_information" field with type="Dataset landing page". It is not obvious that it should be there, as it is not required, so I'm not sure if we can use the mmd_strict.xsd to check this. Probably better to add some python lines.

and what is the name of the attribute?

Example:

<related_information>
    <type>Dataset landing page</type>
    <resource>https://data.met.no/dataset/<uuid></resource>
</related_information>

The url should be pointing to our production server, or maybe be dynamically created. The latter is maybe best, but then we have to handle this in dmci itself. This could be done based on the environment (dev, staging or production) and the dataset uuid. In this case, we need to add the field to the MMD string before distribution to file, pycsw and solr.

I cannot find anything like landing_page or similar in either CF or ACDD so the name is probably something else.. and also it must not be a resource otherwise there would be no need to check the url, right?

Not sure what you mean, but I think we can only check the url pattern. Anyway, the check is maybe not that important if the landing page field is added by dmci itself.

@mortenwh mortenwh changed the title Check that landing-page information is present in MMD and that it is a valid url Make sure that landing-page information is present in MMD and that it is a valid url Feb 16, 2023
@charlienegri
Copy link
Contributor

@mortenwh
Copy link
Collaborator

Yes, it could be done in py-mmd-tools but when I think about it, it may be better in dmci since it is strongly connected to the data site/system which is "known" in the dmci service. @johtoblan - what do you think? In that case, we can close metno/py-mmd-tools#225.

* as for the url check what  I mean is that the code currently already checks for all the urls that are in the "resources" category  /~https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/tools/check_mmd.py#L303

Ok, then that may be enough.

@mortenwh
Copy link
Collaborator

Discussed in review meeting now. We add data.met.no in the MMD files. This can be changed in dmci to adjust for dev and staging envs.

@mortenwh
Copy link
Collaborator

mortenwh commented Feb 23, 2023

Update: the landing page should be set by dmci for all envs (dev, staging and prod), since this is the entry point for the data management service. See metno/py-mmd-tools#225

@charlienegri
Copy link
Contributor

charlienegri commented Mar 14, 2023

should this happen only for /v1/create or /v1/insert kinda events or also for /v1/validate?
because if the former then it could be part of the _insert_update_method_post function here /~https://github.com/metno/discovery-metadata-catalog-ingestor/blob/main/dmci/api/app.py#L113
and instead if the latter then it must be probably its own function which will be called by _validate_method_post too and not just _insert_update_method_post

@mortenwh
Copy link
Collaborator

mortenwh commented Mar 14, 2023

It should be only the insert/create and possibly the update cases.

@johtoblan
Copy link
Collaborator Author

johtoblan commented Mar 14, 2023

So you use validate, get a Everything is ok, and then it fails on update/insert?

@charlienegri
Copy link
Contributor

charlienegri commented Mar 15, 2023

mmh I do not fully understand how _validate_method_post and post_validate() work in the context of the full chain, but for now I could write a function called something add_landing_page that would be called after checks like worker.validate(data) and then takes the byte string data and the uuid (or extracts is if we do not want modify other places so that it's passed out) and adds the needed chars via regex.. (easy to do it just before the last \n</mmd:mmd>\n if there is no related_information, not sure how to handle the case where the field is already present in the xml file..)... does this make sense? or if there's a better place to do this am all ears, the problem is that if we want to modify data as Morten was explaining, none of the existing functions do that already...

@mortenwh
Copy link
Collaborator

So you use validate, get a Everything is ok, and then it fails on update/insert?

Why should it fail on update/insert?

@mortenwh
Copy link
Collaborator

mmh I do not fully understand how _validate_method_post and post_validate() work in the context of the full chain, but for now I could write a function called something add_landing_page that would be called after checks like worker.validate(data) and then takes the byte string data and the uuid (or extracts is if we do not want modify other places so that it's passed out) and adds the needed chars via regex.. (easy to do it just before the last \n</mmd:mmd>\n

It needs to be in the right place in the MMD because of how the xsd is defined. Check mmd_strict.xsdto find the right place.

if there is no related_information, not sure how to handle the case where the field is already present in the xml file..)...

I think you'll need to replace it if it's already there, or check and replace if it is incorrect.

@johtoblan
Copy link
Collaborator Author

So you use validate, get a Everything is ok, and then it fails on update/insert?

Why should it fail on update/insert?

If we add validation to the update/insert that is not present in validate

@mortenwh
Copy link
Collaborator

So you use validate, get a Everything is ok, and then it fails on update/insert?

Why should it fail on update/insert?

If we add validation to the update/insert that is not present in validate

But that shouldn't be a problem as long as we know the internal functionality is correct? We can make sure it works using unit tests..?

@charlienegri
Copy link
Contributor

charlienegri commented Mar 15, 2023

mmd_strict.xsd

you mean /tests/files/mmd/mmd.xsd?
how can it be placed in the right order agnostically of what other fields are present? like, we do not expect a predictable number of fields in the mmd files, beyond the mandatory ones.. or am I missing something?
like for example, where would it be placed in an example like tests/files/api/passing.xml?

for reference, I was assuming at the end like

  </mmd:dataset_citation>  
  <mmd:related_information>
     <mmd:type>Dataset landing page</mmd:type>  
     <mmd:resource>https://data.met.no/dataset/a1ddaf0f-cae0-4a15-9b37-3468e9cb1a2b</mmd:resource>
  </mmd:related_information>  
</mmd:mmd>

@johtoblan
Copy link
Collaborator Author

mmd_strict.xsd

you mean /tests/files/mmd/mmd.xsd? how can it be placed in the right order agnostically of what other fields are present? like, we do not expect a predictable number of fields the mmd files, beyond the mandatory ones.. or am I missing something? like for example, where would it be placed in an example like tests/files/api/passing.xml?

/~https://github.com/metno/mmd/blob/master/xsd/mmd_strict.xsd

@magnarem
Copy link
Contributor

The host url for the landingpages should also change with respect to environments.

Now new precessed mmd-files sendt to mmd-xml-dev, still have https://data.met.no/dataset/
should be https://data-test.met.no/dataset/ for dev environment and https://data-staging.met.no/dataset/ for staging environment

@charlienegri
Copy link
Contributor

yeah, that will be read from the config

@mortenwh
Copy link
Collaborator

@magnarem - should we also change the metadata_identifier from no.met:<uuid> to no.met.dev:<uuid> and no.met.staging:<uuid> for the dev and staging envs as part of this?

@charlienegri
Copy link
Contributor

as far as I understand the order in a xsd template matters only inside of the xs:sequence indicator blocks, so if there is no related_information in data already, adding it before the last </mmd:mmd> should be fine right? otherwise I can't think of an easy way of doing it that would hold for whatever possible valid xml file/data string

@mortenwh
Copy link
Collaborator

I'm not sure what you should do but you can make a unit test which checks the final MMD against mmd_strict.xsd - that will tell you if it works or not :)

@charlienegri
Copy link
Contributor

charlienegri commented Mar 16, 2023

@magnarem - should we also change the metadata_identifier from no.met:<uuid> to no.met.dev:<uuid> and no.met.staging:<uuid> for the dev and staging envs as part of this?

does this mean that we assume a new entry in the config? e.g. environment_string, something like

dmci:
  distributors:
    - file
    - pycsw
  distributor_cache: null
  rejected_jobs_path: null
  max_permitted_size: 100000
  mmd_xsl_path: null
  mmd_xsd_path: null
  path_to_parent_list: null

environment:
  environment_string: null

pycsw:
  csw_service_url: http://localhost

web_catalog:
  catalog_url: http://localhost

file:
  file_archive_path: null

?

@charlienegri charlienegri mentioned this issue Mar 20, 2023
10 tasks
@charlienegri charlienegri linked a pull request Mar 21, 2023 that will close this issue
10 tasks
@mortenwh
Copy link
Collaborator

@charlienegri - I suggest changing the config vars to something more explicit, e.g.,

environment:
  environment_string: null

to

overrides:
  namespace: null
  catalog_site_url: null

"overrides" means that we will override the content of the input MMD file.. What do you think?

@charlienegri
Copy link
Contributor

charlienegri commented Mar 21, 2023

sure @mortenwh, fine by me, the idea of having something _string is that I though we woud preserve the existing namespace and just append .dev/.staging or nothing, but I guess there's no way there can be anything else, or we want to have anything else than no.met perhaps?? if that's the case then having the whole namespace in the conf instead of just environment_string is ok I guess

@charlienegri
Copy link
Contributor

but anyway I'll change from web_catalog to overrides in the PR attached to this issue, so that it's already there for issue #147

@mortenwh
Copy link
Collaborator

Sounds good! In case others want to use dmci, I guess the full namespace (e.g., no.met.staging) is best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants