
Make sure that parent datasets are present in the same environment as all its children #152

Closed
3 tasks
mortenwh opened this issue Apr 11, 2023 · 14 comments · Fixed by #155

@mortenwh
Collaborator

mortenwh commented Apr 11, 2023

The metadata_identifier element in the MMD files has identifiers of the form naming_authority:uuid.

  • The parent-child relationships defined in child datasets must be updated to the correct naming_authority:uuid of the parent

  • Because of the way the discovery metadata databases are designed, both parent and child datasets should be in the same environment. Therefore, we need to update the naming_authority part of the ids pointing to parents, both for existing datasets and those coming into dmci. That is, DMCI must update the naming_authority in the same way as it does for the metadata_id, by adding "-dev" or "-staging".

  • Since the parents need to be present in the same environment, we also need to check that the referenced parent exists, and reject the MMD file if the referenced parent does not exist.
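The suffix rewrite described in the second point can be sketched roughly as follows. This is a minimal sketch, not the actual dmci code; the helper name is an assumption, and note that the thread below also discusses a "." separator (e.g. test.no.dev) rather than "-".

```python
# Minimal sketch of the environment-suffix rewrite described above.
# The function name and suffix handling are assumptions, not actual dmci code.
def add_env_suffix(identifier: str, env_suffix: str) -> str:
    """Insert the environment suffix (e.g. "-dev" or "-staging") into the
    naming_authority part of a naming_authority:uuid identifier."""
    naming_authority, uuid = identifier.split(":", 1)
    return f"{naming_authority}{env_suffix}:{uuid}"
```

For example, add_env_suffix("test.no:64db6102-14ce-41e9-b93b-61dbb2cb8b4e", "-dev") would give "test.no-dev:64db6102-14ce-41e9-b93b-61dbb2cb8b4e".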

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

I have an idea of how to implement the first part (the second point), because it's very close to what I did for the metadata_identifier:

Basically, we re.search for a block <mmd:related_dataset relation_type="parent">[namespace]:[uuid]</mmd:related_dataset>, and if there is one we replace the namespace inside it, whatever it is, with the same namespace plus .dev or .staging if in dev or staging,
so that it ends up, for example, as <mmd:related_dataset relation_type="parent">test.no.dev:64db6102-14ce-41e9-b93b-61dbb2cb8b4e</mmd:related_dataset> (sounds ok @magnarem ?)

But I am not sure how to go about the check that the parent is actually present in the database...
Would it require something like a curl query on CSW based on the uuid (like the one in the example Shamly wrote for the docs)?

@mortenwh
Collaborator Author

Yes, the last one is a bit tricky, especially since we have three metadata stores (although solr is not yet in place). I guess we'll need to search all three (two for now)? The different stores must also be searched in different ways, so we should implement a search function in each distributor, and ensure that every distributor has one. This could probably be handled like the run function in distributor.py:

def search(self, id):
    """The main search function to be implemented in each subclass."""
    raise NotImplementedError
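A subclass would then override this stub with a store-specific implementation. A hypothetical sketch of the pattern (the subclass name and archive layout are illustrative assumptions, not the actual distributors):

```python
# Hypothetical sketch: an abstract base with a search stub, plus a subclass
# providing a store-specific implementation. Class names and the archive
# structure are illustrative assumptions, not the actual dmci distributors.
class Distributor:
    def search(self, id):
        """The main search function to be implemented in each subclass."""
        raise NotImplementedError

class FileDist(Distributor):
    """A distributor backed by a simple uuid -> file path mapping."""
    def __init__(self, archive):
        self._archive = archive

    def search(self, id):
        """Return True if a dataset with this uuid exists in the archive."""
        return id in self._archive
```

The base class guarantees that calling search on a distributor that forgot to implement it fails loudly, which is what makes the tests catch a missing implementation.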

@magnarem
Contributor

@charlienegri, this sounds like a good implementation. Yes, the querying of the database can be done using the examples that Shamly wrote for the docs.

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

> Yes, the last one is a bit tricky, especially since we have three metadata stores (although solr is not yet in place). I guess we'll need to search all three (two for now)? The different stores must also be searched in different ways, so we should implement a search function in each distributor, and ensure that every distributor has one. This could probably be handled like the run function in distributor.py:
>
>     def search(self, id):
>         """The main search function to be implemented in each subclass."""
>         raise NotImplementedError

Ah, this means that the check will not be part of validate in worker.py like the namespace + landing page replacements, as I was thinking... Makes sense, but I am not sure I understand how

    def run(self):
        """The main run function to be implemented in each subclass."""
        raise NotImplementedError

is meant to work. I guess we can discuss it next week.

@shamlymajeed
Contributor

shamlymajeed commented Apr 12, 2023

In the code below from worker.py, we need to call obj.search() as well while validating:

if obj.is_valid():
    obj_status, obj_msg = obj.run()

Is it like that, @mortenwh, @charlienegri?

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

But we want to run the search before we validate, or better, validation is dependent on the search being successful (IF there is a <mmd:related_dataset relation_type="parent">[namespace]:[uuid]</mmd:related_dataset> block, that is), right?

@mortenwh
Collaborator Author

@charlienegri

> Ah, this means that the check will not be part of validate in worker.py like the namespace + landing page replacements, as I was thinking... Makes sense, but I am not sure I understand how

The search function must be implemented in distributor.py but it must be called from worker.py. So you check it in, e.g., the validate function.
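The flow could look roughly like this. A hypothetical sketch only: the function name, the distributors argument, and the return convention are assumptions about how worker.py would wire this up, not the actual implementation.

```python
# Hypothetical sketch of the flow described above: the worker extracts the
# parent identifier and asks each configured distributor whether the parent
# exists. The function name and distributor interface are assumptions.
import re

def check_parent_exists(data: bytes, distributors) -> tuple[bool, str]:
    """Reject the file if it references a parent no store can find."""
    match = re.search(
        b'<mmd:related_dataset relation_type="parent">(.+?)</mmd:related_dataset>',
        data,
    )
    if match is None:
        return True, "No parent reference, nothing to check"
    # The identifier is naming_authority:uuid; take the uuid part.
    parent_uuid = match.group(1).decode().split(":", 1)[-1]
    if any(dist.search(parent_uuid) for dist in distributors):
        return True, "Parent found"
    return False, f"Parent dataset {parent_uuid} not found in the catalog"
```

If the parent is missing, the False status and the message would be what validation returns to the user.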

@mortenwh
Collaborator Author

Sorry, I didn't read the full thread...

> But we want to run the search before we validate, or better, validation is dependent on the search being successful (IF there is a <mmd:related_dataset relation_type="parent">[namespace]:[uuid]</mmd:related_dataset> block, that is), right?

Yes, if there is that relation type, we need to check that the parent exists. If it doesn't, validation should fail with an appropriate error message to be returned to the user.

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

Mmh, why implement the search in distributor.py then, at that point?
It needs to be based on the uuid in the <mmd:related_dataset relation_type="parent"> block, whose content we also need to swap if in dev or staging
(something like

match_parent_block = re.search(
    b'<mmd:related_dataset relation_type="parent">(.+?)</mmd:related_dataset>',
    data
)
found_parent_block = match_parent_block.group(1)
old_namespace, uuid = found_parent_block.decode().split(":")
new_namespace = f"{old_namespace}.{env_string}"
new_parent_block = f"{new_namespace}:{uuid}".encode()
data_mod = data.replace(found_parent_block, new_parent_block)

)
So we could define the search in the same place, in worker.py, similar to def _check_information_content(self, data), something like def _search_parent_uuid(self, parent_uuid)...
see e.g. a draft here (the function is dummy for now) /~https://github.com/metno/discovery-metadata-catalog-ingestor/blob/fix-issue-152/dmci/api/worker.py#L97

@mortenwh
Collaborator Author

> Mmh, why implement the search in distributor.py then, at that point?

When it is implemented in distributor.py as suggested above, we make sure that the tests fail if that function is not implemented for any distributor. So we also need to implement it in each distributor, but then as customized versions. For example, the file distributor needs a search function that is different from the search functions in the solr and pycsw distributors.

I agree that the environment customization can be done in worker.py but not the actual search function because that is specific to the distributors. I guess the search function could be called from within _check_information_content.
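To illustrate why search must be distributor-specific, here is a hypothetical sketch of two different implementations: a file-based one that looks in the archive on disk, and a pycsw-flavoured one that builds a standard CSW GetRecordById request. Class names, the archive layout, and the endpoint handling are assumptions, not the actual dmci distributors.

```python
# Hypothetical sketch: each distributor implements search in its own way.
# Class names, archive layout, and endpoint handling are assumptions.
from pathlib import Path
from urllib.parse import urlencode

class FileDist:
    """Search by looking for an archived MMD file named after the uuid."""
    def __init__(self, archive_dir):
        self._archive_dir = Path(archive_dir)

    def search(self, uuid):
        return any(self._archive_dir.rglob(f"{uuid}.xml"))

class PyCSWDist:
    """Search by querying the catalog with a CSW GetRecordById request."""
    def __init__(self, endpoint):
        self._endpoint = endpoint

    def _get_record_url(self, uuid):
        # GetRecordById is a standard CSW 2.0.2 operation; the actual
        # request would be sent with an HTTP client and the response parsed.
        query = urlencode({
            "service": "CSW",
            "version": "2.0.2",
            "request": "GetRecordById",
            "id": uuid,
        })
        return f"{self._endpoint}?{query}"
```

The worker can then treat all distributors uniformly through the shared search interface while each one talks to its own store.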

@mortenwh
Collaborator Author

Also, the search is basically a dataset search, so it is enough to call the function search. It can be called like this:

parent = obj.search(parent_uuid)

And I now see that maybe it should be implemented in the distributors' is_valid functions... Maybe you just have to play with it a bit. The main point is that the search method will differ between distributors.

@charlienegri
Contributor

charlienegri commented Apr 13, 2023

ok, I see, but if we define a

    def search(self, uuid):
        """The search function for a UUID to be implemented in each subclass."""
        # raise NotImplementedError
        return True

in distributor.py, then in order to be able to use obj.search(parent_uuid) in validate() (in worker.py) we need to initialize obj in a similar way to how it is done in distribute(), right? There it's

for dist in self._conf.call_distributors:
    if dist not in self.CALL_MAP:
        skipped.append(dist)
        continue
    obj = self.CALL_MAP[dist](
        self._dist_cmd,
        xml_file=self._dist_xml_file,
        metadata_id=self._dist_metadata_id,
        worker=self,
        path_to_parent_list=self._kwargs.get("path_to_parent_list", None),
    )

We need a simpler call here, but I guess some kind of duplication with distribute is inevitable, because it makes sense that the parent namespace swap would happen in validate, and thus the search too...
I wonder if

for dist in self.CALL_MAP:
    obj = self.CALL_MAP[dist](self._dist_cmd)

would be enough at that point
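The simplified instantiation wondered about above could be sketched like this. A hypothetical sketch only: the Worker shape, CALL_MAP contents, and the helper name _search_all are assumptions about how this would fit into worker.py.

```python
# Hypothetical sketch of the simplified loop: build one distributor per
# CALL_MAP entry with only the command argument, then ask each whether the
# parent uuid exists. Names and structure are assumptions, not actual code.
class Worker:
    CALL_MAP = {}  # dist name -> distributor class, as in the real worker

    def __init__(self, dist_cmd):
        self._dist_cmd = dist_cmd

    def _search_all(self, parent_uuid):
        """Return True if any configured distributor finds the parent."""
        for dist in self.CALL_MAP:
            obj = self.CALL_MAP[dist](self._dist_cmd)
            if obj.search(parent_uuid):
                return True
        return False
```

Whether the extra constructor arguments used in distribute() are needed here depends on what each distributor's search actually requires, which is part of what would have to be played with.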

@charlienegri charlienegri self-assigned this Apr 13, 2023
@charlienegri
Contributor

charlienegri commented Apr 14, 2023

I have a draft which is probably still very buggy and not pretty, but I would appreciate some feedback on the logic... /~https://github.com/metno/discovery-metadata-catalog-ingestor/tree/fix-issue-152
(And I have no idea how to test the search functions, which I have monkeypatched in the existing tests.)

@magnarem
Contributor

Did some tests in pycsw locally, and my assumption that the database in pycsw has a foreign-key relation to the primary key for child/parent datasets seems to be wrong.

So pycsw does not care if I ingest a child dataset when the parent is missing from the catalog.
This is OK, but it probably means that we will not need this search check in this issue.

Let's discuss this next week.

@charlienegri charlienegri mentioned this issue Apr 18, 2023
10 tasks