
Make sure that parent datasets are present in the same environment as all its children #152

Closed
3 tasks
mortenwh opened this issue Apr 11, 2023 · 14 comments · Fixed by #155

@mortenwh
Collaborator

mortenwh commented Apr 11, 2023

The metadata_identifier element in the MMD files has identifiers of the form naming_authority:uuid.

  • The parent-child relationships defined in child datasets must be updated to the correct naming_authority:uuid of the parent

  • Because of the way the discovery metadata databases are designed, both parent and child datasets should be in the same environment. Therefore, we need to update the naming_authority part of the ids pointing to parents, both for existing datasets and those coming into dmci. That is, DMCI must update the naming_authority in the same way as it does for the metadata_id, by adding "-dev" or "-staging".

  • Since the parents need to be present in the same environment, we also need to check that the referenced parent exists, and reject the MMD file if the referenced parent does not exist.
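The suffix rewrite described in the second point can be sketched roughly as follows. This is a minimal sketch, not the actual dmci code; the helper name is an assumption, and note that the thread below also discusses a "." separator (e.g. test.no.dev) rather than "-".

```python
# Minimal sketch of the environment-suffix rewrite described above.
# The function name and suffix handling are assumptions, not actual dmci code.
def add_env_suffix(identifier: str, env_suffix: str) -> str:
    """Insert the environment suffix (e.g. "-dev" or "-staging") into the
    naming_authority part of a naming_authority:uuid identifier."""
    naming_authority, uuid = identifier.split(":", 1)
    return f"{naming_authority}{env_suffix}:{uuid}"
```

For example, add_env_suffix("test.no:64db6102-14ce-41e9-b93b-61dbb2cb8b4e", "-dev") would give "test.no-dev:64db6102-14ce-41e9-b93b-61dbb2cb8b4e".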

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

I have an idea of how to implement the first part (the second point), because it's very close to what I did for the metadata_identifier:

Basically, we re.search for a block <mmd:related_dataset relation_type="parent">[namespace]:[uuid]</mmd:related_dataset>, and if there is one we replace the namespace inside it, whatever it is, with the same namespace plus .dev or .staging if in dev or staging,
so that it ends up, for example, as <mmd:related_dataset relation_type="parent">test.no.dev:64db6102-14ce-41e9-b93b-61dbb2cb8b4e</mmd:related_dataset> (sounds ok @magnarem ?)

But I am not sure how to go about the check that the parent is actually present in the database...
Would it require something like a curl query on CSW based on the uuid (like the one in the example Shamly wrote for the docs)?

@mortenwh
Collaborator Author

Yes, the last one is a bit tricky, especially since we have three metadata stores (although solr is not yet in place). I guess we'll need to search all three (two for now)? The different stores must also be searched in different ways, so we should implement a search function in each distributor, and ensure that every distributor has one. This could probably be handled like the run function in distributor.py:

def search(self, id):
    """The main search function to be implemented in each subclass."""
    raise NotImplementedError
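A subclass would then override this stub with a store-specific implementation. A hypothetical sketch of the pattern (the subclass name and archive layout are illustrative assumptions, not the actual distributors):

```python
# Hypothetical sketch: an abstract base with a search stub, plus a subclass
# providing a store-specific implementation. Class names and the archive
# structure are illustrative assumptions, not the actual dmci distributors.
class Distributor:
    def search(self, id):
        """The main search function to be implemented in each subclass."""
        raise NotImplementedError

class FileDist(Distributor):
    """A distributor backed by a simple uuid -> file path mapping."""
    def __init__(self, archive):
        self._archive = archive

    def search(self, id):
        """Return True if a dataset with this uuid exists in the archive."""
        return id in self._archive
```

The base class guarantees that calling search on a distributor that forgot to implement it fails loudly, which is what makes the tests catch a missing implementation.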

@magnarem
Contributor

@charlienegri, this sounds like a good implementation. Yes, the querying of the database can be done using the examples that Shamly wrote for the docs.

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

> Yes, the last one is a bit tricky, especially since we have three metadata stores (although solr is not yet in place). I guess we'll need to search all three (two for now)? The different stores must also be searched in different ways, so we should implement a search function in each distributor, and ensure that every distributor has one. This could probably be handled like the run function in distributor.py:
>
>     def search(self, id):
>         """The main search function to be implemented in each subclass."""
>         raise NotImplementedError

Ah, this means that the check will not be part of validate in worker.py like the namespace + landing page replacements, as I was thinking... Makes sense, but I am not sure I understand how

    def run(self):
        """The main run function to be implemented in each subclass."""
        raise NotImplementedError

is meant to work. I guess we can discuss it next week.

@shamlymajeed
Contributor

shamlymajeed commented Apr 12, 2023

In the code below from worker.py, we need to call obj.search() as well while validating:

if obj.is_valid():
    obj_status, obj_msg = obj.run()

Is it like that, @mortenwh, @charlienegri?

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

But we want to run the search before we validate, or better, validation is dependent on the search being successful (IF there is a <mmd:related_dataset relation_type="parent">[namespace]:[uuid]</mmd:related_dataset> block, that is), right?

@mortenwh
Collaborator Author

@charlienegri

> Ah, this means that the check will not be part of validate in worker.py like the namespace + landing page replacements, as I was thinking... Makes sense, but I am not sure I understand how

The search function must be implemented in distributor.py but it must be called from worker.py. So you check it in, e.g., the validate function.
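The flow could look roughly like this. A hypothetical sketch only: the function name, the distributors argument, and the return convention are assumptions about how worker.py would wire this up, not the actual implementation.

```python
# Hypothetical sketch of the flow described above: the worker extracts the
# parent identifier and asks each configured distributor whether the parent
# exists. The function name and distributor interface are assumptions.
import re

def check_parent_exists(data: bytes, distributors) -> tuple[bool, str]:
    """Reject the file if it references a parent no store can find."""
    match = re.search(
        b'<mmd:related_dataset relation_type="parent">(.+?)</mmd:related_dataset>',
        data,
    )
    if match is None:
        return True, "No parent reference, nothing to check"
    # The identifier is naming_authority:uuid; take the uuid part.
    parent_uuid = match.group(1).decode().split(":", 1)[-1]
    if any(dist.search(parent_uuid) for dist in distributors):
        return True, "Parent found"
    return False, f"Parent dataset {parent_uuid} not found in the catalog"
```

If the parent is missing, the False status and the message would be what validation returns to the user.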

@mortenwh
Collaborator Author

Sorry, I didn't read the full thread...

> But we want to run the search before we validate, or better, validation is dependent on the search being successful (IF there is a <mmd:related_dataset relation_type="parent">[namespace]:[uuid]</mmd:related_dataset> block, that is), right?

Yes, if there is that relation type, we need to check that the parent exists. If it doesn't, validation should fail with an appropriate error message to be returned to the user.

@charlienegri
Contributor

charlienegri commented Apr 12, 2023

Mmh, why implement the search in distributor.py then, at that point?
It needs to be based on the uuid in the <mmd:related_dataset relation_type="parent"> block, whose content we also need to swap if in dev or staging
(something like

match_parent_block = re.search(
    b'<mmd:related_dataset relation_type="parent">(.+?)</mmd:related_dataset>',
    data
)
found_parent_block = match_parent_block.group(1)
old_namespace, uuid = found_parent_block.decode().split(":")
new_namespace = f"{old_namespace}.{env_string}"
new_parent_block = f"{new_namespace}:{uuid}".encode()
data_mod = data.replace(found_parent_block, new_parent_block)

)
So we could define the search in the same place, in worker.py, similar to def _check_information_content(self, data), something like def _search_parent_uuid(self, parent_uuid)...
see e.g. a draft here (the function is dummy for now) /~https://github.com/metno/discovery-metadata-catalog-ingestor/blob/fix-issue-152/dmci/api/worker.py#L97

@mortenwh
Collaborator Author

> Mmh, why implement the search in distributor.py then, at that point?

When it is implemented in distributor.py as suggested above, we make sure that the tests fail if that function is not implemented for any distributor. So we also need to implement it in each distributor, but then as customized versions. For example, the file distributor needs a search function that is different from the search functions in the solr and pycsw distributors.

I agree that the environment customization can be done in worker.py but not the actual search function because that is specific to the distributors. I guess the search function could be called from within _check_information_content.
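To illustrate why search must be distributor-specific, here is a hypothetical sketch of two different implementations: a file-based one that looks in the archive on disk, and a pycsw-flavoured one that builds a standard CSW GetRecordById request. Class names, the archive layout, and the endpoint handling are assumptions, not the actual dmci distributors.

```python
# Hypothetical sketch: each distributor implements search in its own way.
# Class names, archive layout, and endpoint handling are assumptions.
from pathlib import Path
from urllib.parse import urlencode

class FileDist:
    """Search by looking for an archived MMD file named after the uuid."""
    def __init__(self, archive_dir):
        self._archive_dir = Path(archive_dir)

    def search(self, uuid):
        return any(self._archive_dir.rglob(f"{uuid}.xml"))

class PyCSWDist:
    """Search by querying the catalog with a CSW GetRecordById request."""
    def __init__(self, endpoint):
        self._endpoint = endpoint

    def _get_record_url(self, uuid):
        # GetRecordById is a standard CSW 2.0.2 operation; the actual
        # request would be sent with an HTTP client and the response parsed.
        query = urlencode({
            "service": "CSW",
            "version": "2.0.2",
            "request": "GetRecordById",
            "id": uuid,
        })
        return f"{self._endpoint}?{query}"
```

The worker can then treat all distributors uniformly through the shared search interface while each one talks to its own store.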

@mortenwh
Collaborator Author

Also, the search is basically a dataset search, so it is enough to call the function search. It can be called like this:

parent = obj.search(parent_uuid)

And I now see that maybe it should be implemented in the distributors' is_valid functions... Maybe you just have to play with it a bit. The main point is that the search method will differ between distributors.

@charlienegri
Contributor

charlienegri commented Apr 13, 2023

ok, I see, but if we define a

    def search(self, uuid):
        """The search function for a UUID to be implemented in each subclass."""
        # raise NotImplementedError
        return True

in distributor.py, then in order to be able to use obj.search(parent_uuid) in validate() (in worker.py) we need to initialize obj in a similar way to how it is done in distribute(), right? There it's

for dist in self._conf.call_distributors:
    if dist not in self.CALL_MAP:
        skipped.append(dist)
        continue
    obj = self.CALL_MAP[dist](
        self._dist_cmd,
        xml_file=self._dist_xml_file,
        metadata_id=self._dist_metadata_id,
        worker=self,
        path_to_parent_list=self._kwargs.get("path_to_parent_list", None),
    )

We need a simpler call here, but I guess some kind of duplication with distribute is inevitable, because it makes sense that the parent namespace swap would happen in validate, and thus the search too...
I wonder if

for dist in self.CALL_MAP:
    obj = self.CALL_MAP[dist](self._dist_cmd)

would be enough at that point
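The simplified instantiation wondered about above could be sketched like this. A hypothetical sketch only: the Worker shape, CALL_MAP contents, and the helper name _search_all are assumptions about how this would fit into worker.py.

```python
# Hypothetical sketch of the simplified loop: build one distributor per
# CALL_MAP entry with only the command argument, then ask each whether the
# parent uuid exists. Names and structure are assumptions, not actual code.
class Worker:
    CALL_MAP = {}  # dist name -> distributor class, as in the real worker

    def __init__(self, dist_cmd):
        self._dist_cmd = dist_cmd

    def _search_all(self, parent_uuid):
        """Return True if any configured distributor finds the parent."""
        for dist in self.CALL_MAP:
            obj = self.CALL_MAP[dist](self._dist_cmd)
            if obj.search(parent_uuid):
                return True
        return False
```

Whether the extra constructor arguments used in distribute() are needed here depends on what each distributor's search actually requires, which is part of what would have to be played with.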

@charlienegri charlienegri self-assigned this Apr 13, 2023
@charlienegri
Contributor

charlienegri commented Apr 14, 2023

I have a draft which is probably still very buggy and not pretty, but I would appreciate some feedback on the logic... /~https://github.com/metno/discovery-metadata-catalog-ingestor/tree/fix-issue-152
(And I have no idea how to test the search functions, which I have monkeypatched in the existing tests.)

@magnarem
Contributor

Did some tests in pycsw locally, and my assumption that the database in pycsw has a foreign-key relation to the primary key for child/parent datasets seems to be wrong.

So pycsw does not care if I ingest a child dataset when the parent is missing from the catalog.
This is OK, but it probably means that we will not need this search check in this issue.

Let's discuss this next week.

@charlienegri charlienegri mentioned this issue Apr 18, 2023
10 tasks