Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom id hashing on documentstore level #1910

Merged
merged 16 commits into from
Jan 3, 2022

Conversation

ArzelaAscoIi
Copy link
Member

@ArzelaAscoIi ArzelaAscoIi commented Dec 18, 2021

The ids of documents stored by the document stores are calculated using a hash of the contents or alternatively the hash of concatenated lists of strings. These lists of strings are passed by setting the argument id_hash_keys of a document. If the content of two documents is the same, the id is the same and therefore one overwrites the other. There are cases, when we want to keep these two documents, e.g. if they have different metadata.

For single documents, you can use here the parameter id_hash_keys. If you want to use this functionality on node level (document stores), then each document would receive the same id. It is therefore useful to calculate the id dynamically from the content, the metadata or other fields of the documents.

We should introduce a parameter that can be used to define, which of the attributes of a document should be used to generate the id.

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

@ArzelaAscoIi ArzelaAscoIi requested a review from tholor December 18, 2021 21:48
@ArzelaAscoIi ArzelaAscoIi marked this pull request as draft December 20, 2021 08:13
tholor
tholor previously requested changes Dec 20, 2021
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Added a few smaller comments. Please also adjust this test to reflect the new behavior: /~https://github.com/deepset-ai/haystack/blob/master/test/test_schema.py#L151

@ArzelaAscoIi ArzelaAscoIi marked this pull request as ready for review December 21, 2021 12:11
@ArzelaAscoIi ArzelaAscoIi requested a review from tholor December 21, 2021 12:16

allowed_hash_key_attributes = ["content", "content_type", "score", "meta", "embedding" ]
if not set(self.id_hash_keys) <= set(allowed_hash_key_attributes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ArzelaAscoIi
I saw mypy having problems with this line because self.id_hash_keys could be None. In that case a TypeError: 'NoneType' object is not iterable will be raised.
The mypy error is also in the CI tests:

haystack/schema.py:98: error: Argument 1 to "set" has incompatible type "Optional[List[str]]"; expected "Iterable[str]"
Found 1 error in 1 file (checked 115 source files)
Error: Process completed with exit code 1.

Let's check for id_hash_keys not being None

@julian-risch
Copy link
Member

@ArzelaAscoIi Related to my other comment, I found usages of id_hash_keys in the tests of the document stores, which need to be adapted.
For example, here:

id_hash_keys=["key1"]

if not set(self.id_hash_keys) <= set(allowed_hash_key_attributes):
raise ValueError(f"You passed custom strings {id_hash_keys} to id_hash_keys which is deprecated. Supply instead a list of Document's attribute names that the id should be based on (e.g. {allowed_hash_key_attributes}). See /~https://github.com/deepset-ai/haystack/pull/1910 for details)")
self.id_hash_keys = id_hash_keys
if self.id_hash_keys != None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think is not is preferred over != when comparing to None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a brief explanation of the reasons: https://stackoverflow.com/a/2209781

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

raise ValueError(f"You passed custom strings {id_hash_keys} to id_hash_keys which is deprecated. Supply instead a list of Document's attribute names that the id should be based on (e.g. {allowed_hash_key_attributes}). See /~https://github.com/deepset-ai/haystack/pull/1910 for details)")
self.id_hash_keys = id_hash_keys
if self.id_hash_keys != None:
if not set(self.id_hash_keys) <= set(allowed_hash_key_attributes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this line is still failing with mypy I'd say we put a # type: ignore here at the end of the line and maybe another comment explaining that mypy fails because it thinks set(None) might be called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I added a comment at that line :)

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for opening a separate issue on whether to never set id_hash_keys as an attribute of document.

@julian-risch julian-risch dismissed tholor’s stale review January 3, 2022 15:22

I am dismissing the change requests by @tholor as they are all addressed now. I did another review of the current state and we don't want to wait with merge due to dependencies in DC..

@ArzelaAscoIi ArzelaAscoIi merged commit 6e8e3c6 into master Jan 3, 2022
@ArzelaAscoIi ArzelaAscoIi deleted the pass_id_hash_keys_on_documentstore_level branch January 3, 2022 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants