-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Custom id hashing on documentstore level #1910
Custom id hashing on documentstore level #1910
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Added a few smaller comments. Please also adjust this test to reflect the new behavior: /~https://github.com/deepset-ai/haystack/blob/master/test/test_schema.py#L151
haystack/schema.py
Outdated
|
||
allowed_hash_key_attributes = ["content", "content_type", "score", "meta", "embedding" ] | ||
if not set(self.id_hash_keys) <= set(allowed_hash_key_attributes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ArzelaAscoIi
I saw mypy having problems with this line because self.id_hash_keys
could be None. In that case a TypeError: 'NoneType' object is not iterable
will be raised.
The mypy error is also in the CI tests:
haystack/schema.py:98: error: Argument 1 to "set" has incompatible type "Optional[List[str]]"; expected "Iterable[str]"
Found 1 error in 1 file (checked 115 source files)
Error: Process completed with exit code 1.
Let's check for id_hash_keys not being None
@ArzelaAscoIi Related to my other comment, I found usages of haystack/test/test_document_store.py Line 43 in a619fae
|
haystack/schema.py
Outdated
if not set(self.id_hash_keys) <= set(allowed_hash_key_attributes): | ||
raise ValueError(f"You passed custom strings {id_hash_keys} to id_hash_keys which is deprecated. Supply instead a list of Document's attribute names that the id should be based on (e.g. {allowed_hash_key_attributes}). See /~https://github.com/deepset-ai/haystack/pull/1910 for details)") | ||
self.id_hash_keys = id_hash_keys | ||
if self.id_hash_keys != None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think is not
is preferred over !=
when comparing to None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a brief explanation of the reasons: https://stackoverflow.com/a/2209781
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
haystack/schema.py
Outdated
raise ValueError(f"You passed custom strings {id_hash_keys} to id_hash_keys which is deprecated. Supply instead a list of Document's attribute names that the id should be based on (e.g. {allowed_hash_key_attributes}). See /~https://github.com/deepset-ai/haystack/pull/1910 for details)") | ||
self.id_hash_keys = id_hash_keys | ||
if self.id_hash_keys != None: | ||
if not set(self.id_hash_keys) <= set(allowed_hash_key_attributes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this line is still failing with mypy I'd say we put a # type: ignore
here at the end of the line and maybe another comment explaining that mypy fails because it thinks set(None) might be called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I added a comment at that line :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks for opening a separate issue on whether to never set id_hash_keys
as an attribute of document.
I am dismissing the change requests by @tholor as they are all addressed now. I did another review of the current state and we don't want to wait with merge due to dependencies in DC..
The ids of documents stored by the document stores are calculated using a hash of the contents or alternatively the hash of concatenated lists of strings. These lists of strings are passed by setting the argument
id_hash_keys
of a document. If the content of two documents is the same, the id is the same and therefore one overwrites the other. There are cases, when we want to keep these two documents, e.g. if they have different metadata.For single documents, you can use here the parameter
id_hash_keys
. If you want to use this functionality on node level (document stores), then each document would receive the same id. It is therefore useful to calculate the id dynamically from the content, the metadata or other fields of the documents.We should introduce a parameter that can be used to define, which of the attributes of a document should be used to generate the id.
Status (please check what you already did):