-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed the Search Field mapping in ElasticSearch DocumentStore #2080
Fixed the Search Field mapping in ElasticSearch DocumentStore #2080
Conversation
if self.synonyms: | ||
mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"} | ||
mapping["settings"]["analysis"]["analyzer"]["synonym"] = {"tokenizer": "whitespace", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also set the search_fields types to text if synonyms have been provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to add the synonym analyzer for the search field as well if the synonyms are passed @tstadel? Have added the changes still
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, functionally that's ok. However now we're passing always the same dictionary as field mapping. This can be confusing if later on someone tries to change the mapping for one field only but ends up with changing all self.search_fields
mappings.
So here's what I recommend:
- get rid of the extra handling for self.content_field in
self.content_field: {"type": "text"}, mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"} - instead of line 314 loop over
self.search_fields
and add the mapping with analyzer - add an else block after synonyms handling: set the mapping without analyzer for all
self.search_fields
- and there's another thing missing I totally forgot: we have to do the same for
OpenSearchDocumentStore
, so copy anything from Line 280 till the end of the else block of the last bullet point and replacehaystack/haystack/document_stores/elasticsearch.py
Lines 1361 to 1384 in 51d15e3
mapping = { "mappings": { "properties": { self.name_field: {"type": "keyword"}, self.content_field: {"type": "text"}, }, "dynamic_templates": [ { "strings": { "path_match": "*", "match_mapping_type": "string", "mapping": {"type": "keyword"}}} ], }, "settings": { "analysis": { "analyzer": { "default": { "type": self.analyzer, } } } } }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me :-)
Please add the missing synonyms support and we are good to merge.
And of course tests need to pass :-)
@SjSnowball, the tests failing seems not to be a code issue. It seems like code changes do not take effect after a cache hit in CI. I had the same behavior in PR #2013. I pseudo changed setup.py by inserting a blank to work around the cache and it worked again. @ZanSara, could that be connected to the new setup mechanism? |
if self.synonyms: | ||
mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"} | ||
mapping["settings"]["analysis"]["analyzer"]["synonym"] = {"tokenizer": "whitespace", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, functionally that's ok. However now we're passing always the same dictionary as field mapping. This can be confusing if later on someone tries to change the mapping for one field only but ends up with changing all self.search_fields
mappings.
So here's what I recommend:
- get rid of the extra handling for self.content_field in
self.content_field: {"type": "text"}, mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"} - instead of line 314 loop over
self.search_fields
and add the mapping with analyzer - add an else block after synonyms handling: set the mapping without analyzer for all
self.search_fields
- and there's another thing missing I totally forgot: we have to do the same for
OpenSearchDocumentStore
, so copy anything from Line 280 till the end of the else block of the last bullet point and replacehaystack/haystack/document_stores/elasticsearch.py
Lines 1361 to 1384 in 51d15e3
mapping = { "mappings": { "properties": { self.name_field: {"type": "keyword"}, self.content_field: {"type": "text"}, }, "dynamic_templates": [ { "strings": { "path_match": "*", "match_mapping_type": "string", "mapping": {"type": "keyword"}}} ], }, "settings": { "analysis": { "analyzer": { "default": { "type": self.analyzer, } } } } }
@tstadel You are right, mappings looks confusing.
If you check the test snippet I have added, I used different fields in the search_fields and content_fields.
Yes, I have added the search_field mapping as you mentioned.
I replaced the else block. But I could see the changes are removing the content_field mapping. These changes will cause the same issue which I mentioned above right |
yes you're right with the content_field. Even if it unnecessarily increases the index size if it's not used as search_field, it should be searchable in general. That's a reasonable convention as you might want to include it into search_fields. And that should be possible without recreating the index, as content_field is naturally the field that dense retrieval happens on. So sorry for my confusion: you can revert the deletion of line 284 and line 314. That's cleaner than the check for its inclusion in searchfields. Though you might want to add a comment that content_field should always be searchable even if it currently isn't part of search_fields. Please also copy over the synonyms block to OpenSearchDocumentStore. |
@tstadel Understood. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Added the fix for the #2055 issue.
Proposed changes:
Once the search fields are set in the Elasticsearchdocument store initializing, those fields should be mapped as "text" fields while indexing.
Added the code which will take the search fields and add the type as "text". This matches the checks added in "_create_document_index" and makes the search fields full-text searchable.
Status (please check what you already did):