Fix loading of tokenizers in DPR #2755

Merged
merged 1 commit into from
Jul 4, 2022
bogdankostic
Contributor

@bogdankostic bogdankostic commented Jul 4, 2022

Related Issue(s): #2711

Proposed changes:
This PR resets the default tokenizer classes for DPR from AutoTokenizer to DPRQuestionEncoderTokenizer and DPRContextEncoderTokenizer for the question and context encoders, respectively. This is needed because some models, for example the Spanish DPR model IIC/dpr-spanish-passage_encoder-allqa-base, set tokenizer_class to "DPRContextEncoderTokenizer", which cannot be loaded through AutoTokenizer: the class is missing from TOKENIZER_MAPPING_NAMES in transformers, resulting in the error:
Tokenizer class DPRContextEncoderTokenizer does not exist or is not currently imported.
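
To illustrate the failure mode, here is a minimal toy sketch of name-based tokenizer resolution versus an explicit default class. It is not the transformers or Haystack implementation: `KNOWN_TOKENIZERS`, `load_tokenizer`, and the stand-in classes are hypothetical names made up for this example, and the registry only mimics the role that TOKENIZER_MAPPING_NAMES plays.

```python
# Toy model only: these classes and this registry are hypothetical stand-ins,
# not the real transformers objects.

class BertTokenizer:
    """Stand-in for a tokenizer class the name lookup knows about."""


class DPRContextEncoderTokenizer:
    """Stand-in for a class that exists but is absent from the name mapping."""


# Hypothetical registry, playing the role of TOKENIZER_MAPPING_NAMES.
KNOWN_TOKENIZERS = {"BertTokenizer": BertTokenizer}


def load_tokenizer(model_config, tokenizer_class=None):
    """Return a tokenizer instance for a model config (a plain dict here).

    If an explicit class is passed, it is used directly and the name lookup
    is skipped. Otherwise the class named in the config is resolved through
    the registry, which fails for names the registry does not contain.
    """
    if tokenizer_class is not None:
        return tokenizer_class()
    name = model_config["tokenizer_class"]
    if name not in KNOWN_TOKENIZERS:
        raise ValueError(
            f"Tokenizer class {name} does not exist or is not currently imported."
        )
    return KNOWN_TOKENIZERS[name]()
```

With a config of `{"tokenizer_class": "DPRContextEncoderTokenizer"}`, calling `load_tokenizer(config)` raises the error, while `load_tokenizer(config, tokenizer_class=DPRContextEncoderTokenizer)` succeeds. In the real library, the analogous move would be calling `DPRContextEncoderTokenizer.from_pretrained(...)` directly instead of `AutoTokenizer.from_pretrained(...)`.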

Member

@julian-risch julian-risch left a comment

The changes look good to me. 👍 Let's wait for the tests before merging.

@bogdankostic bogdankostic merged commit dc48c44 into master Jul 4, 2022
@bogdankostic bogdankostic deleted the fix_dpr_tokenizer branch July 4, 2022 16:18
Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022