Refactor DPR from FB to Transformers codebase #308
Conversation
Added a suggestion to make the code simpler.
There is still some very custom code in there that should be changed.
I just made some minor suggestions (might be a bit nitpicky, but let's strive for the best code possible).
Let's move the HF DPR classes to dpr_utils and remove the old FB code, and then this looks very much like we can merge! :)
Looking good. Small changes required. Mostly naming and docstrings.
Hey, a minor thing I noticed that has happened to others here before. It's not a big thing, but let's try to avoid it in the future:
Added some thoughts on loading the models, with potential for improvement. Happy to start a discussion here; I might not be aware of all the necessary details.
Some smaller changes required. I am also not sure yet about the handling of titles when some docs have them and some don't.
Looking good now! I only updated one docstring slightly.
Breaking changes:

- Tokenization: `DPRQuestionEncoderTokenizer` and `DPRContextEncoderTokenizer` replace the old FB tokenization (example calls: see the sketch after this list)
- signature: `_tensorizer(tokenizer, title: Optional[List[str]], text: List[str], add_special_tokens: bool = True)`
- Encoders: `DPRQuestionEncoder` and `DPRContextEncoder` replace the old FB encoders, modified to include a `titles_mask` parameter for handling batches with some missing titles (example calls: see the sketch after this list)
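For illustration, a minimal sketch of typical calls to the new Hugging Face DPR classes; the model names and input texts are illustrative assumptions, not code taken from this PR:

```python
# Sketch: typical calls to the HF DPR classes (transformers >= 4.x,
# where model outputs expose .pooler_output). Model names and texts
# are illustrative only.
import torch
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base"
)
q_encoder = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base"
)
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base"
)
ctx_encoder = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base"
)

with torch.no_grad():
    # Question side: plain text in, one fixed-size embedding per query out.
    q_inputs = q_tokenizer(["who wrote hamlet?"], return_tensors="pt", padding=True)
    q_emb = q_encoder(**q_inputs).pooler_output  # shape: (batch_size, 768)

    # Passage side: title and text are passed as a sentence pair so the
    # tokenizer joins them with [SEP], matching the DPR passage format.
    ctx_inputs = ctx_tokenizer(
        text=["Hamlet"],                                      # titles
        text_pair=["Hamlet is a tragedy by William Shakespeare."],
        return_tensors="pt",
        padding=True,
    )
    ctx_emb = ctx_encoder(**ctx_inputs).pooler_output
```

And a hypothetical body for a `_tensorizer`-style wrapper matching the signature above; the actual implementation in this PR may differ:

```python
def _tensorizer(tokenizer, title, text, add_special_tokens=True):
    # Hypothetical: encode (title, text) pairs when titles are given,
    # otherwise encode the texts alone.
    return tokenizer(
        text=title if title else text,
        text_pair=text if title else None,
        add_special_tokens=add_special_tokens,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
```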
Title inclusion in passages:

- `embed_title == True`:
  - No titles in batch: no `"name"` key in any `Document` in the batch, or no `meta` attribute => `titles: List["", "", ..]` and `titles_mask: List[0, 0, ..]`
  - Some titles in batch: some `Document`s have a `"name"` key and value => `titles: List["", title1, ..]` and `titles_mask: List[0, 1, ..]`
  - All samples have titles: every `Document` has a `"name"` key and value => `titles: List[title1, title2, ..]` and `titles_mask: List[1, 1, ..]`
- `embed_title == False`: => `titles: List["", "", ..]` and `titles_mask: List[0, 0, ..]`
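To make the cases above concrete, here is a small hypothetical helper that derives `titles` and `titles_mask` from a batch; the simplified dict-based `Document` shape is an assumption, not Haystack's actual `Document` class:

```python
from typing import List, Tuple


def _build_titles_and_mask(
    docs: List[dict], embed_title: bool = True
) -> Tuple[List[str], List[int]]:
    """Derive titles and titles_mask per the rules above.

    Each doc is assumed to carry its title under meta["name"]; this
    simplified dict shape is an assumption for illustration only.
    """
    titles: List[str] = []
    titles_mask: List[int] = []
    for doc in docs:
        name = (doc.get("meta") or {}).get("name") if embed_title else None
        if name:
            titles.append(name)   # title present and non-empty
            titles_mask.append(1)
        else:
            titles.append("")     # missing title: empty string, mask 0
            titles_mask.append(0)
    return titles, titles_mask


# Example: a batch where only the second doc has a title.
docs = [{"meta": {}}, {"meta": {"name": "Hamlet"}}]
print(_build_titles_and_mask(docs))         # (['', 'Hamlet'], [0, 1])
print(_build_titles_and_mask(docs, False))  # (['', ''], [0, 0])
```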
TO-DO:

- `_generate_batch_predictions` to reflect DPREncoderOutput
- `_prepare_model` function to load the saved checkpoint into DPREncoder models
- Test performance on top_k=20 retrieved documents on nq_dev.json
- `DPRQuestionEncoder` and `DPRContextEncoder` and modelhub models
- `embed_passages`