Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor DPR from FB to Transformers codebase #308

Merged
merged 42 commits into from
Aug 25, 2020

Conversation

kolk
Copy link
Contributor

@kolk kolk commented Aug 12, 2020

Breaking changes:

  • DPR instantiation:
    signature:
DensePassageRetriever( document_store: BaseDocumentStore,
                query_embedding_model: str,
                passage_embedding_model: str,
                max_seq_len: int = 256,
                projection_dim: int = 0,
                use_gpu: bool = True,
                batch_size: int = 16,
                do_lower_case: bool = False,
                use_amp: str = None)

example call:

dpr_object = DensePassageRetriever(document_store=document_store, query_embedding_model="facebook/dpr-question_encoder-single-nq-base", passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base", use_gpu=True)
  • tokenizers: Query and passage tokenizers instantiated from DPRQuestionEncoderTokenizer and DPRContextEncoderTokenizer

example call:

query_config = DPRConfig(projection_dim=0, config="facebook/dpr-question_encoder-single-nq-base", dropout=0.0)
query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base"
, config=query_config)
  • tensorizer:
    signature
    _tensorizer(tokenizer, title: Optional[List[str]], text: List[str], add_special_tokens: bool = True)

example call:

 token_ids_batch, token_seg_batch, token_attn_mask = dpr_instance._tensorizer(tokenizer, text=ctx_text, title=ctx_title)
  • Encoders: Query and passage encoders instantiated from DPRQuestionEncoder and DPRContextEncoder
    example call:
query_config = DPRConfig(projection_dim=0, config="facebook/dpr-question_encoder-single-nq-base", dropout=0.0)
query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base", config=query_config)
  • generate_batch_predictions:
    Modified to include titles_mask parameter for handling batches with some missing titles
    example call:
dpr_object._generate_batch_predictions(texts,  encoder_model, tokenizer, titles, titles_mask, batch_size) 

Title inclusion in passages:
embed_title == True:
- No titles in batch: no "name" key in any Document in batch or no meta attribute => titles: List["", "", ..] and titles_mask: List[0, 0, ..]
- Some titles in batch: some Document has "name" key and value => titles: List["", title1, ..] and titles_mask: List[0, 1, ..]
- all samples has titles: all Document has "name" key and value => titles: List[title1, title2, ..] and titles_mask: List[1, 1, ..]

embed_title == False: => titles: List["", "", ..] and titles_mask: List[0, 0, ..]

TO-DO:

  • Change DPR query and passage encoders from official FB HFBertEncoder to transformers DPREncoders.
  • Modify _generate_batch_predictions to reflect DPREncoderOutput
  • Modify _prepare_model function to load the saved checkpoint into DPREncoder models
  • Remove BertTensorizer
  • Refactor DPR model download and saved state parameters
    Test performance on top_k=20 retrieved documents on nq_dev.json
  1. Official FB encoder modules
Retriever Recall: 0.9345536176826109
Retriever Mean Avg Precision: 0.7782416555082636
  1. transformers DPRQuestionEncoder and DPRContextEncoder and modelhub models
Retriever Recall: 0.9345536176826109
Retriever Mean Avg Precision: 0.7782502895956392
  • Modify Tutorial-5 eval scripts
  • Include transformers DPR modules in dpr_utils until it is supported by transformers==3.0.2
  • Update Tutorial-5 and Tutoriacl-6 ipython notebooks
  • Handle title use-cases for embed_passages

@kolk kolk requested a review from Timoeller August 12, 2020 10:21
Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a suggestion to make the code simpler.

@kolk kolk requested a review from Timoeller August 12, 2020 15:07
Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still some very custom code in there that should be changed

@kolk kolk requested a review from Timoeller August 13, 2020 10:43
Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made some minor suggestions (might be a bit nit picky, but lets strive for best code possible)

Lets move the HF DPR classes to dpr_utils, remove the old fb code and then this looks very much like we can merge! : )

@kolk kolk requested a review from Timoeller August 14, 2020 13:51
@kolk kolk requested review from tholor and Timoeller and removed request for Timoeller August 17, 2020 15:23
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Small changes required. Mostly naming and docstrings.

@Timoeller
Copy link
Contributor

Hey, a minor thing I noticed that happened to others here before. Its not a big thing but lets try to avoid this in the future:
I have seen you also commited the output of tutorial notebooks 5+6. Please remove the cell output before commiting since it pollutes the changes a lot.

Copy link
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some thoughts on loading the models with potential for improvement. Happy to start a discussion here, I might not be aware of all necessary details.

@kolk kolk requested a review from Timoeller August 20, 2020 13:26
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some smaller changes required. I am also not sure yet about the handling of titles when some docs have it and some don't.

@kolk kolk requested a review from tholor August 25, 2020 08:57
Copy link
Member

@tholor tholor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good now! I only updated one docstring slightly.

@kolk kolk merged commit f2b6cc7 into master Aug 25, 2020
@tholor tholor mentioned this pull request Aug 31, 2020
@julian-risch julian-risch deleted the swtich_dpr_encoder_to_transformers branch November 15, 2021 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants