
Add Tapas reader with scores #1997

Merged
merged 15 commits into from
Jan 31, 2022
Conversation

bogdankostic
Contributor

@bogdankostic bogdankostic commented Jan 12, 2022

This PR adds support for the table reader models described in the paper 'Open Domain Question Answering over Tables via Dense Retrieval' by Herzig et al. (2021). Their advantage over the existing TAPAS table reader models is that they provide meaningful answer scores, so the resulting list of answers can be sorted by confidence. Keep in mind that the answers are sorted first by a general table score (stored in the answer's meta field) and then by the answer span score.
One disadvantage is that these models cannot answer questions that require aggregation over multiple cells.

The original TensorFlow model checkpoints (available here) have been converted to PyTorch and uploaded to the Hugging Face model hub (here and here). The conversion script can be found here (in a fork of the transformers repository).

@bogdankostic bogdankostic marked this pull request as ready for review January 17, 2022 09:36
@bogdankostic bogdankostic requested a review from tholor January 17, 2022 09:36
Member

@tholor tholor left a comment

Looking good! Left a few comments. Still a bit unsure if the current implementation of scores makes sense or if we can do a better "first version" here 🤔

Once those points are resolved, we also need to update our documentation here: https://haystack.deepset.ai/guides/table-qa

```python
# Sort answers by score and select top-k answers
if isinstance(self.model, self.TapasForScoredQA):
    # Answers are sorted first by the general score for the tables and then by the answer span score
    answers = sorted(answers, reverse=True, key=lambda ans: (ans.meta["table_score"], ans.score))  # type: ignore
```
Member

So in the sorted list of Answers that the reader returns, the first Answer.score might be lower than the second one? If yes, that might be confusing to show to end users ("Why does the first result have a lower score?"). Maybe we can combine table_score and answer_score somehow?


```python
# Get general table score
table_score = self.model.classifier(outputs.pooler_output)
table_score = table_score[0][1] - table_score[0][0]
```
Member

What are the two elements that we subtract here? Can you add a comment? Is this score already scaled to [0, 1]?

Contributor Author

During training, logit 1 is used for positive tables (i.e. tables that contain the answer) and logit 0 for negative tables. The score is not scaled to [0, 1]. Using softmax does not really help, as logit 0 is almost always much higher than logit 1.
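To illustrate the point with made-up logits (a sketch with hypothetical values, not the model's actual outputs): the logit difference still ranks tables correctly even when the negative logit dominates, while the softmax probability gets squashed towards 0 and is hard to compare across tables.

```python
import math

def table_score_from_logits(neg_logit: float, pos_logit: float) -> float:
    # Difference of the two classifier logits: higher means
    # "table more likely contains the answer". Unbounded, not in [0, 1].
    return pos_logit - neg_logit

def softmax_pos_prob(neg_logit: float, pos_logit: float) -> float:
    # Softmax alternative: probability mass assigned to the positive logit.
    e_pos = math.exp(pos_logit)
    e_neg = math.exp(neg_logit)
    return e_pos / (e_pos + e_neg)

# Hypothetical logits where logit 0 (negative) dominates, as described above:
print(table_score_from_logits(8.0, 2.5))  # -5.5: still a usable ranking signal
print(softmax_pos_prob(8.0, 2.5))         # ~0.004: squashed near zero
```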

```python
# Calculate score for each possible span
span_logits = (torch.einsum("bsj,j->bs", concatenated_logit_tensors, self.model.span_output_weights)
               + self.model.span_output_bias)
span_logits_softmax = torch.nn.functional.softmax(span_logits, dim=1)
```
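Run on toy shapes (the tensor names and sizes here are made up for illustration), the einsum projects each span's concatenated logits to a scalar, and the softmax over `dim=1` normalizes within one table, so each table's span probabilities sum to 1 independently:

```python
import torch

# Toy stand-ins for the real tensors: 2 tables, 3 candidate spans, hidden size 4.
batch, spans, hidden = 2, 3, 4
concatenated = torch.randn(batch, spans, hidden)
weights = torch.randn(hidden)   # plays the role of span_output_weights
bias = torch.zeros([])          # plays the role of span_output_bias

# "bsj,j->bs": dot product of each span representation with the weight vector.
span_logits = torch.einsum("bsj,j->bs", concatenated, weights) + bias
probs = torch.nn.functional.softmax(span_logits, dim=1)

print(tuple(probs.shape))    # (2, 3)
print(probs.sum(dim=1))      # each row sums to ~1: softmax is per table, not global
```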
Member

In our earlier discussion, I thought we take the softmax over all answers. Here it's rather the softmax over the answers from a single table. IMO this only makes sense for the final Answer.score if we also include the table_score somehow. Otherwise, we might see high scores for "bad tables" that have a single "decent answer candidate" but no other good candidates. Have you tried this on a few examples? Did the ordering of results and scores make sense?

@julian-risch What do you think?

Member

Yes, Bogdan and I discussed that topic. We agreed (correct me if I am wrong @bogdankostic) that the table_score is more important for ranking than the span_score. Unfortunately, the scores of spans from different tables cannot be compared. To prevent multiple answers from having exactly the same score (if we used the table_score alone), we want to show the span_score to the user in addition to the table_score.
The scenario you describe, a high score for a "bad table" that has a single "decent answer candidate" but no other good candidates, should be ranked low based on the table_score.
If we combine table_score and span_score, we would need to ensure that sorting by the combined score orders results first by table_score and then by span_score: no matter how large the span_score of a result, it should be ranked below any other result with a larger table_score.
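The lexicographic ordering described above is what tuple sort keys give for free; a small sketch with hypothetical score values:

```python
# Hypothetical answers, just to illustrate the lexicographic ordering:
answers = [
    {"text": "a", "table_score": 0.2, "span_score": 0.99},
    {"text": "b", "table_score": 1.5, "span_score": 0.10},
    {"text": "c", "table_score": 1.5, "span_score": 0.40},
]

# Tuples compare element by element, so table_score always wins;
# span_score only breaks ties between answers from the same table.
ranked = sorted(answers, reverse=True,
                key=lambda ans: (ans["table_score"], ans["span_score"]))

print([ans["text"] for ans in ranked])  # ['c', 'b', 'a']
```

Note that "a" is ranked last despite having the highest span_score, because its table_score is lowest.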

Member

Yeah, I think the sorting is alright, but imagine TableQA in our current demo UI (or the DC Search UI). As a user, you will see the answers one after the other. Each of them has a single "relevance score" in the UI. What would you display there right now? The current Answer.score would be misleading here as score(result_rank_1) < score(result_rank_2).

How about adding both individual scores to meta and using a combined one in Answer.score?
Something in the direction of score = expit( (table_score * 100 + span_score) / 100) (calibrating the constants there to something that results in a meaningful range between 0-1)
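A sketch of that suggestion with the uncalibrated placeholder constants from the comment above (the *100 weighting makes table_score dominate, so span_score mostly acts as a tie-breaker, and expit squashes the result into (0, 1)):

```python
import math

def expit(x: float) -> float:
    # Logistic sigmoid: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def combined_score(table_score: float, span_score: float) -> float:
    # Placeholder constants as proposed above; they would need calibration
    # so that typical scores land in a meaningful range.
    return expit((table_score * 100 + span_score) / 100)

# A better table wins even against a much higher span_score:
print(combined_score(1.5, 0.10))  # ~0.82
print(combined_score(0.2, 0.99))  # ~0.55
```

One caveat: unlike the tuple sort, this combination is only approximately lexicographic; a span_score difference larger than 100x a table_score difference could still flip the order.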

```python
self.span_output_bias = torch.nn.Parameter(torch.zeros([]))

# table scoring head
self.classifier = torch.nn.Linear(config.hidden_size, 2)
```
Member

Nice - looks very lean! If I get it right, we are still using the from_pretrained() model of the parent class (regular tapas) to load the weights for these extra layers?
I guess the tricky part in the conversion was to get the naming of layers right (during export and now import) to make sure we load the right weights here?

Contributor Author

Yes, exactly.

@CLAassistant

CLAassistant commented Jan 20, 2022

CLA assistant check
All committers have signed the CLA.

@bogdankostic
Contributor Author

@tholor
The output of the table scoring head consists of two logits: one is used for positive tables and the other for negative tables. Because the number of negative samples used for the original model checkpoints was approx. 10x larger than the number of positive samples, the table scoring head was heavily biased towards predicting that a table does not contain the answer to a question.
To solve this problem, I retrained the table scoring head while freezing all other parameters of the models. As training data, I used the NQTables dataset; for each positive sample, I sampled one random table as a negative table. I uploaded the adapted models to the model hub. (The original model checkpoints can still be accessed using revision="original".)
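The freezing step can be sketched like this. `ToyScoredQA` is a made-up stand-in module (the real `TapasForScoredQA` is in the PR); the pattern is the standard PyTorch one of disabling `requires_grad` on everything except the head being retrained:

```python
import torch

class ToyScoredQA(torch.nn.Module):
    """Toy stand-in: an encoder body plus the table scoring head."""
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.encoder = torch.nn.Linear(hidden, hidden)  # stands in for the TAPAS body
        self.classifier = torch.nn.Linear(hidden, 2)    # table scoring head

model = ToyScoredQA()

# Freeze everything except the table scoring head:
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.weight', 'classifier.bias']
```

An optimizer built from `(p for p in model.parameters() if p.requires_grad)` would then update only the head.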

@bogdankostic bogdankostic requested a review from tholor January 23, 2022 23:14
@bogdankostic bogdankostic merged commit bbb65a1 into master Jan 31, 2022
@bogdankostic bogdankostic deleted the add_tapas_with_scores branch January 31, 2022 09:23