
Add Tapas reader with scores #1997

Merged
merged 15 commits into from
Jan 31, 2022
Conversation

bogdankostic
Contributor

@bogdankostic bogdankostic commented Jan 12, 2022

This PR adds support for the table reader models described in the paper 'Open Domain Question Answering over Tables via Dense Retrieval' by Herzig et al. (2021). Their advantage over the existing TAPAS table reader models is that they provide meaningful answer scores, so the resulting list of answers can be sorted by confidence. Keep in mind that the answers are sorted first by a general table score (stored in the answer's meta field) and then by the answer span score.
One disadvantage is that these models cannot answer questions that require aggregation over multiple cells.

The original TensorFlow model checkpoints (available here) have been converted to PyTorch and uploaded to the Hugging Face model hub (here and here). The conversion script can be found here (in a fork of the transformers repository).

@bogdankostic bogdankostic marked this pull request as ready for review January 17, 2022 09:36
@bogdankostic bogdankostic requested a review from tholor January 17, 2022 09:36
Member

@tholor tholor left a comment

Looking good! Left a few comments. Still a bit unsure if the current implementation of scores makes sense or if we can do a better "first version" here 🤔

Once those points are resolved, we also need to update our documentation here: https://haystack.deepset.ai/guides/table-qa

```python
# Sort answers by score and select top-k answers
if isinstance(self.model, self.TapasForScoredQA):
    # Answers are sorted first by the general score for the tables and then by the answer span score
    answers = sorted(answers, reverse=True, key=lambda ans: (ans.meta["table_score"], ans.score))  # type: ignore
```
Member

So in the sorted list of Answers that the reader returns, the first Answer.score might be lower than the second one? If yes, that might be confusing to show to end users ("Why does the first result have a lower score?"). Maybe we can combine table_score and answer_score somehow?


```python
# Get general table score
table_score = self.model.classifier(outputs.pooler_output)
table_score = table_score[0][1] - table_score[0][0]
```
Member

What are the two elements that we subtract here? Can you add a comment? Is this score already scaled to [0, 1]?

Contributor Author

During training, logit 1 is used for positive tables (i.e. tables that contain the answer) and logit 0 for negative tables. The score is not scaled to [0, 1]. Using softmax does not really help, as logit 0 is almost always much higher than logit 1.
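To illustrate the point with made-up logits (a sketch with hypothetical values, not the model's actual outputs): the logit difference still ranks tables correctly even when the negative logit dominates, while the softmax probability gets squashed towards 0 and is hard to compare across tables.

```python
import math

def table_score_from_logits(neg_logit: float, pos_logit: float) -> float:
    # Difference of the two classifier logits: higher means
    # "table more likely contains the answer". Unbounded, not in [0, 1].
    return pos_logit - neg_logit

def softmax_pos_prob(neg_logit: float, pos_logit: float) -> float:
    # Softmax alternative: probability mass assigned to the positive logit.
    e_pos = math.exp(pos_logit)
    e_neg = math.exp(neg_logit)
    return e_pos / (e_pos + e_neg)

# Hypothetical logits where logit 0 (negative) dominates, as described above:
print(table_score_from_logits(8.0, 2.5))  # -5.5: still a usable ranking signal
print(softmax_pos_prob(8.0, 2.5))         # ~0.004: squashed near zero
```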

```python
# Calculate score for each possible span
span_logits = (torch.einsum("bsj,j->bs", concatenated_logit_tensors, self.model.span_output_weights)
               + self.model.span_output_bias)
span_logits_softmax = torch.nn.functional.softmax(span_logits, dim=1)
```
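Run on toy shapes (the tensor names and sizes here are made up for illustration), the einsum projects each span's concatenated logits to a scalar, and the softmax over `dim=1` normalizes within one table, so each table's span probabilities sum to 1 independently:

```python
import torch

# Toy stand-ins for the real tensors: 2 tables, 3 candidate spans, hidden size 4.
batch, spans, hidden = 2, 3, 4
concatenated = torch.randn(batch, spans, hidden)
weights = torch.randn(hidden)   # plays the role of span_output_weights
bias = torch.zeros([])          # plays the role of span_output_bias

# "bsj,j->bs": dot product of each span representation with the weight vector.
span_logits = torch.einsum("bsj,j->bs", concatenated, weights) + bias
probs = torch.nn.functional.softmax(span_logits, dim=1)

print(tuple(probs.shape))    # (2, 3)
print(probs.sum(dim=1))      # each row sums to ~1: softmax is per table, not global
```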
Member

In our earlier discussion, I thought we take the softmax over all answers. Here it's rather the softmax over the answers from a single table. IMO this only makes sense for the final Answer.score if we also include the table_score somehow. Otherwise, we might see high scores for "bad tables" that have a single "decent answer candidate" but no other good candidates. Have you tried this on a few examples? Did the ordering of results and scores make sense?

@julian-risch What do you think?

Member

Yes, Bogdan and I discussed that topic. We agreed (correct me if I am wrong @bogdankostic) that the table_score is more important for ranking than the span_score. Unfortunately, the scores of spans from different tables cannot be compared. To prevent multiple answers from having exactly the same score (if we used the table_score alone), we want to show the span_score to the user in addition to the table_score.
The scenario you describe, a high score for a "bad table" that has a single "decent answer candidate" but no other good candidates, should be ranked low based on the table_score.
If we combine table_score and span_score, we would need to ensure that sorting by the combined score orders results first by table_score and then by span_score: no matter how large the span_score of a result, it should be ranked below any other result with a larger table_score.
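The lexicographic ordering described above is what tuple sort keys give for free; a small sketch with hypothetical score values:

```python
# Hypothetical answers, just to illustrate the lexicographic ordering:
answers = [
    {"text": "a", "table_score": 0.2, "span_score": 0.99},
    {"text": "b", "table_score": 1.5, "span_score": 0.10},
    {"text": "c", "table_score": 1.5, "span_score": 0.40},
]

# Tuples compare element by element, so table_score always wins;
# span_score only breaks ties between answers from the same table.
ranked = sorted(answers, reverse=True,
                key=lambda ans: (ans["table_score"], ans["span_score"]))

print([ans["text"] for ans in ranked])  # ['c', 'b', 'a']
```

Note that "a" is ranked last despite having the highest span_score, because its table_score is lowest.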

Member

Yeah, I think the sorting is alright, but imagine TableQA in our current demo UI (or the DC Search UI). As a user, you will see the answers one after the other. Each of them has a single "relevance score" in the UI. What would you display there right now? The current Answer.score would be misleading here as score(result_rank_1) < score(result_rank_2).

How about adding both individual scores to meta and using a combined one in Answer.score?
Something in the direction of score = expit( (table_score * 100 + span_score) / 100) (calibrating the constants there to something that results in a meaningful range between 0-1)
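A sketch of that suggestion with the uncalibrated placeholder constants from the comment above (the *100 weighting makes table_score dominate, so span_score mostly acts as a tie-breaker, and expit squashes the result into (0, 1)):

```python
import math

def expit(x: float) -> float:
    # Logistic sigmoid: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def combined_score(table_score: float, span_score: float) -> float:
    # Placeholder constants as proposed above; they would need calibration
    # so that typical scores land in a meaningful range.
    return expit((table_score * 100 + span_score) / 100)

# A better table wins even against a much higher span_score:
print(combined_score(1.5, 0.10))  # ~0.82
print(combined_score(0.2, 0.99))  # ~0.55
```

One caveat: unlike the tuple sort, this combination is only approximately lexicographic; a span_score difference larger than 100x a table_score difference could still flip the order.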

```python
self.span_output_bias = torch.nn.Parameter(torch.zeros([]))

# table scoring head
self.classifier = torch.nn.Linear(config.hidden_size, 2)
```
Member

Nice - looks very lean! If I get it right, we are still using the from_pretrained() model of the parent class (regular tapas) to load the weights for these extra layers?
I guess the tricky part in the conversion was to get the naming of layers right (during export and now import) to make sure we load the right weights here?

Contributor Author

Yes, exactly.

@CLAassistant

CLAassistant commented Jan 20, 2022

CLA assistant check
All committers have signed the CLA.

@bogdankostic
Contributor Author

@tholor
The output of the table scoring head consists of two logits: one is used for positive tables and the other for negative tables. Because the number of negative samples used for the original model checkpoints was approx. 10x larger than the number of positive samples, the table scoring head was heavily biased towards predicting that a table does not contain the answer to a question.
To solve this problem, I retrained the table scoring head while freezing all other parameters of the models. As training data, I used the NQTables dataset; for each positive sample, I sampled one random table as a negative table. I uploaded the adapted models to the model hub. (The original model checkpoints can still be accessed using revision="original".)
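The freezing step can be sketched like this. `ToyScoredQA` is a made-up stand-in module (the real `TapasForScoredQA` is in the PR); the pattern is the standard PyTorch one of disabling `requires_grad` on everything except the head being retrained:

```python
import torch

class ToyScoredQA(torch.nn.Module):
    """Toy stand-in: an encoder body plus the table scoring head."""
    def __init__(self, hidden: int = 8):
        super().__init__()
        self.encoder = torch.nn.Linear(hidden, hidden)  # stands in for the TAPAS body
        self.classifier = torch.nn.Linear(hidden, 2)    # table scoring head

model = ToyScoredQA()

# Freeze everything except the table scoring head:
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.weight', 'classifier.bias']
```

An optimizer built from `(p for p in model.parameters() if p.requires_grad)` would then update only the head.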

@bogdankostic bogdankostic requested a review from tholor January 23, 2022 23:14
@bogdankostic bogdankostic merged commit bbb65a1 into master Jan 31, 2022
@bogdankostic bogdankostic deleted the add_tapas_with_scores branch January 31, 2022 09:23