Add Tapas reader with scores #1997
Conversation
Looking good! Left a few comments. Still a bit unsure if the current implementation of scores makes sense or if we can do a better "first version" here 🤔
Once those points are resolved, we also need to update our documentation here: https://haystack.deepset.ai/guides/table-qa
haystack/nodes/reader/table.py
# Sort answers by score and select top-k answers
if isinstance(self.model, self.TapasForScoredQA):
    # Answers are sorted first by the general score for the tables and then by the answer span score
    answers = sorted(answers, reverse=True, key=lambda ans: (ans.meta["table_score"], ans.score))  # type: ignore
So in the sorted list of Answers that the reader returns, the first Answer.score might be lower than the second one? If yes, that might be confusing to show to end users ("Why does the first result have a lower score?"). Maybe we can combine table_score and answer_score somehow?
haystack/nodes/reader/table.py
# Get general table score
table_score = self.model.classifier(outputs.pooler_output)
table_score = table_score[0][1] - table_score[0][0]
What are the two elements that we subtract here? Can you add a comment? Is this score already scaled to [0, 1]?
During training, logit 1 is used for positive tables (i.e. tables that contain the answer) and logit 0 for negative tables. The score is not scaled to [0, 1]. Using softmax does not really help, as logit 0 always seems to be a lot higher than logit 1.
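For illustration only (the logit values below are made up, not taken from the model): the raw logit difference used as the table score equals the log-odds of the "positive table" class under a softmax, so it preserves the softmax ranking without squashing everything towards 0 when logit 0 dominates.

import torch

# Made-up logits for one table: [logit_0 (negative table), logit_1 (positive table)]
logits = torch.tensor([[4.2, -1.3]])

# Table score as computed in the diff above: raw logit difference
table_score = logits[0][1] - logits[0][0]

# Equivalent to the log-odds of the positive class under a softmax
probs = torch.softmax(logits, dim=1)
log_odds = torch.log(probs[0][1] / probs[0][0])
assert torch.allclose(table_score, log_odds, atol=1e-5)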
# Calculate score for each possible span
span_logits = torch.einsum("bsj,j->bs", concatenated_logit_tensors, self.model.span_output_weights) \
              + self.model.span_output_bias
span_logits_softmax = torch.nn.functional.softmax(span_logits, dim=1)
In our earlier discussion, I thought we would take the softmax over all answers. Here it's rather the softmax over the answers from a single table. IMO this only makes sense for the final Answer.score if we also include the table_score somehow. Otherwise, we might see high scores for "bad tables" that have a single "decent answer candidate" but no other good candidates. Have you tried this on a few examples? Did the ordering of the results / scores make sense?
@julian-risch What do you think?
Yes, Bogdan and I discussed that topic. We agreed (correct me if I am wrong @bogdankostic) that the table_score is more important for ranking than the span_score. Unfortunately, the scores of spans from different tables cannot be compared. To prevent multiple answers from having the exact same score (if we took the table_score only), we want to show the span_score to the user in addition to the table_score.
The scenario you describe, with a high score for a "bad table" that has a single "decent answer candidate" but no other good candidates, should be ranked low based on the table_score.
If we combine table_score and span_score, we would need to ensure that sorting by the combined score still sorts the results first by table_score and then by span_score: no matter how large the span_score of a result, it should be ranked lower than any other result with a larger table_score.
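For illustration, a toy sketch of the tuple ordering described above (all values made up): Python's tuple comparison only consults the second element when the first elements are equal, which gives exactly this guarantee.

# Hypothetical answer candidates with made-up scores
candidates = [
    {"answer": "A", "table_score": 2.5, "span_score": 0.10},
    {"answer": "B", "table_score": 1.0, "span_score": 0.99},  # strong span, weak table
    {"answer": "C", "table_score": 2.5, "span_score": 0.80},
]

# Sort first by table_score, then by span_score (as in the diff above)
ranked = sorted(candidates, reverse=True, key=lambda c: (c["table_score"], c["span_score"]))
print([c["answer"] for c in ranked])  # ['C', 'A', 'B'] -- B never outranks a better table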
Yeah, I think the sorting is alright, but imagine TableQA in our current demo UI (or the DC Search UI). As a user, you see the answers one after the other, and each of them has a single "relevance score" in the UI. What would you display there right now? The current Answer.score would be misleading here, as score(result_rank_1) < score(result_rank_2).
How about adding both individual scores to meta and using a combined one in Answer.score? Something in the direction of score = expit((table_score * 100 + span_score) / 100), calibrating the constants there to something that results in a meaningful range between 0 and 1.
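A minimal sketch of that suggestion (the constants are placeholders that would still need calibration; expit is the logistic sigmoid from SciPy):

from scipy.special import expit  # logistic sigmoid: 1 / (1 + exp(-x))

def combined_score(table_score: float, span_score: float) -> float:
    # Weight the table score much more heavily than the span score so the
    # table ranking dominates, then squash the result into (0, 1).
    return expit((table_score * 100 + span_score) / 100)

# Made-up example values:
print(combined_score(table_score=3.0, span_score=0.9))    # ~0.95
print(combined_score(table_score=-2.0, span_score=0.99))  # ~0.12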
self.span_output_bias = torch.nn.Parameter(torch.zeros([]))

# table scoring head
self.classifier = torch.nn.Linear(config.hidden_size, 2)
Nice - looks very lean! If I get it right, we are still using the from_pretrained() method of the parent class (regular Tapas) to load the weights for these extra layers?
I guess the tricky part in the conversion was to get the naming of the layers right (during export and now import) to make sure we load the right weights here?
Yes, exactly.
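Roughly, the mechanism being confirmed here looks like the sketch below (attribute names and sizes are approximations, not the exact PR code): the extra heads are plain attributes on a TAPAS subclass, so from_pretrained() of the parent class loads their weights as long as the parameter names match the keys in the converted checkpoint.

import torch
from transformers import TapasConfig, TapasModel, TapasPreTrainedModel

class TapasForScoredQA(TapasPreTrainedModel):
    def __init__(self, config: TapasConfig):
        super().__init__(config)
        # base TAPAS encoder
        self.tapas = TapasModel(config)
        # span scoring head: scores the concatenated start/end token representations
        self.span_output_weights = torch.nn.Parameter(torch.zeros(2 * config.hidden_size))
        self.span_output_bias = torch.nn.Parameter(torch.zeros([]))
        # table scoring head: two logits (negative table / positive table)
        self.classifier = torch.nn.Linear(config.hidden_size, 2)

# from_pretrained() maps checkpoint keys onto these attributes by name;
# any mismatch would show up as missing / unexpected keys when loading.
# model = TapasForScoredQA.from_pretrained("<converted-checkpoint-id>")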
@tholor
This PR adds support for the table reader models described in the paper 'Open Domain Question Answering over Tables via Dense Retrieval' by Herzig et al. (2021). Their advantage compared to the existing Tapas table reader models is that they provide meaningful answer scores, so the resulting list of answers can be sorted by confidence. Keep in mind that the answers are sorted first by a general table score (stored in the answer's meta field) and then by the answer span score.
One disadvantage is that these models are not capable of answering questions that require aggregation over multiple cells.
The original TensorFlow model checkpoints (available here) have been converted to PyTorch and uploaded to the Hugging Face model hub (here and here). The conversion script can be found here (in a fork of the transformers repository).
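For context, a usage sketch of how such a model would be used with Haystack's TableReader (the model id is a placeholder for one of the converted checkpoints linked above, and the table values are made up):

import pandas as pd
from haystack import Document
from haystack.nodes import TableReader

# Placeholder model id -- substitute one of the converted checkpoints from the model hub
reader = TableReader(model_name_or_path="<converted-tapas-scored-qa-checkpoint>")

table = pd.DataFrame({"Actor": ["Brad Pitt", "Leonardo DiCaprio"], "Age": ["58", "47"]})
document = Document(content=table, content_type="table")

prediction = reader.predict(query="How old is Brad Pitt?", documents=[document], top_k=3)
for answer in prediction["answers"]:
    # Answers are ordered by the table score (in answer.meta) first, then by the span score
    print(answer.answer, answer.score, answer.meta.get("table_score"))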