use starting offsets in the srl model so output is wellformed #2972
Conversation
I'm pretty confused about what would happen if you wanted to use BIOUL with this, and comments could be clearer in a few places, but otherwise LGTM.
start_transitions = torch.zeros(num_labels)

for i, label in all_labels.items():
    if label[0] == "I":
Are we guaranteed to always be using BIO here?
Yes 👍
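For context, here is a minimal sketch of the constraint being built in the snippet above, assuming a BIO label vocabulary. Only the three quoted lines come from the diff; the function name, the docstring, and the use of float("-inf") as the masking value are assumptions:

```python
import torch

def get_start_transitions(all_labels):
    """
    Sketch: build a (num_labels,) tensor of scores for starting a sequence on each tag.
    ``all_labels`` maps label index -> label string, e.g. {0: "O", 1: "B-ARG0", 2: "I-ARG0"}.
    Under BIO, a sequence may start with "O" or any "B-" tag, but never an "I-" tag,
    so those entries get a score of -inf and everything else stays at 0.
    """
    num_labels = len(all_labels)
    start_transitions = torch.zeros(num_labels)
    for i, label in all_labels.items():
        if label[0] == "I":
            start_transitions[i] = float("-inf")
    return start_transitions
```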
allennlp/models/srl_bert.py (Outdated)
NOTE: First, we decode a BIO sequence on top of the wordpieces. This is important; viterbi
decoding produces low quality output if you decode on top of word representations directly,
because the model already learns very strong preferences for the BIO tag type.
This last phrase isn't clear to me.
Basically if you select out the distributions corresponding to words from the wordpiece sequence and then run inference, the model gets really messed up because it's trained to do BIO tagging on wordpieces - not words.
I see, ok.
I clarified the comment
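A rough sketch of the two orderings contrasted in this thread (the helper names, shapes, and offsets are hypothetical; only `viterbi_decode` comes from `allennlp.nn.util`):

```python
import torch
from allennlp.nn.util import viterbi_decode

def decode_then_select(wordpiece_logits, start_offsets, transition_matrix):
    # The ordering the comment recommends: run Viterbi over the full wordpiece sequence
    # (the distribution the model was actually trained to produce), then keep only the
    # tag at each word's first wordpiece.
    wordpiece_tags, _ = viterbi_decode(wordpiece_logits, transition_matrix)
    return [wordpiece_tags[i] for i in start_offsets]

def select_then_decode(wordpiece_logits, start_offsets, transition_matrix):
    # The problematic ordering described above: pick out word-level scores first, then
    # decode. The model's tag preferences were learned over wordpieces, not words, so
    # this tends to produce lower quality sequences.
    word_logits = wordpiece_logits[start_offsets]
    word_tags, _ = viterbi_decode(word_logits, transition_matrix)
    return word_tags
```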
to convert the labels to tags using the end_offsets. However, when we are decoding a
BIO sequence inside the SRL model itself, it's important that we use the start_offsets,
because otherwise we might select an ill-formed BIO sequence from the BIO sequence on top of
wordpieces (this happens in the case that a word is split into multiple word pieces,
I think this comment would be more clear with a simple example.
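An illustrative example of the failure mode (the tokens, tags, and offsets here are made up, not taken from the PR):

```python
# The word "Frodo" is split into two wordpieces, and the wordpiece-level BIO decode
# tags the first piece "B-ARG0" and the continuation piece "I-ARG0".
wordpieces     = ["[CLS]", "Fro", "##do", "walked", "[SEP]"]
wordpiece_tags = ["O", "B-ARG0", "I-ARG0", "B-V", "O"]

start_offsets = [1, 3]  # index of the first wordpiece of each word
end_offsets   = [2, 3]  # index of the last wordpiece of each word

print([wordpiece_tags[i] for i in start_offsets])  # ['B-ARG0', 'B-V'] -> well-formed BIO
print([wordpiece_tags[i] for i in end_offsets])    # ['I-ARG0', 'B-V'] -> ill-formed: I- tag with no preceding B-
```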
Secondly, it's important that the indices we use to recover words from the wordpieces are the
start_offsets (i.e offsets which correspond to using the first wordpiece of words which are
tokenized into multiple wordpieces) as otherwise, we might get an ill-formed BIO sequence
I'm a bit confused about what's going on here, probably because I'm lacking context on how word tags are split into wordpiece tags in the first place. If you get a bunch of `I-` tags from taking the last wordpiece, wouldn't you get a bunch of `B-` tags from taking the first one? Or is it that all continuation wordpieces always have `I-`, but the first wordpiece will have the original word tag? Then this makes more sense.
Unless you're using BIOUL, then this still gets messed up, because you might have `L-`, `I-`...
Yes, it's basically that we need the `B-XXX` tag and not the `I-XXX` tag for words that have been split into multiple wordpieces.
allennlp/nn/util.py (Outdated)
@@ -421,6 +423,14 @@ def viterbi_decode(tag_sequence: torch.Tensor,
other, or those transitions are extremely unlikely. In this situation we log a
warning, but the responsibility for providing self-consistent evidence ultimately
lies with the user.
allowed_start_transitions : torch.Tensor, optional, (default = None)
An optional tensor of shape (num_tags,) describing which tags the START token
may transition TO. If provided, additional transition constraints will be used for
Probably clearer to use *to* instead of TO. Same below.
done
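For reference, a hypothetical usage sketch of the new keyword arguments (the exact signature of `viterbi_decode` is assumed from the docstring quoted in the diff above; the tag set and scores are made up):

```python
import torch
from allennlp.nn.util import viterbi_decode

tags = ["O", "B-ARG0", "I-ARG0"]
num_tags = len(tags)

tag_sequence = torch.randn(5, num_tags)       # per-timestep tag scores
transition_matrix = torch.zeros(num_tags, num_tags)
transition_matrix[0, 2] = float("-inf")       # forbid O -> I-ARG0

# Disallow starting the sequence on the I- tag; any tag may end the sequence.
allowed_start_transitions = torch.zeros(num_tags)
allowed_start_transitions[2] = float("-inf")
allowed_end_transitions = torch.zeros(num_tags)

path, score = viterbi_decode(tag_sequence,
                             transition_matrix,
                             allowed_start_transitions=allowed_start_transitions,
                             allowed_end_transitions=allowed_end_transitions)
print(path, score)
```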
…i#2972)

* use starting offsets in the srl model so output is wellformed
* fix bug in viterbi_decode for constrained start and end sequences
* add failing tests for srl models without viterbi constraint
* fix srl models to use start transitions for bio tagging
* lint
* fix random bug surfaced in openie predictor
* fix more openie tests
* clarify comments, PR feedback
This PR does 3 mildly inter-related things:

1. Removes some unnecessary index shifting (randomly adding and minusing one everywhere I was using wordpieces).
2. Fixes a problem with decoding using end indices in the SRL model, which caused tags for a word which was split into multiple word pieces to produce an ill-formed BIO sequence.
3. Fixes a bug/missing functionality in `viterbi_decode` where it was impossible to specify start/end transitions.