
use starting offsets in the srl model so output is wellformed #2972

Merged
merged 10 commits into allenai:master on Jun 20, 2019

Conversation

DeNeutoy
Contributor

@DeNeutoy DeNeutoy commented Jun 19, 2019

This PR does 3 mildly inter-related things:

  1. Removes some unnecessary index shifting (randomly adding and subtracting one everywhere I was using wordpieces)

  2. Fixes a problem with decoding using end indices in the SRL model, which caused the tags for a word that was split into multiple wordpieces to produce an ill-formed BIO sequence.

  3. Fixes a bug/missing functionality in viterbi_decode where it was impossible to specify start/end transitions.

@DeNeutoy DeNeutoy requested a review from matt-gardner June 19, 2019 21:42
Contributor

@matt-gardner matt-gardner left a comment


I'm pretty confused about what would happen if you wanted to use BIOUL with this, and comments could be clearer in a few places, but otherwise LGTM.

start_transitions = torch.zeros(num_labels)

for i, label in all_labels.items():
    if label[0] == "I":
Contributor

Are we guaranteed to always be using BIO here?

Contributor Author

Yes 👍
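
For context, a minimal sketch of what the truncated snippet above plausibly completes to, given that BIO is guaranteed here: each I- tag gets a -inf start potential so a decoded sequence cannot begin on it. The label vocabulary below is made up for illustration.

import torch

all_labels = {0: "B-ARG0", 1: "I-ARG0", 2: "B-V", 3: "O"}  # hypothetical label vocabulary
num_labels = len(all_labels)

# Sequences may not start on an I- tag, so forbid those as start transitions.
start_transitions = torch.zeros(num_labels)
for i, label in all_labels.items():
    if label[0] == "I":
        start_transitions[i] = float("-inf")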


NOTE: First, we decode a BIO sequence on top of the wordpieces. This is important; viterbi
decoding produces low quality output if you decode on top of word representations directly,
because the model already learns very strong preferences for the BIO tag type.
Contributor

This last phrase isn't clear to me.

Contributor Author

Basically if you select out the distributions corresponding to words from the wordpiece sequence and then run inference, the model gets really messed up because it's trained to do BIO tagging on wordpieces - not words.

Contributor

I see, ok.

Contributor Author

I clarified the comment

to convert the labels to tags using the end_offsets. However, when we are decoding a
BIO sequence inside the SRL model itself, it's important that we use the start_offsets,
because otherwise we might select an ill-formed BIO sequence from the BIO sequence on top of
wordpieces (this happens in the case that a word is split into multiple word pieces,
Contributor

I think this comment would be more clear with a simple example.


Secondly, it's important that the indices we use to recover words from the wordpieces are the
start_offsets (i.e offsets which correspond to using the first wordpiece of words which are
tokenized into multiple wordpieces) as otherwise, we might get an ill-formed BIO sequence
Contributor

I'm a bit confused about what's going on here, probably because I'm lacking context on how word tags are split into wordpiece tags in the first place. If you get a bunch of I- tags from taking the last wordpiece, wouldn't you get a bunch of B- tags from taking the first one? Or is it that all continuation wordpieces always have I-, but the first wordpiece will have the original word tag? Then this makes more sense.

Unless you're using BIOUL, in which case this still gets messed up, because you might have L-, I-...

Contributor Author

Yes, it's basically that we need the B-XXX tag and not the I-XXX tag for words that have been split into multiple wordpieces.
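
As an illustration of the point above (sentence, tags, and offsets are made up): suppose "carrying" splits into two wordpieces. The wordpiece-level BIO tags are well formed, but recovering word-level tags via end offsets picks up the split word's I- tag and drops its B- tag, while start offsets keep the sequence well formed.

wordpieces = ["[CLS]", "he", "was", "carry", "##ing", "it", "[SEP]"]
wordpiece_tags = ["O", "B-ARG0", "O", "B-V", "I-V", "B-ARG1", "O"]

start_offsets = [1, 2, 3, 5]  # index of the first wordpiece of each word
end_offsets = [1, 2, 4, 5]    # index of the last wordpiece of each word

by_start = [wordpiece_tags[i] for i in start_offsets]  # ['B-ARG0', 'O', 'B-V', 'B-ARG1'] -> well formed
by_end = [wordpiece_tags[i] for i in end_offsets]      # ['B-ARG0', 'O', 'I-V', 'B-ARG1'] -> I-V with no preceding B-V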

@@ -421,6 +423,14 @@ def viterbi_decode(tag_sequence: torch.Tensor,
other, or those transitions are extremely unlikely. In this situation we log a
warning, but the responsibility for providing self-consistent evidence ultimately
lies with the user.
allowed_start_transitions : torch.Tensor, optional, (default = None)
An optional tensor of shape (num_tags,) describing which tags the START token
may transition TO. If provided, additional transition constraints will be used for
Contributor

Probably clearer to use *to* instead of TO. Same below.

Contributor Author

done
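
A hedged usage sketch of the new argument documented above: the label set and scores are made up, and the call assumes viterbi_decode in allennlp.nn.util accepts allowed_start_transitions as described in the docstring.

import torch
from allennlp.nn.util import viterbi_decode

labels = ["B-ARG0", "I-ARG0", "B-V", "O"]
num_labels = len(labels)

# Pairwise potentials: an I- tag may only follow its own B- or I- tag.
transition_matrix = torch.zeros(num_labels, num_labels)
for i, previous_label in enumerate(labels):
    for j, label in enumerate(labels):
        if i != j and label[0] == "I" and previous_label != "B" + label[1:]:
            transition_matrix[i, j] = float("-inf")

# The START token may transition to any tag except an I- tag.
start_transitions = torch.zeros(num_labels)
for i, label in enumerate(labels):
    if label[0] == "I":
        start_transitions[i] = float("-inf")

tag_scores = torch.randn(6, num_labels)  # fake per-wordpiece tag scores
viterbi_path, viterbi_score = viterbi_decode(
    tag_scores, transition_matrix, allowed_start_transitions=start_transitions
)
print([labels[i] for i in viterbi_path])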

@DeNeutoy DeNeutoy merged commit 51b74b1 into allenai:master Jun 20, 2019
reiyw pushed a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019
use starting offsets in the srl model so output is wellformed (allenai#2972)

* use starting offsets in the srl model so output is wellformed

* fix bug in viterbi_decode for constrained start and end sequences

* add failing tests for srl models without viterbi constraint

* fix srl models to use start transitions for bio tagging

* lint

* fix random bug surfaced in openie predictor

* fix more openie tests

* clarify comments, PR feedback