use starting offsets in the srl model so output is wellformed #2972
Conversation
I'm pretty confused about what would happen if you wanted to use BIOUL with this, and comments could be clearer in a few places, but otherwise LGTM.
start_transitions = torch.zeros(num_labels)

for i, label in all_labels.items():
    if label[0] == "I":
Are we guaranteed to always be using BIO here?
Yes 👍
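For context, here is a minimal sketch of the constraint being built in the snippet above, assuming a BIO label vocabulary. Only the three quoted lines come from the diff; the function name, the docstring, and the use of float("-inf") as the masking value are assumptions:

```python
import torch

def get_start_transitions(all_labels):
    """
    Sketch: build a (num_labels,) tensor of scores for starting a sequence on each tag.
    ``all_labels`` maps label index -> label string, e.g. {0: "O", 1: "B-ARG0", 2: "I-ARG0"}.
    Under BIO, a sequence may start with "O" or any "B-" tag, but never an "I-" tag,
    so those entries get a score of -inf and everything else stays at 0.
    """
    num_labels = len(all_labels)
    start_transitions = torch.zeros(num_labels)
    for i, label in all_labels.items():
        if label[0] == "I":
            start_transitions[i] = float("-inf")
    return start_transitions
```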
allennlp/models/srl_bert.py (Outdated)
NOTE: First, we decode a BIO sequence on top of the wordpieces. This is important; viterbi
decoding produces low quality output if you decode on top of word representations directly,
because the model already learns very strong preferences for the BIO tag type.
This last phrase isn't clear to me.
Basically if you select out the distributions corresponding to words from the wordpiece sequence and then run inference, the model gets really messed up because it's trained to do BIO tagging on wordpieces - not words.
I see, ok.
I clarified the comment
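A rough sketch of the two orderings contrasted in this thread (the helper names, shapes, and offsets are hypothetical; only `viterbi_decode` comes from `allennlp.nn.util`):

```python
import torch
from allennlp.nn.util import viterbi_decode

def decode_then_select(wordpiece_logits, start_offsets, transition_matrix):
    # The ordering the comment recommends: run Viterbi over the full wordpiece sequence
    # (the distribution the model was actually trained to produce), then keep only the
    # tag at each word's first wordpiece.
    wordpiece_tags, _ = viterbi_decode(wordpiece_logits, transition_matrix)
    return [wordpiece_tags[i] for i in start_offsets]

def select_then_decode(wordpiece_logits, start_offsets, transition_matrix):
    # The problematic ordering described above: pick out word-level scores first, then
    # decode. The model's tag preferences were learned over wordpieces, not words, so
    # this tends to produce lower quality sequences.
    word_logits = wordpiece_logits[start_offsets]
    word_tags, _ = viterbi_decode(word_logits, transition_matrix)
    return word_tags
```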
to convert the labels to tags using the end_offsets. However, when we are decoding a
BIO sequence inside the SRL model itself, it's important that we use the start_offsets,
because otherwise we might select an ill-formed BIO sequence from the BIO sequence on top of
wordpieces (this happens in the case that a word is split into multiple word pieces,
I think this comment would be more clear with a simple example.
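An illustrative example of the failure mode (the tokens, tags, and offsets here are made up, not taken from the PR):

```python
# The word "Frodo" is split into two wordpieces, and the wordpiece-level BIO decode
# tags the first piece "B-ARG0" and the continuation piece "I-ARG0".
wordpieces     = ["[CLS]", "Fro", "##do", "walked", "[SEP]"]
wordpiece_tags = ["O", "B-ARG0", "I-ARG0", "B-V", "O"]

start_offsets = [1, 3]  # index of the first wordpiece of each word
end_offsets   = [2, 3]  # index of the last wordpiece of each word

print([wordpiece_tags[i] for i in start_offsets])  # ['B-ARG0', 'B-V'] -> well-formed BIO
print([wordpiece_tags[i] for i in end_offsets])    # ['I-ARG0', 'B-V'] -> ill-formed: I- tag with no preceding B-
```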
Secondly, it's important that the indices we use to recover words from the wordpieces are the
start_offsets (i.e offsets which correspond to using the first wordpiece of words which are
tokenized into multiple wordpieces) as otherwise, we might get an ill-formed BIO sequence
I'm a bit confused about what's going on here, probably because I'm lacking context on how word tags are split into wordpiece tags in the first place. If you get a bunch of `I-` tags from taking the last wordpiece, wouldn't you get a bunch of `B-` tags from taking the first one? Or is it that all continuation wordpieces always have `I-`, but the first wordpiece will have the original word tag? Then this makes more sense.
Unless you're using BIOUL, then this still gets messed up, because you might have `L-`, `I-`...
Yes, it's basically that we need the `B-XXX` tag and not the `I-XXX` tag for words that have been split into multiple wordpieces.
allennlp/nn/util.py (Outdated)
@@ -421,6 +423,14 @@ def viterbi_decode(tag_sequence: torch.Tensor,
other, or those transitions are extremely unlikely. In this situation we log a
warning, but the responsibility for providing self-consistent evidence ultimately
lies with the user.
allowed_start_transitions : torch.Tensor, optional, (default = None)
An optional tensor of shape (num_tags,) describing which tags the START token
may transition TO. If provided, additional transition constraints will be used for
Probably clearer to use *to* instead of TO. Same below.
done
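For reference, a hypothetical usage sketch of the new keyword arguments (the exact signature of `viterbi_decode` is assumed from the docstring quoted in the diff above; the tag set and scores are made up):

```python
import torch
from allennlp.nn.util import viterbi_decode

tags = ["O", "B-ARG0", "I-ARG0"]
num_tags = len(tags)

tag_sequence = torch.randn(5, num_tags)       # per-timestep tag scores
transition_matrix = torch.zeros(num_tags, num_tags)
transition_matrix[0, 2] = float("-inf")       # forbid O -> I-ARG0

# Disallow starting the sequence on the I- tag; any tag may end the sequence.
allowed_start_transitions = torch.zeros(num_tags)
allowed_start_transitions[2] = float("-inf")
allowed_end_transitions = torch.zeros(num_tags)

path, score = viterbi_decode(tag_sequence,
                             transition_matrix,
                             allowed_start_transitions=allowed_start_transitions,
                             allowed_end_transitions=allowed_end_transitions)
print(path, score)
```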
…i#2972)

* use starting offsets in the srl model so output is wellformed
* fix bug in viterbi_decode for constrained start and end sequences
* add failing tests for srl models without viterbi constraint
* fix srl models to use start transitions for bio tagging
* lint
* fix random bug surfaced in openie predictor
* fix more openie tests
* clarify comments, PR feedback
This PR does 3 mildly inter-related things:

1. Removes some unnecessary index shifting (randomly adding and minusing one everywhere I was using wordpieces).
2. Fixes a problem with decoding using end indices in the SRL model, which caused tags for a word which was split into multiple word pieces to produce an ill-formed BIO sequence.
3. Fixes a bug/missing functionality in `viterbi_decode` where it was impossible to specify start/end transitions.