Decompose LanguageModel contextualizer into forward_ and backward_ contextualizer #2438
Conversation
FYI @brendan-ai2, I was able to retrain the model at /~https://github.com/allenai/allennlp/blob/master/training_config/constituency_parser_transformer_elmo.jsonnet with these changes, so it seems backwards compatibility is maintained.
self._backward_contextualizer(embeddings, mask))
# Concatenate the backward contextual embeddings to the
# forward contextual embeddings
if (isinstance(contextual_embeddings, list) and
@brendan-ai2, I'm struggling to test this case (where `contextual_embeddings` and `backward_contextual_embeddings` are lists), since I'm not sure when they would return lists :) any pointers?
Blargh. :/ The contextualizer can be made to return all layers for embedding.
There basically isn't a real API in place right now as this is specific to the transformer contextualizer. We definitely need to figure out how this plays with having more general contextualizers... It is an important feature.
Yeah, seems hard to get the per-layer outputs of a multilayer pytorch LSTM / other contextualizers in a flexible fashion...
Off the top of my head, a solution would be to modify /~https://github.com/allenai/allennlp/blob/master/allennlp/modules/seq2seq_encoders/seq2seq_encoder.py#L5 so that it takes a constructor arg `return_all_layers` (like with `lazy` in the `DatasetReader`) and then have the subclasses do the appropriate thing. Maybe also add `forward` as an explicit method in `Seq2SeqEncoder` in order to clearly document the behavior and the type.
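For illustration, here is a rough sketch of what that constructor-arg approach could look like; the class body below is hypothetical and is not AllenNLP's actual `Seq2SeqEncoder` implementation.

from typing import List, Union

import torch


class Seq2SeqEncoder(torch.nn.Module):
    """Sketch only: the real AllenNLP base class has more machinery."""

    def __init__(self, return_all_layers: bool = False) -> None:
        super().__init__()
        self.return_all_layers = return_all_layers

    def forward(self, inputs: torch.Tensor, mask: torch.Tensor) -> Union[torch.Tensor, List[torch.Tensor]]:
        # `inputs` has shape (batch_size, sequence_length, embedding_dim).
        # Subclasses return a single tensor of contextualized embeddings,
        # or a list of per-layer tensors when `return_all_layers` is True.
        raise NotImplementedError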
I think the main trouble is that the PyTorch LSTM doesn't return the outputs for all layers and all timesteps?
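For reference, that constraint is visible directly in the return values of the stock `torch.nn.LSTM` (shapes follow the standard PyTorch API; the sizes below are arbitrary):

import torch

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, num_layers=3, batch_first=True)
inputs = torch.randn(2, 5, 8)   # (batch, sequence, embedding)

output, (h_n, c_n) = lstm(inputs)
print(output.shape)  # torch.Size([2, 5, 16]): top layer only, every timestep
print(h_n.shape)     # torch.Size([3, 2, 16]): every layer, final timestep only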
Excellent point, @matt-gardner.
@nelson-liu, I should elaborate a bit. The pure object-oriented approach would be to have a `MultiLayerSeq2SeqEncoder` that subclasses the existing `Seq2SeqEncoder`. Annoyingly, this would break anyone who's using a vanilla `Seq2SeqEncoder` currently. Another issue is that when many of these features are independent, there is a risk of the class hierarchy becoming very deep, resulting in, say, `MultiLayerSeq2SeqEncoderWithFooAndBar`, `MultiLayerSeq2SeqEncoderWithFooAndNotBar`, etc. There are a few ways to guard against this. First, one can make judicious use of defaults -- at the risk of weakening (if not breaking) the abstraction, as Matt points out. Another option would be to use mixins (though that would also break existing users). You can also adopt a sort of manual approach, which might involve exposing `is_multi_layer` for user code to query. This is ugly, but it can be quite flexible and aid in backwards compatibility.
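A minimal sketch of that manual approach, assuming a hypothetical encoder and an `is_multi_layer` query (none of these names exist in AllenNLP):

from typing import List

import torch


class ToyMultiLayerEncoder(torch.nn.Module):
    """Hypothetical encoder used only to illustrate the `is_multi_layer` query."""

    def __init__(self, dim: int, num_layers: int) -> None:
        super().__init__()
        self.layers = torch.nn.ModuleList([torch.nn.Linear(dim, dim) for _ in range(num_layers)])

    def is_multi_layer(self) -> bool:
        return True

    def forward(self, inputs: torch.Tensor) -> List[torch.Tensor]:
        outputs, hidden = [], inputs
        for layer in self.layers:
            hidden = torch.relu(layer(hidden))
            outputs.append(hidden)
        return outputs


encoder = ToyMultiLayerEncoder(dim=8, num_layers=2)
embeddings = torch.randn(2, 5, 8)
# User code queries the flag instead of relying on the class hierarchy.
if getattr(encoder, "is_multi_layer", lambda: False)():
    per_layer_outputs = encoder(embeddings)   # List[torch.Tensor], one per layer
else:
    top_layer_output = encoder(embeddings)    # a single torch.Tensor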
Let's figure out how/whether to obtain the extra layers and then proceed with caution as Matt suggests. :)
Sorry, I missed this mention. We don't need to handle the case where the contextualizer returns multiple layers here, because it will never do that during training.
I think the problem here really stems from the fact that we have been passing around the `return_all_layers` argument as a constructor parameter rather than as a runtime `forward` parameter (or as a separate method on a `Contextualizer` subclass of `Seq2SeqEncoder`).
Is the right API here something like this:
class Seq2SeqEncoder: ...

class Contextualizer(Seq2SeqEncoder):
    # Idea 1:
    def forward(sequence: torch.Tensor[batch, sequence, embedding],
                return_all_layers=False) -> Union[torch.Tensor[batch, sequence, embedding], List[of the same thing]]
    # this way all Contextualizers can still function as `Seq2SeqEncoders` by default,
    # which is useful for training and perhaps for downstream use, say if you wanted to
    # fine tune an encoder and didn't want to do an elmo mixture. However, all
    # `ContextualTokenEmbedders` could call `Contextualizer.forward(return_all_layers=True)`.

    # Second idea:
    def forward(sequence: torch.Tensor[batch, sequence, embedding]) -> torch.Tensor[batch, sequence, embedding]
    def get_layers(sequence: torch.Tensor) -> List[torch.Tensor[batch, sequence, embedding]]
    # A separate method which basically implements the functionality above.
    # This would get around the type problems of `forward` possibly returning lists
    # of tensors when used as a `Seq2SeqEncoder`.
This doesn't handle Nelson's case that the contextualizers might be stateful during training. I haven't thought hard about that because I'm not 100% convinced that we need to support it as part of a concrete API yet, but I'm happy to be convinced!
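To make the second idea above concrete, here is a minimal runnable sketch with a toy contextualizer; the class and its internals are illustrative only, not the proposed AllenNLP implementation.

from typing import List

import torch


class ToyContextualizer(torch.nn.Module):
    """Toy stand-in for a `Contextualizer` exposing a separate `get_layers` method."""

    def __init__(self, input_dim: int, hidden_dim: int, num_layers: int = 2) -> None:
        super().__init__()
        # One single-layer LSTM per "layer" so that every layer's output is recoverable.
        self._layers = torch.nn.ModuleList(
            [torch.nn.LSTM(input_dim if i == 0 else hidden_dim, hidden_dim, batch_first=True)
             for i in range(num_layers)]
        )

    def forward(self, sequence: torch.Tensor) -> torch.Tensor:
        # Behaves like a plain Seq2SeqEncoder: returns only the top layer.
        return self.get_layers(sequence)[-1]

    def get_layers(self, sequence: torch.Tensor) -> List[torch.Tensor]:
        # Returns every layer's output, each of shape (batch, sequence, hidden_dim),
        # which is what a ContextualTokenEmbedder would consume for ELMo-style mixing.
        outputs, hidden = [], sequence
        for lstm in self._layers:
            hidden, _ = lstm(hidden)
            outputs.append(hidden)
        return outputs


contextualizer = ToyContextualizer(input_dim=8, hidden_dim=16)
sequence = torch.randn(2, 5, 8)                    # (batch, sequence, embedding)
top_layer = contextualizer(sequence)               # torch.Tensor, as a Seq2SeqEncoder would return
all_layers = contextualizer.get_layers(sequence)   # List[torch.Tensor], one per layer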
More thoughts later, but:

> This doesn't handle Nelson's case that the contextualizers might be stateful during training. I haven't thought hard about that because I'm not 100% convinced that we need to support it as part of a concrete API yet, but I'm happy to be convinced!
How else would you do language modeling of contiguous text? At the very least, it'd be useful to add the (left-to-right) LM functionality to AllenNLP so people can easily train on contiguous-text datasets like the PTB or wikitext. I think having the bidirectional contiguous text LM would be useful as well, since I'm actually curious to run the experiment and find out whether contiguous text matters (I suspect it does, or at least doesn't hurt). Bidirectionality is definitely crucial to getting the best contextual representations.
Would #2716 help here?
Cool! To double-check: when you retrained the constituency parser, which trained LM did you use to perform the embedding? Ideally it would be the old one I linked you previously. Also, FYI, my responses will be slow/sporadic this week due to AAAI.
Yup, I used the old one.
No worries, have a good time!
@nelson-liu, just wanted to check in on this and make sure I'm not blocking you. Are you still proceeding with this approach?
Sorry it took so long to add my comments to this; I missed the first mention!
@@ -74,6 +75,23 @@ class LanguageModel(Model):
    contextualizer: ``Seq2SeqEncoder``
        Used to "contextualize" the embeddings. As described above,
        this encoder must not cheat by peeking ahead.

        .. deprecated:: 0.8.2
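For context, deprecating a constructor argument is commonly paired with a runtime warning when the old argument is still passed. A simplified sketch of that pattern (hypothetical class, not the PR's actual code):

import warnings


class LanguageModelSketch:
    """Sketch only: illustrates deprecating a constructor argument."""

    def __init__(self, contextualizer=None, forward_contextualizer=None, backward_contextualizer=None) -> None:
        if contextualizer is not None:
            warnings.warn(
                "The `contextualizer` argument is deprecated in favour of "
                "`forward_contextualizer` and `backward_contextualizer`.",
                DeprecationWarning,
            )
        # Fall back to the old argument so existing configs keep working.
        self._forward_contextualizer = forward_contextualizer or contextualizer
        self._backward_contextualizer = backward_contextualizer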
Question: To me this seems like we are ripping out quite a good API for one which is more complicated. We already have a trained `BidirectionalLanguageModel` which uses a bidirectional transformer and which is very useful to many people (e.g. GitHub issues; Swabha is using it in her research), which makes me unsure that deprecating a key component of it is a good idea. Do we know that bidirectional language modeling of contiguous text is worth it? I can't imagine a downstream task where you actually need unlimited context (and even if there was one, how that would practically work). Is it worth ripping this into a separate repo, making the changes, and double-checking that contiguous text does something useful?
Additionally, we now have a `BidirectionalLanguageModel` subclass which is subsumed by this code (I think). This is a "bad code smell" to me when we have a subclass which only differs in constructor arguments and we only have a single example of how those arguments might differ (it triggers my `MultiClassMultipleChoiceMemoryNetwork(MemoryNetwork, MultiClass, MultipleChoice)` sense from DeepQa).
Leaving the API decision to others, here's some motivation for enabling modeling of longer contexts: https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html. They basically make a stateful transformer, showing pretty big gains.
@nelson-liu, how do you want to proceed with this?
I plan on coming back to this eventually; I just got caught up with some other stuff...
@nelson-liu, is now a good time to come back to this?
Closing due to inactivity and the Age Of The Transformer, which makes this PR not super important.
As discussed in #2373, this PR adds `forward_contextualizer` and `backward_contextualizer` arguments to the `LanguageModel`, towards bidirectional language modeling of contiguous text.
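For readers skimming the diff, a minimal sketch of the shape of the change (hypothetical helper, not the exact PR code): each direction contextualizes the embeddings separately, and the backward output is concatenated onto the forward output along the feature dimension, per layer if the contextualizers return all layers.

from typing import List, Union

import torch

TensorOrLayers = Union[torch.Tensor, List[torch.Tensor]]


def combine_directions(forward_out: TensorOrLayers, backward_out: TensorOrLayers) -> TensorOrLayers:
    # Concatenate the backward contextual embeddings onto the forward ones
    # along the feature dimension, layer by layer if lists were returned.
    if isinstance(forward_out, list) and isinstance(backward_out, list):
        return [torch.cat([f, b], dim=-1) for f, b in zip(forward_out, backward_out)]
    return torch.cat([forward_out, backward_out], dim=-1)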