This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Decompose LanguageModel contextualizer into forward_ and backward_ contextualizer #2438

Closed
wants to merge 17 commits

Conversation

nelson-liu
Contributor

As discussed in #2373, this PR adds forward_contextualizer and backward_contextualizer arguments to the LanguageModel, working towards bidirectional language modeling of contiguous text.
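Roughly, the shape of the change is as follows (a minimal sketch based on the description above, not the actual diff; direction handling, masking details, and the list-valued outputs discussed below are omitted):

from typing import Optional

import torch
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder


class LanguageModelSketch(torch.nn.Module):
    # Illustrative only: the real LanguageModel takes many more arguments.
    def __init__(self,
                 forward_contextualizer: Seq2SeqEncoder,
                 backward_contextualizer: Optional[Seq2SeqEncoder] = None) -> None:
        super().__init__()
        self._forward_contextualizer = forward_contextualizer
        self._backward_contextualizer = backward_contextualizer

    def _contextualize(self, embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        forward_embeddings = self._forward_contextualizer(embeddings, mask)
        if self._backward_contextualizer is None:
            return forward_embeddings
        backward_embeddings = self._backward_contextualizer(embeddings, mask)
        # Concatenate the backward contextual embeddings to the forward ones.
        return torch.cat([forward_embeddings, backward_embeddings], dim=-1)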

@nelson-liu
Contributor Author

FYI @brendan-ai2, I was able to retrain the model at /~https://github.com/allenai/allennlp/blob/master/training_config/constituency_parser_transformer_elmo.jsonnet with these changes, so it seems backwards compatibility is maintained.

self._backward_contextualizer(embeddings, mask))
# Concatenate the backward contextual embeddings to the
# forward contextual embeddings
if (isinstance(contextual_embeddings, list) and
Contributor Author

@brendan-ai2 I'm struggling to test this case (where contextual_embeddings and backward_contextual_embeddings are lists), since I'm not sure when they would return lists :) Any pointers?

Contributor

Blargh. :/ The contextualizer can be made to return all layers for embedding.

/~https://github.com/allenai/allennlp/blob/master/allennlp/modules/token_embedders/language_model_token_embedder.py#L61

There basically isn't a real API in place right now as this is specific to the transformer contextualizer. We definitely need to figure out how this plays with having more general contextualizers... It is an important feature.

Contributor Author

Yeah, it seems hard to get the per-layer outputs of a multi-layer PyTorch LSTM (or other contextualizers) in a flexible fashion...

Contributor

Off the top of my head, a solution would be to modify /~https://github.com/allenai/allennlp/blob/master/allennlp/modules/seq2seq_encoders/seq2seq_encoder.py#L5 so that it takes a constructor arg "return_all_layers" (like lazy in the DatasetReader) and then have the subclasses do the appropriate thing. Maybe also add forward as an explicit method in Seq2SeqEncoder in order to clearly document the behavior and the type.
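For concreteness, a rough sketch of that shape (purely illustrative names and signatures, not the library's current API):

from typing import List, Optional, Union

import torch


class Seq2SeqEncoderSketch(torch.nn.Module):
    """
    Hypothetical Seq2SeqEncoder that accepts ``return_all_layers`` in its
    constructor, analogous to ``lazy`` on ``DatasetReader``.
    """

    def __init__(self, return_all_layers: bool = False) -> None:
        super().__init__()
        self.return_all_layers = return_all_layers

    def forward(self,
                inputs: torch.Tensor,                 # (batch, timesteps, input_dim)
                mask: Optional[torch.Tensor] = None   # (batch, timesteps)
                ) -> Union[torch.Tensor, List[torch.Tensor]]:
        # Subclasses return a single (batch, timesteps, output_dim) tensor, or,
        # when ``return_all_layers`` was set, a list with one such tensor per layer.
        raise NotImplementedError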

Contributor Author

I think the main trouble is that the PyTorch LSTM doesn't return the outputs for all layers and all timesteps?
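(For context: torch.nn.LSTM returns only the top layer's output sequence plus the final hidden/cell state of each layer, so getting every layer at every timestep means stacking single-layer LSTMs by hand. A workaround sketch, not library code:)

from typing import List

import torch


def run_lstm_per_layer(inputs: torch.Tensor,
                       lstms: torch.nn.ModuleList) -> List[torch.Tensor]:
    # Feed each single-layer LSTM the previous layer's outputs and keep
    # every layer's full output sequence.
    outputs = inputs
    per_layer_outputs = []
    for lstm in lstms:
        outputs, _ = lstm(outputs)
        per_layer_outputs.append(outputs)
    return per_layer_outputs


# Usage with illustrative sizes: three layers, input dim 16, hidden dim 32.
lstms = torch.nn.ModuleList(
    [torch.nn.LSTM(16 if i == 0 else 32, 32, batch_first=True) for i in range(3)]
)
layers = run_lstm_per_layer(torch.randn(4, 10, 16), lstms)  # three (4, 10, 32) tensors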

Contributor

Excellent point, @matt-gardner.

@nelson-liu, I should elaborate a bit. The pure object-oriented approach would be to have a MultiLayerSeq2SeqEncoder that subclasses the existing Seq2SeqEncoder. Annoyingly, this would break anyone who's using a vanilla Seq2SeqEncoder currently. Another issue is that when many of these features are independent, there is a risk of the class hierarchy becoming very deep, resulting in, say, MultiLayerSeq2SeqEncoderWithFooAndBar, MultiLayerSeq2SeqEncoderWithFooAndNotBar, etc.

There are a few ways to guard against this. First, one can make judicious use of defaults -- at the risk of weakening (if not breaking) the abstraction, as Matt points out. Another option would be to use mixins, though that would also break existing users. You can also adopt a sort of manual approach. This might involve exposing is_multi_layer for user code to query. This is ugly, but it can be quite flexible and aid backwards compatibility.
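To illustrate the last (manual) option -- all names here are hypothetical, not an endorsement of the approach:

from typing import List

import torch


class MultiLayerEncoderSketch(torch.nn.Module):
    """A toy encoder that advertises per-layer outputs via a capability flag."""

    def is_multi_layer(self) -> bool:
        return True

    def forward(self, inputs: torch.Tensor) -> List[torch.Tensor]:
        # Two trivial "layers" so the sketch actually runs.
        first = torch.tanh(inputs)
        second = torch.tanh(first)
        return [first, second]


def gather_layers(encoder: torch.nn.Module, inputs: torch.Tensor) -> List[torch.Tensor]:
    # User code queries the capability instead of relying on the class hierarchy,
    # degrading gracefully for encoders that predate the feature.
    if getattr(encoder, "is_multi_layer", lambda: False)():
        return encoder(inputs)
    return [encoder(inputs)]


layers = gather_layers(MultiLayerEncoderSketch(), torch.randn(2, 5, 8))  # two tensors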

Contributor

Let's figure out how/whether to obtain the extra layers and then proceed with caution as Matt suggests. :)

Contributor

Sorry, I missed this mention. We don't need to handle the case that the contextualizer returns multiple layers here, because it will never do that during training.

I think the problem here really stems from the fact that we have been passing around the return_all_layers argument as a constructor parameter rather than as a runtime forward parameter (or as a separate method on a Contextualizer subclass of Seq2SeqEncoder).

Is the right API here something like this:

class Seq2SeqEncoder:
    ...


class Contextualizer(Seq2SeqEncoder):

    # Idea 1:
    def forward(self,
                sequence: torch.Tensor,  # (batch, sequence, embedding)
                return_all_layers: bool = False
                ) -> Union[torch.Tensor, List[torch.Tensor]]:
        ...

    # This way all Contextualizers can still function as `Seq2SeqEncoders` by default,
    # which is useful for training and perhaps for downstream use, say if you wanted to
    # fine-tune an encoder and didn't want to do an elmo mixture. However, all
    # `ContextualTokenEmbedders` could call `Contextualizer.forward(return_all_layers=True)`.

    # Second idea:
    def forward(self, sequence: torch.Tensor) -> torch.Tensor:
        ...

    def get_layers(self, sequence: torch.Tensor) -> List[torch.Tensor]:
        ...

    # A separate method which basically implements the functionality above.
    # This would get around the type problems of `forward` possibly returning lists
    # of tensors when used as a `Seq2SeqEncoder`.

This doesn't handle Nelson's case, where the contextualizers might be stateful during training. I haven't thought hard about that because I'm not 100% convinced we need to support it as part of a concrete API yet, but I'm happy to be convinced!

Contributor Author (@nelson-liu, Feb 14, 2019)

more thoughts later, but:

This doesn't handle Nelson's case, where the contextualizers might be stateful during training. I haven't thought hard about that because I'm not 100% convinced we need to support it as part of a concrete API yet, but I'm happy to be convinced!

How else would you do language modeling of contiguous text? At the very least, it'd be useful to add the (left-to-right) LM functionality to AllenNLP so people can easily train on contiguous-text datasets like the PTB or wikitext. I think having the bidirectional contiguous text LM would be useful as well, since I'm actually curious to run the experiment and find out whether contiguous text matters (I suspect it does, or at least doesn't hurt). Bidirectionality is definitely crucial to getting the best contextual representations.
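(To spell out the statefulness I have in mind: it's the usual truncated-BPTT setup for contiguous text, where the hidden state is carried across consecutive chunks and detached so gradients stay within a chunk. A minimal sketch, not this PR's code:)

import torch
import torch.nn.functional as F

# Contiguous text split into consecutive chunks; row b of chunk i + 1
# continues exactly where row b of chunk i stopped.
vocab_size, embedding_dim, hidden_dim = 100, 32, 64
embedder = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
projection = torch.nn.Linear(hidden_dim, vocab_size)

chunks = [torch.randint(vocab_size, (8, 35)) for _ in range(3)]  # stand-in data
state = None
for chunk in chunks:
    inputs, targets = chunk[:, :-1], chunk[:, 1:]
    outputs, state = lstm(embedder(inputs), state)
    loss = F.cross_entropy(
        projection(outputs).reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()  # optimizer step omitted for brevity
    # Keep the state for the next contiguous chunk, but cut the graph so
    # backprop is truncated at chunk boundaries.
    state = tuple(s.detach() for s in state)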

Contributor

Would #2716 help here?

@brendan-ai2
Contributor

Cool! To double-check: when you retrained the constituency parser, which trained LM did you use to perform the embedding? Ideally it would be the old one I linked you previously.

Also, fyi, my responses will be slow/sporadic this week due to AAAI.

@nelson-liu
Contributor Author

Cool! To double-check: when you retrained the constituency parser, which trained LM did you use to perform the embedding? Ideally it would be the old one I linked you previously.

Yup, I used the old one.

Also, fyi, my responses will be slow/sporadic this week due to AAAI.

No worries, have a good time!

@brendan-ai2
Contributor

@nelson-liu, just wanted to check in on this and make sure I'm not blocking you. Are you still proceeding with this approach?

Contributor
@DeNeutoy left a comment

Sorry it took so long to add my comments to this; I missed the first mention!

@@ -74,6 +75,23 @@ class LanguageModel(Model):
    contextualizer: ``Seq2SeqEncoder``
        Used to "contextualize" the embeddings. As described above,
        this encoder must not cheat by peeking ahead.

        .. deprecated:: 0.8.2
Contributor

Question: to me this seems like we are ripping out quite a good API for one which is more complicated. We already have a trained BidirectionalLanguageModel which uses a bidirectional transformer and is very useful to many people (e.g. GitHub issues / Swabha is using it in her research), which makes me unsure that deprecating a key component of it is a good idea. Do we know that bidirectional language modeling of contiguous text is worth it? I can't imagine a downstream task where you actually need unlimited context (and even if there was one, how would that practically work?). Is it worth ripping this out into a separate repo, making the changes, and double-checking that contiguous text does something useful?

Additionally, we now have a BidirectionalLanguageModel subclass which is subsumed by this code (I think). That's a bad code smell to me: a subclass which only differs in constructor arguments, with only a single example of how those arguments might differ (it triggers my MultiClassMultipleChoiceMemoryNetwork(MemoryNetwork, MultiClass, MultipleChoice) sense from DeepQA).

Contributor

Leaving the API decision to others, here's some motivation for enabling modeling of longer contexts: https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html. They basically make a stateful transformer, showing pretty big gains.


@matt-gardner
Contributor

@nelson-liu, how do you want to proceed with this?

@nelson-liu
Contributor Author

I plan on coming back to this eventually, just got caught up with some other stuff...

@matt-gardner
Contributor

@nelson-liu, is now a good time to come back to this?

@DeNeutoy
Contributor

Closing due to inactivity and the Age Of The Transformer, which makes this PR not super important.

@DeNeutoy DeNeutoy closed this Nov 19, 2019