This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

[WIP] Language Modeling of Contiguous Text #2414

Closed
wants to merge 19 commits

Conversation

@nelson-liu (Contributor) commented Jan 22, 2019

This PR enables language models with stateful encoders to properly work on contiguous text datasets (e.g., BookCorpus).

To Do:

  • Make LanguageModelingReader actually usable
  • Test LanguageModelingReader
  • Create special LanguageModelingIterator (or some other name) that turns off shuffling and also constrains the batch size to 1.
  • Modify LanguageModel to handle instances from LanguageModelingReader
  • Add bidirectionality to LanguageModelingReader and LanguageModel
  • Test LanguageModel

@nelson-liu (Contributor Author) commented Jan 22, 2019

@matt-gardner / @brendan-ai2: one annoying thing is that tensorizing the instances from the dataset reader gives you an extra singleton batch dimension. Is it reasonable to just remove that in the iterator? (I think you can, but I'm not positive.)

@matt-gardner
Copy link
Contributor

I would say yes, it's reasonable to remove that extra dimension in the iterator, which is another strong motivation for having a language-modeling-specific iterator.
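For concreteness, here is a minimal sketch of what such an iterator could look like, written against plain dictionaries of tensors rather than the actual Instance/Iterator API in this PR; the function name and the assumption that each tensorized instance carries a leading singleton batch dimension are illustrative only.

from typing import Dict, Iterable, Iterator

import torch


def language_modeling_iterator(
        tensorized_instances: Iterable[Dict[str, torch.Tensor]]
) -> Iterator[Dict[str, torch.Tensor]]:
    # Yield the pre-batched instances one at a time, in document order:
    # no shuffling and an effective batch size of 1.
    for tensor_dict in tensorized_instances:
        # Each instance is already a full batch, so tensorization leaves an
        # extra leading singleton dimension; squeeze it away here.
        yield {name: tensor.squeeze(0) if tensor.dim() > 1 and tensor.size(0) == 1
               else tensor
               for name, tensor in tensor_dict.items()}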

@brendan-ai2 (Contributor):

Please ping me when you're ready for a review on this.

@nelson-liu (Contributor Author):

Thanks @brendan-ai2, will do! It still needs a bit of cleanup.

@nelson-liu (Contributor Author):

Aside: I'm pretty sure the bidirectional contiguous text setting won't be correct until #2373 is fixed.

@matt-gardner (Contributor):

Does bidirectional contiguous text work at all? I don't think you can train both directions at the same time in a stateful way, right?

@nelson-liu (Contributor Author) commented Jan 24, 2019

Right, it's currently broken because we need different contextualizers (with different parameters) for the forward and backward directions. But that's what #2373 should fix, no? Or am I misunderstanding?

@matt-gardner (Contributor) commented Jan 24, 2019

The whole point of your dataset reader is to get contiguous chunks that let you see one batch after another of streaming text, in order. But it's in order only for one direction. A stateful backwards LSTM would do the wrong thing (even with #2373 fixed) if it sees forward-streaming batches.

@nelson-liu (Contributor Author):

Ah, I see what you mean now. I guess this could work if you have separate inputs for the forward and backward directions (i.e., forward starts from the first index and backward from the last)?

@matt-gardner (Contributor):

That would still break a bidirectional encoder (because the backward LSTM on the forward input would be wrong, and the forward LSTM on the backward input would be wrong). This method of training is only suitable for a single direction at a time.

@nelson-liu (Contributor Author) commented Jan 24, 2019

Talked with @matt-gardner on Slack; it seems like the reasonable fix is to have the model take two inputs and two targets (one of each per direction). The model would use a shared embedding layer for the inputs and two different contextualizers (one for the forward direction, one for the backward direction).
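A rough PyTorch sketch of that layout (shared embedding, two contextualizers, per-direction targets); the class and argument names are made up for illustration, and the stateful hidden-state handling is omitted:

import torch
import torch.nn as nn


class BidirectionalContiguousLM(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # shared between directions
        self.forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.projection = nn.Linear(hidden_dim, vocab_size)    # shared softmax layer
        self.loss = nn.CrossEntropyLoss()

    def forward(self,
                forward_ids: torch.Tensor, forward_targets: torch.Tensor,
                backward_ids: torch.Tensor, backward_targets: torch.Tensor) -> torch.Tensor:
        # Two inputs and two targets, one pair per direction.
        forward_hidden, _ = self.forward_lstm(self.embedding(forward_ids))
        backward_hidden, _ = self.backward_lstm(self.embedding(backward_ids))
        forward_loss = self.loss(self.projection(forward_hidden).flatten(0, 1),
                                 forward_targets.flatten())
        backward_loss = self.loss(self.projection(backward_hidden).flatten(0, 1),
                                  backward_targets.flatten())
        return forward_loss + backward_loss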

@matt-peters (Contributor):

This is the way the original ELMo training code works. There are two separate LSTMs (one for each direction), but a shared softmax layer and shared word embeddings. The data generator creates batches in exactly this manner, by returning both forward and reverse ids.
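As an illustration of "returning both forward and reverse ids" (a sketch, not the actual ELMo data generator), one batch could be built along these lines:

import torch


def make_bidirectional_batch(token_ids: torch.Tensor):
    # token_ids: a [batch, seq_len] block of token ids.
    # The forward direction predicts the next token; the backward direction
    # reads the reversed sequence and predicts its next token (i.e. the previous one).
    forward_inputs, forward_targets = token_ids[:, :-1], token_ids[:, 1:]
    reversed_ids = torch.flip(token_ids, dims=[1])
    backward_inputs, backward_targets = reversed_ids[:, :-1], reversed_ids[:, 1:]
    return forward_inputs, forward_targets, backward_inputs, backward_targets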

@brendan-ai2 (Contributor):

@nelson-liu generously took the time to explain some of the delicate points of contiguous language modeling to me. Seems like we're all in agreement on having multiple contextualizers and targets.

# Read the contents of the file into one long list of tokens,
# adding start and/or end tokens as necessary.
file_tokens = []
file_tokens.extend(self._start_tokens)
Contributor:
Shouldn't this be in the for loop?

Contributor Author:
Yup, you're correct (sorry, I haven't looked at this PR in a while). In general, though, language modeling of contiguous text doesn't use any start tokens and only uses eos tokens (e.g., see /~https://github.com/salesforce/awd-lstm-lm/blob/master/data.py#L34-L54).
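For reference, the linked awd-lstm-lm reader boils down to something like this (simplified, and without the vocabulary bookkeeping): an end-of-sentence token after each line and no start tokens at all.

from typing import List


def read_contiguous_corpus(path: str, eos_token: str = "<eos>") -> List[str]:
    # Read the whole file into one long stream of tokens, appending an
    # end-of-sentence token after each line and adding no start tokens.
    tokens: List[str] = []
    with open(path, "r", encoding="utf-8") as corpus_file:
        for line in corpus_file:
            tokens.extend(line.split())
            tokens.append(eos_token)
    return tokens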

Contributor Author:
Right, but this dataset reader is not for training an LM like ELMo. ELMo is trained on shuffled sentences, which is why it needs a start token for each sentence. This dataset reader is for contiguous text (e.g., books), like the corpora that the OpenAI GPT was trained on or the PTB LM benchmark most folks use. In this setting, people don't typically use start tokens.

Contributor:
Ah, got it, thanks. I assumed that stateful (RNN) language models were always trained in a contiguous fashion.
The shuffling might help generalization on sentence-level tasks (where the model doesn't learn to rely on an initialized hidden state), but I'm wondering how the same model would perform in transfer learning with and without contiguous training.
Thanks!

@bratao (Contributor) commented May 30, 2019

@nelson-liu This is an AWESOME feature and is needed in many scenarios I'm working with.
Do you have plans to merge this PR?

@matt-gardner (Contributor):

@nelson-liu, what's stopping this from getting merged?

@nelson-liu (Contributor Author):

We first have to figure out #2373, which PR #2438 attempts to do. #2438 is blocked by #2438 (comment).

I also think that @rloganiv is working on contiguous-text language modeling, so we should try to avoid duplication of effort.

super().__init__(lazy)
start_tokens: List[str] = None,
end_tokens: List[str] = ["</S>"]) -> None:
super().__init__(lazy=False)
Contributor:
I think setting lazy=False defeats the purpose of using fuzz_truncated_bptt_size=True during training. The point of using random sequence lengths is to prevent batches from ending with the same tokens every epoch, but that won't happen with lazy=False, since _read will only be called once. It probably makes more sense to let users set lazy themselves and to raise an error if fuzz_truncated_bptt_size=True and lazy=False.
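For context, the fuzzing in question is presumably along the lines of what awd-lstm-lm does: each truncated-BPTT window gets a randomly perturbed length, so chunk boundaries fall in different places on every pass over the data. A standalone sketch (the parameter names here are illustrative, not the reader's actual arguments):

import random
from typing import Iterator, List


def fuzzed_bptt_chunks(token_ids: List[int],
                       bptt: int = 35,
                       min_length: int = 5,
                       std: float = 5.0) -> Iterator[List[int]]:
    # Split a contiguous token stream into windows whose lengths are jittered
    # around `bptt`, so the window boundaries differ each time this is called.
    i = 0
    while i < len(token_ids) - 1:
        base = bptt if random.random() < 0.95 else bptt / 2.0
        seq_len = max(min_length, int(random.gauss(base, std)))
        yield token_ids[i:i + seq_len + 1]  # +1 so targets can be shifted by one
        i += seq_len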



@DataIterator.register("language_modeling")
class LanguageModelingIterator(BasicIterator):
@rloganiv (Contributor) commented Jun 17, 2019:
I think it makes sense to give this object a more general name: it is useful for any problem where batching is done in the DatasetReader (e.g., the problem in #2828). Maybe something like StraightThroughIterator or TrivialIterator?

@rloganiv mentioned this pull request on Jun 27, 2019
@bratao (Contributor) commented Sep 2, 2019

@nelson-liu, sorry to ping you about this again, but I'm planning an upcoming research project that will need language modeling of contiguous text. Do you plan to resume this?

@nelson-liu (Contributor Author):

Sorry for dropping the ball on this. Yes, I would like to have this working at some point. I know @rloganiv was also working on this at some point, and @matt-gardner recently merged some LM things. I'm traveling until mid-September, but I'm hoping to make some progress on it afterwards.

@rloganiv (Contributor) commented Sep 4, 2019

@bratao This PR is currently blocked on the thorny issue of bidirectionality (see #2438). If you only need a forward generative LM, then you should be able to use the existing LanguageModel implementation, with the following caveats (a rough training sketch follows below):

  1. You will need to set stateful=True for the contextualizer
  2. Your dataset reader will need to handle batching (you could use the LanguageModelingReader in this PR), and you will need to use the PassThroughIterator as your iterator.

Hope this helps!
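To make the caveats above concrete, here is a plain-PyTorch sketch (not the AllenNLP LanguageModel itself) of what statefulness buys you for a single-direction LM: batches arrive in document order, and the recurrent state is carried across them and detached at each truncated-BPTT boundary. All names and sizes are illustrative.

from typing import Iterable, Tuple

import torch
import torch.nn as nn


def train_forward_lm(batches: Iterable[Tuple[torch.Tensor, torch.Tensor]],
                     vocab_size: int = 10000,
                     embed_dim: int = 128,
                     hidden_dim: int = 256) -> None:
    embedding = nn.Embedding(vocab_size, embed_dim)
    lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
    projection = nn.Linear(hidden_dim, vocab_size)
    loss_fn = nn.CrossEntropyLoss()
    parameters = (list(embedding.parameters()) + list(lstm.parameters())
                  + list(projection.parameters()))
    optimizer = torch.optim.Adam(parameters)

    hidden = None  # carried across batches because the text is contiguous
    for inputs, targets in batches:  # (input ids, target ids), in document order
        optimizer.zero_grad()
        output, hidden = lstm(embedding(inputs), hidden)
        loss = loss_fn(projection(output).flatten(0, 1), targets.flatten())
        loss.backward()
        optimizer.step()
        # Detach so gradients stop at the truncated-BPTT boundary.
        hidden = tuple(state.detach() for state in hidden)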

@DeNeutoy (Contributor):

This PR has been open since January; I'm declaring it done. Re-open it if someone wants to fix it.

@DeNeutoy closed this on Nov 19, 2019