[WIP] Language Modeling of Contiguous Text #2414
Conversation
@matt-gardner / @brendan-ai2: one annoying thing is that tensorizing the instances from the dataset reader would give you an extra singleton batch dimension. Is it reasonable to just remove that in the iterator (I think you can, but I'm not positive)?
I would say yes, it's reasonable to remove that extra dimension in the iterator, which is another strong motivation for having a language-modeling-specific iterator.
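For illustration, here is a minimal sketch of what stripping that singleton dimension could look like. The field name, shapes, and helper are hypothetical, not this PR's actual code.

```python
# Hypothetical sketch: each Instance from the reader is already a whole batch,
# so tensorizing it adds a leading dimension of size 1. An iterator can simply
# squeeze that dimension away before yielding tensors to the model.
from typing import Dict
import torch


def remove_singleton_batch_dim(tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Drop the extra size-1 batch dimension added when a single Instance
    (which already contains a full batch) is tensorized."""
    return {name: tensor.squeeze(0) for name, tensor in tensor_dict.items()}


# A (1, batch_size, sequence_length) id tensor becomes (batch_size, sequence_length).
tensors = {"input_tokens": torch.randint(0, 100, (1, 32, 35))}
tensors = remove_singleton_batch_dim(tensors)
assert tensors["input_tokens"].shape == (32, 35)
```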
Please ping me when you're ready for a review on this.
Thanks @brendan-ai2, will do! Still needs a bit of cleanup.
Aside: I'm pretty sure the bidirectional contiguous text setting won't be correct until #2373 is fixed.
Does bidirectional contiguous text work at all? I don't think you can train both directions at the same time in a stateful way, right?
Right, it's currently broken because we need different contextualizers (different parameters) for the forward and backward directions. But this is what #2373 should fix, no? Or am I misunderstanding?
The whole point of your dataset reader is to get contiguous chunks that let you see one batch after another of streaming text, in order. But it's in order only for one direction. A stateful backwards LSTM would do the wrong thing (even with #2373 fixed) if it sees forward-streaming batches.
Ah, I see what you mean now. I guess this could work if you have separate inputs for the forward and backward directions (i.e., forward starts from the first index, and backward from the last)?
That would still break a bidirectional encoder (because the backward LSTM on the forward input would be wrong, and the forward LSTM on the backward input would be wrong). This method of training is only suitable for a single direction at a time.
Talked with @matt-gardner on Slack; it seems like the reasonable fix would be to have the model take two inputs (one per direction) and two targets (one per direction). Furthermore, the model would use a shared embedding layer for the inputs and two different contextualizers (one forward-direction, one backward-direction).
This is the way the original ELMo training code works. There are two separate LSTMs (one for each direction), but a shared softmax layer and word embeddings. The data generator creates batches in exactly this manner, by returning both forward and reverse ids.
@nelson-liu generously took the time to explain some of the delicate points of contiguous language modeling to me. Seems like we're all in agreement on having multiple contextualizers and targets.
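To make the agreed-upon design concrete, here is a rough sketch of a model with a shared embedding, two direction-specific contextualizers, a shared softmax, and per-direction targets. The class name, shapes, and layers are illustrative assumptions, not this PR's actual code, and statefulness across batches is omitted for brevity.

```python
import torch
import torch.nn as nn


class BidirectionalContiguousLM(nn.Module):
    """Illustrative sketch: shared embeddings, separate forward/backward
    contextualizers, and a shared softmax layer, in the spirit of the
    original ELMo training setup described above."""

    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)   # shared
        self.forward_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.softmax = nn.Linear(hidden_dim, vocab_size)            # shared
        self.loss = nn.CrossEntropyLoss()

    def forward(self,
                forward_ids: torch.Tensor,    # tokens in reading order
                backward_ids: torch.Tensor,   # tokens in reversed order
                forward_targets: torch.Tensor,
                backward_targets: torch.Tensor) -> torch.Tensor:
        fwd_hidden, _ = self.forward_lstm(self.embedding(forward_ids))
        bwd_hidden, _ = self.backward_lstm(self.embedding(backward_ids))
        # CrossEntropyLoss expects (batch, vocab, sequence) logits.
        fwd_loss = self.loss(self.softmax(fwd_hidden).transpose(1, 2), forward_targets)
        bwd_loss = self.loss(self.softmax(bwd_hidden).transpose(1, 2), backward_targets)
        return fwd_loss + bwd_loss


# Toy usage: the backward stream is just the reversed forward stream,
# mirroring how the ELMo data generator returns both forward and reverse ids.
model = BidirectionalContiguousLM(vocab_size=100, embedding_dim=16, hidden_dim=32)
forward_ids = torch.randint(0, 100, (4, 10))
backward_ids = torch.flip(forward_ids, dims=[1])
loss = model(forward_ids, backward_ids,
             forward_targets=torch.randint(0, 100, (4, 10)),
             backward_targets=torch.randint(0, 100, (4, 10)))
```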
# Read the contents of the file into one long list of tokens,
# adding start and/or end tokens as necessary.
file_tokens = []
file_tokens.extend(self._start_tokens)
Shouldn't this be in the for loop?
Yup, you're correct (sorry I haven't looked at this PR in a while). In general, though, language modeling of contiguous text doesn't use any start tokens and only uses EOS tokens (e.g., see /~https://github.com/salesforce/awd-lstm-lm/blob/master/data.py#L34-L54).
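For context, here is a minimal sketch of that style of preprocessing, in the spirit of the linked awd-lstm-lm code rather than a copy of it: the corpus becomes one long token stream with only an end-of-sentence marker appended to each line, and no start tokens. The marker string and function name are illustrative.

```python
from typing import List

EOS_TOKEN = "<eos>"  # marker name is an assumption for illustration


def read_contiguous_corpus(path: str) -> List[str]:
    """Read a corpus file into one long list of tokens, appending an
    end-of-sentence token after each line and adding no start tokens."""
    tokens: List[str] = []
    with open(path, "r", encoding="utf-8") as corpus_file:
        for line in corpus_file:
            tokens.extend(line.split())
            tokens.append(EOS_TOKEN)
    return tokens
```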
Yeah, although AFAIK ELMo actually used start tokens (/~https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#notes-on-statefulness-and-non-determinism).
Right, but this dataset reader is not for training an LM like ELMo. ELMo is trained on shuffled sentences, which is why it needs a start token for each sentence. This dataset reader is for contiguous text (e.g., books), like the corpora the OpenAI GPT was trained on or the PTB LM benchmark most folks use. In this setting, people don't typically use start tokens.
Ah, got it, thanks -- I assumed that stateful (RNN) language models were always trained in a contiguous fashion.
The shuffling bit might help generalization performance for sentence-level tasks (where the model doesn't learn to rely on an initialized hidden state), but I'm wondering how the same model with and without contiguous training would perform in transfer learning.
Thanks!
@nelson-liu This is an AWESOME feature and is needed in many scenarios I'm working with.
@nelson-liu, what's stopping this from getting merged?
So we first have to figure out #2373, which PR #2438 attempts to do. #2438 is blocked by #2438 (comment). I also think that @rloganiv is working on contiguous-text language modeling, so we should try to avoid duplication of effort.
super().__init__(lazy)
start_tokens: List[str] = None,
end_tokens: List[str] = ["</S>"]) -> None:
super().__init__(lazy=False)
I think setting lazy=False defeats the purpose of using fuzz_truncated_bppt_size=True during training. The point of using random sequence lengths is to prevent batches from ending with the same tokens each epoch; however, this will not happen if lazy=False, since _read will only be called once. It probably makes more sense to let users set lazy themselves and raise an error if fuzz_truncated_bppt_size=True and lazy=False.
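A sketch of the suggested check, using the parameter names from the comment above in an abbreviated stand-in constructor (not the PR's actual class):

```python
from allennlp.common.checks import ConfigurationError


class ContiguousLanguageModelingReader:  # abbreviated stand-in for the PR's reader
    def __init__(self, lazy: bool = False, fuzz_truncated_bppt_size: bool = True) -> None:
        if fuzz_truncated_bppt_size and not lazy:
            # With lazy=False, _read runs once, so the randomly fuzzed
            # sequence lengths are frozen and every epoch sees identical batches.
            raise ConfigurationError(
                "fuzz_truncated_bppt_size=True requires lazy=True so that "
                "sequence lengths are re-randomized each epoch."
            )
        self.lazy = lazy
        self.fuzz_truncated_bppt_size = fuzz_truncated_bppt_size
```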
@DataIterator.register("language_modeling")
class LanguageModelingIterator(BasicIterator):
I think it makes sense to give this object a more general name, since it is useful for any problem where batching is done in the DatasetReader (e.g., the problem in #2828). Maybe something like StraightThroughIterator or TrivialIterator?
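Whatever the name ends up being, the behavior is simple enough to sketch in a few lines. This illustrative version is deliberately decoupled from the DataIterator API and is not the PR's implementation:

```python
from typing import Any, Iterable, Iterator


class StraightThroughIterator:
    """Illustrative sketch (using one of the names suggested above): yield
    instances exactly as the dataset reader produced them, with no shuffling
    and no re-batching, because each instance already represents a full batch."""

    def __call__(self, instances: Iterable[Any], num_epochs: int = 1) -> Iterator[Any]:
        # Materialize once so multiple epochs are possible; a lazy variant
        # would instead re-read from the dataset reader each epoch.
        instance_list = list(instances)
        for _ in range(num_epochs):
            yield from instance_list
```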
@nelson-liu sorry to ping you about this again, but I'm planning an upcoming research project that will need language modeling of contiguous text. Do you plan to resume this?
Yes, I would like to have this working at some point. I know @rloganiv was also working on this at some point, and @matt-gardner recently merged some LM things. I'm traveling until mid-September, but I'm hoping to make some progress on it afterwards.
@bratao This PR is currently blocked due to the thorny issue of bi-directionality (see #2438). If you only need a forward generative LM then you should be able to use the existing
Hope this helps!
This PR has been open since January; I'm declaring it done. Re-open if someone wants to fix it.
This PR enables language models with stateful encoders to work properly on contiguous text datasets (e.g., BookCorpus).
To Do:
- Make the LanguageModelingReader actually usable
- LanguageModelingReader
- Add a LanguageModelingIterator (or some other name) that turns off shuffling and also constrains the batch size to 1.
- Update the LanguageModel to handle instances from the LanguageModelingReader
- LanguageModelingReader and LanguageModel
- LanguageModel