[WIP] Language Modeling of Contiguous Text #2414
Conversation
@matt-gardner / @brendan-ai2: one annoying thing is that tensorizing the instances from the dataset reader would give you an extra singleton batch dimension. Is it reasonable to just remove that in the iterator (I think you can, but I'm not positive)?
I would say yes, it's reasonable to remove that extra dimension in the iterator, which is another strong motivation for having a language-modeling-specific iterator.
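For illustration, here is a minimal sketch of what stripping that singleton dimension could look like. The field name, shapes, and helper are hypothetical, not this PR's actual code.

```python
# Hypothetical sketch: each Instance from the reader is already a whole batch,
# so tensorizing it adds a leading dimension of size 1. An iterator can simply
# squeeze that dimension away before yielding tensors to the model.
from typing import Dict
import torch


def remove_singleton_batch_dim(tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Drop the extra size-1 batch dimension added when a single Instance
    (which already contains a full batch) is tensorized."""
    return {name: tensor.squeeze(0) for name, tensor in tensor_dict.items()}


# A (1, batch_size, sequence_length) id tensor becomes (batch_size, sequence_length).
tensors = {"input_tokens": torch.randint(0, 100, (1, 32, 35))}
tensors = remove_singleton_batch_dim(tensors)
assert tensors["input_tokens"].shape == (32, 35)
```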
Please ping me when you're ready for a review on this.
Thanks @brendan-ai2, will do! Still needs a bit of cleanup.
Aside: I'm pretty sure the bidirectional contiguous text setting won't be correct until #2373 is fixed.
Does bidirectional contiguous text work at all? I don't think you can train both directions at the same time in a stateful way, right?
Right, it's currently broken because we need different contextualizers (different parameters) for the forward and backward directions. But this is what #2373 should fix, no? Or am I misunderstanding?
The whole point of your dataset reader is to get contiguous chunks that let you see one batch after another of streaming text, in order. But it's in order only for one direction. A stateful backwards LSTM would do the wrong thing (even with #2373 fixed) if it sees forward-streaming batches.
Ah, I see what you mean now. I guess this could work if you have separate inputs for the forward and backward directions (i.e., forward starts from the first index, and backward from the last)?
That would still break a bidirectional encoder (because the backward LSTM on the forward input would be wrong, and the forward LSTM on the backward input would be wrong). This method of training is only suitable for a single direction at a time.
Talked with @matt-gardner on Slack; it seems like the reasonable fix would be to have the model take two inputs (one per direction) and two targets (one per direction). Furthermore, the model would use a shared embedding layer for the inputs and two different contextualizers (one forward-direction, one backward-direction).
This is the way the original ELMo training code works. There are two separate LSTMs (one for each direction), but a shared softmax layer and word embeddings. The data generator creates batches in exactly this manner, by returning both forward and reverse ids.
@nelson-liu generously took the time to explain some of the delicate points of contiguous language modeling to me. Seems like we're all in agreement on having multiple contextualizers and targets.
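To make the agreed-upon design concrete, here is a rough sketch of a model with a shared embedding, two direction-specific contextualizers, a shared softmax, and per-direction targets. The class name, shapes, and layers are illustrative assumptions, not this PR's actual code, and statefulness across batches is omitted for brevity.

```python
import torch
import torch.nn as nn


class BidirectionalContiguousLM(nn.Module):
    """Illustrative sketch: shared embeddings, separate forward/backward
    contextualizers, and a shared softmax layer, in the spirit of the
    original ELMo training setup described above."""

    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)   # shared
        self.forward_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.backward_lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.softmax = nn.Linear(hidden_dim, vocab_size)            # shared
        self.loss = nn.CrossEntropyLoss()

    def forward(self,
                forward_ids: torch.Tensor,    # tokens in reading order
                backward_ids: torch.Tensor,   # tokens in reversed order
                forward_targets: torch.Tensor,
                backward_targets: torch.Tensor) -> torch.Tensor:
        fwd_hidden, _ = self.forward_lstm(self.embedding(forward_ids))
        bwd_hidden, _ = self.backward_lstm(self.embedding(backward_ids))
        # CrossEntropyLoss expects (batch, vocab, sequence) logits.
        fwd_loss = self.loss(self.softmax(fwd_hidden).transpose(1, 2), forward_targets)
        bwd_loss = self.loss(self.softmax(bwd_hidden).transpose(1, 2), backward_targets)
        return fwd_loss + bwd_loss


# Toy usage: the backward stream is just the reversed forward stream,
# mirroring how the ELMo data generator returns both forward and reverse ids.
model = BidirectionalContiguousLM(vocab_size=100, embedding_dim=16, hidden_dim=32)
forward_ids = torch.randint(0, 100, (4, 10))
backward_ids = torch.flip(forward_ids, dims=[1])
loss = model(forward_ids, backward_ids,
             forward_targets=torch.randint(0, 100, (4, 10)),
             backward_targets=torch.randint(0, 100, (4, 10)))
```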
# Read the contents of the file into one long list of tokens,
# adding start and/or end tokens as necessary.
file_tokens = []
file_tokens.extend(self._start_tokens)
Shouldn't this be in the for loop?
Yup, you're correct (sorry I haven't looked at this PR in a while). In general, though, language modeling of contiguous text doesn't use any start tokens and only uses EOS tokens (e.g., see /~https://github.com/salesforce/awd-lstm-lm/blob/master/data.py#L34-L54).
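For context, here is a minimal sketch of that style of preprocessing, in the spirit of the linked awd-lstm-lm code rather than a copy of it: the corpus becomes one long token stream with only an end-of-sentence marker appended to each line, and no start tokens. The marker string and function name are illustrative.

```python
from typing import List

EOS_TOKEN = "<eos>"  # marker name is an assumption for illustration


def read_contiguous_corpus(path: str) -> List[str]:
    """Read a corpus file into one long list of tokens, appending an
    end-of-sentence token after each line and adding no start tokens."""
    tokens: List[str] = []
    with open(path, "r", encoding="utf-8") as corpus_file:
        for line in corpus_file:
            tokens.extend(line.split())
            tokens.append(EOS_TOKEN)
    return tokens
```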
Yeah, although AFAIK ELMo actually used start tokens (/~https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md#notes-on-statefulness-and-non-determinism).
Right, but this dataset reader is not for training an LM like ELMo. ELMo is trained on shuffled sentences, which is why it needs a start token for each sentence. This dataset reader is for contiguous text (e.g., books), like the corpora the OpenAI GPT was trained on or the PTB LM benchmark most folks use. In this setting, people don't typically use start tokens.
Ah, got it, thanks -- I assumed that stateful (RNN) language models were always trained in a contiguous fashion.
The shuffling bit might help generalization performance for sentence-level tasks (where the model doesn't learn to rely on an initialized hidden state), but I'm wondering how the same model with and without contiguous training would perform in transfer learning.
Thanks!
@nelson-liu This is an AWESOME feature and is needed in many scenarios I'm working with.
@nelson-liu, what's stopping this from getting merged?
So we first have to figure out #2373, which PR #2438 attempts to do. #2438 is blocked by #2438 (comment). I also think that @rloganiv is working on contiguous-text language modeling, so we should try to avoid duplication of effort.
super().__init__(lazy)
start_tokens: List[str] = None,
end_tokens: List[str] = ["</S>"]) -> None:
super().__init__(lazy=False)
I think setting lazy=False defeats the purpose of using fuzz_truncated_bppt_size=True during training. The point of using random sequence lengths is to prevent batches from ending with the same tokens each epoch; however, this will not happen if lazy=False, since _read will only be called once. It probably makes more sense to let users set lazy themselves and raise an error if fuzz_truncated_bppt_size=True and lazy=False.
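A sketch of the suggested check, using the parameter names from the comment above in an abbreviated stand-in constructor (not the PR's actual class):

```python
from allennlp.common.checks import ConfigurationError


class ContiguousLanguageModelingReader:  # abbreviated stand-in for the PR's reader
    def __init__(self, lazy: bool = False, fuzz_truncated_bppt_size: bool = True) -> None:
        if fuzz_truncated_bppt_size and not lazy:
            # With lazy=False, _read runs once, so the randomly fuzzed
            # sequence lengths are frozen and every epoch sees identical batches.
            raise ConfigurationError(
                "fuzz_truncated_bppt_size=True requires lazy=True so that "
                "sequence lengths are re-randomized each epoch."
            )
        self.lazy = lazy
        self.fuzz_truncated_bppt_size = fuzz_truncated_bppt_size
```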
@DataIterator.register("language_modeling")
class LanguageModelingIterator(BasicIterator):
I think it makes sense to give this object a more general name, since it is useful for any problem where batching is done in the DatasetReader (e.g., the problem in #2828). Maybe something like StraightThroughIterator or TrivialIterator?
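Whatever the name ends up being, the behavior is simple enough to sketch in a few lines. This illustrative version is deliberately decoupled from the DataIterator API and is not the PR's implementation:

```python
from typing import Any, Iterable, Iterator


class StraightThroughIterator:
    """Illustrative sketch (using one of the names suggested above): yield
    instances exactly as the dataset reader produced them, with no shuffling
    and no re-batching, because each instance already represents a full batch."""

    def __call__(self, instances: Iterable[Any], num_epochs: int = 1) -> Iterator[Any]:
        # Materialize once so multiple epochs are possible; a lazy variant
        # would instead re-read from the dataset reader each epoch.
        instance_list = list(instances)
        for _ in range(num_epochs):
            yield from instance_list
```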
@nelson-liu sorry to ping you about this again, but I'm planning an upcoming research project that will need language modeling of contiguous text. Do you plan to resume this?
Yes, I would like to have this working at some point. I know @rloganiv was also working on this at some point, and @matt-gardner recently merged some LM things. I'm traveling until mid-September, but I'm hoping to make some progress on it afterwards.
@bratao This PR is currently blocked due to the thorny issue of bi-directionality (see #2438). If you only need a forward generative LM then you should be able to use the existing
Hope this helps!
This PR has been open since January; I'm declaring it done. Re-open if someone wants to fix it.
This PR enables language models with stateful encoders to work properly on contiguous text datasets (e.g., BookCorpus).
To Do:
- Make the LanguageModelingReader actually usable
- LanguageModelingReader
- Add a LanguageModelingIterator (or some other name) that turns off shuffling and also constrains the batch size to 1.
- Update the LanguageModel to handle instances from the LanguageModelingReader
- LanguageModelingReader and LanguageModel
- LanguageModel