Fix handling of "datasets_for_vocab_creation" param #4350
Conversation
This is exactly the kind of bug that makes me stay away from Python falsiness in virtually all cases.
)
# Do a quick sanity check here. There's no need to load any datasets if the vocab
# type is "empty".
if datasets_for_vocab_creation is None and vocab_params.get("type") == "empty":
I'm not crazy about this ad hoc solution here, but I don't see another way. At least I added a test for this case to make sure we don't get a regression in the future.
A possible alternative is to build a lazy generator here that doesn't actually call .read() on anything unless it's needed. Not sure how feasible that is, though.
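As a rough sketch of that idea (not the actual code in this PR): wrap the reads in a generator so .read() is only invoked for datasets actually listed in datasets_for_vocab_creation. The function name and the readers/paths dictionaries below are hypothetical placeholders, not AllenNLP's API.

```python
from typing import Dict, Iterable, Iterator


def lazy_instances(
    readers: Dict[str, object],    # hypothetical: a DatasetReader-like object per dataset key
    paths: Dict[str, str],         # hypothetical: a file path per dataset key
    datasets_for_vocab_creation: Iterable[str],
) -> Iterator:
    """Yield instances only from the datasets named for vocab creation."""
    for key in datasets_for_vocab_creation:
        # .read() is only called for datasets we actually need, so passing
        # an empty list means no dataset is ever read.
        yield from readers[key].read(paths[key])
```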
One obvious concern is that it requires changes to datasets_from_params, in ways that change the return type of that method. I'm pretty sure, though, that this is the only use of datasets_from_params, and this function is only ever called on the master process in a distributed setting. In every other configuration, we go through a different code path.
It's also called from the find_learning_rate command, but that's it.
This is another reason why I think readers should always be lazy. Leave the lazy/non-lazy decision up to the dataset.
Yeah, I can see that. Where do you make the lazy / non-lazy decision, then? You have to make a choice once you get to the DataLoader, because of how batching / sampling works. Do you just base the decision of which dataset to use on how you've decided to do data loading?
I was thinking we'd either make that decision part of the DataLoader API, or have a separate configuration for the type of Dataset to use.
I like the first option because lazy vs. non-lazy is literally a data loading issue. The downside, though, is that we'd be breaking from the PyTorch API. But having to separately configure a DatasetReader, Dataset, and DataLoader seems overboard.
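A very rough sketch of what the first option could look like; none of these names are AllenNLP's real API, and this is only meant to illustrate making laziness a DataLoader concern.

```python
# Hypothetical sketch: the DataLoader owns the lazy vs. non-lazy decision.
class MyDataLoader:
    def __init__(self, reader, data_path: str, lazy: bool = False):
        self.reader = reader        # a DatasetReader-like object (hypothetical)
        self.data_path = data_path
        self.lazy = lazy

    def __iter__(self):
        if self.lazy:
            # Stream instances straight from the reader on every pass.
            yield from self.reader.read(self.data_path)
        else:
            # Materialize once, then iterate over the cached list.
            if not hasattr(self, "_cached"):
                self._cached = list(self.reader.read(self.data_path))
            yield from self._cached
```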
Lol, I discovered this some time ago, but I had thought it was on purpose. @epwalsh What is the best way to avoid building vocabulary when I use prebuilt transformers? Should I use …
@mateuszpieniak just using …
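For reference, a sketch of the kind of configuration this exchange points at, combining the two mechanisms discussed in this PR (an "empty" vocabulary type and an empty datasets_for_vocab_creation list). Whether this is the recommended setup for pretrained transformers is an assumption; the fragment is shown as a Python dict and is not a complete config.

```python
# Hedged example only: a fragment of a training configuration.
config_fragment = {
    "datasets_for_vocab_creation": [],   # don't read any dataset for vocab creation
    "vocabulary": {"type": "empty"},     # skip building a vocabulary entirely
}
```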
This fixes a couple of bugs related to the datasets_for_vocab_creation config parameter.
The main bug is that datasets_for_vocab_creation: [] ends up being treated like datasets_for_vocab_creation: null, due to a statement in commands.train.TrainModel.from_partial_objects that relies on truthiness (an empty list is falsy) when it should explicitly check whether the value is None.
The second bug occurs when training.util.vocab_from_params(...) is called. Even if datasets_for_vocab_creation is set to [], all of the datasets will be initialized. This means that with non-lazy readers, you read all of the data unnecessarily.
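To make the falsiness pitfall concrete, here is a minimal, self-contained illustration. The variable names are illustrative, not the actual code in TrainModel.from_partial_objects: with a truthiness-based fallback an empty list silently becomes "all datasets", whereas an explicit is None check preserves the user's choice.

```python
# Minimal illustration of the falsiness bug described above.
all_datasets = {"train": "train-data", "validation": "validation-data"}

datasets_for_vocab_creation = []  # the user explicitly asked for no datasets

# Buggy pattern: [] is falsy, so it silently falls back to all datasets.
buggy = datasets_for_vocab_creation or list(all_datasets)
assert buggy == ["train", "validation"]

# Fixed pattern: only fall back when the parameter was actually omitted (None).
fixed = (
    list(all_datasets)
    if datasets_for_vocab_creation is None
    else datasets_for_vocab_creation
)
assert fixed == []
```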