Fix handling of "datasets_for_vocab_creation" param #4350
Conversation
This is exactly the kind of bug that makes me stay away from Python falsiness in virtually all cases.
)
# Do a quick sanity check here. There's no need to load any datasets if the vocab
# type is "empty".
if datasets_for_vocab_creation is None and vocab_params.get("type") == "empty":
I'm not crazy about this ad hoc solution here, but I don't see another way. At least I added a test for this case to make sure we don't get a regression in the future.
A possible alternative is to build a lazy generator here that doesn't actually call .read() on anything unless it's needed. Not sure how feasible that is, though.
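As a rough sketch of that idea (not the actual code in this PR): wrap the reads in a generator so .read() is only invoked for datasets actually listed in datasets_for_vocab_creation. The function name and the readers/paths dictionaries below are hypothetical placeholders, not AllenNLP's API.

```python
from typing import Dict, Iterable, Iterator


def lazy_instances(
    readers: Dict[str, object],    # hypothetical: a DatasetReader-like object per dataset key
    paths: Dict[str, str],         # hypothetical: a file path per dataset key
    datasets_for_vocab_creation: Iterable[str],
) -> Iterator:
    """Yield instances only from the datasets named for vocab creation."""
    for key in datasets_for_vocab_creation:
        # .read() is only called for datasets we actually need, so passing
        # an empty list means no dataset is ever read.
        yield from readers[key].read(paths[key])
```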
One obvious concern is that it requires changes to datasets_from_params, in ways that change the return type of that method. I'm pretty sure, though, that this is the only use of datasets_from_params, and this function is only ever called on the master process in a distributed setting. In every other configuration, we go through a different code path.
It's also called from the find_learning_rate command, but that's it.
This is another reason why I think readers should always be lazy. Leave the lazy/non-lazy decision up to the dataset.
Yeah, I can see that. Where do you make the lazy / non-lazy decision, then? You have to make a choice once you get to the DataLoader, because of how batching / sampling works. Do you just base the decision of which dataset to use on how you've decided to do data loading?
I was thinking we'd either make that decision part of the DataLoader API, or have a separate configuration for the type of Dataset to use.
I like the first option because lazy vs. non-lazy is literally a data loading issue. The downside, though, is that we'd be breaking from the PyTorch API. But having to separately configure a DatasetReader, Dataset, and DataLoader seems overboard.
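A very rough sketch of what the first option could look like; none of these names are AllenNLP's real API, and this is only meant to illustrate making laziness a DataLoader concern.

```python
# Hypothetical sketch: the DataLoader owns the lazy vs. non-lazy decision.
class MyDataLoader:
    def __init__(self, reader, data_path: str, lazy: bool = False):
        self.reader = reader        # a DatasetReader-like object (hypothetical)
        self.data_path = data_path
        self.lazy = lazy

    def __iter__(self):
        if self.lazy:
            # Stream instances straight from the reader on every pass.
            yield from self.reader.read(self.data_path)
        else:
            # Materialize once, then iterate over the cached list.
            if not hasattr(self, "_cached"):
                self._cached = list(self.reader.read(self.data_path))
            yield from self._cached
```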
Lol, I discovered this some time ago, but I had thought it was on purpose. @epwalsh What is the best way to avoid building vocabulary when I use prebuilt transformers? Should I use …
@mateuszpieniak just using …
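For reference, a sketch of the kind of configuration this exchange points at, combining the two mechanisms discussed in this PR (an "empty" vocabulary type and an empty datasets_for_vocab_creation list). Whether this is the recommended setup for pretrained transformers is an assumption; the fragment is shown as a Python dict and is not a complete config.

```python
# Hedged example only: a fragment of a training configuration.
config_fragment = {
    "datasets_for_vocab_creation": [],   # don't read any dataset for vocab creation
    "vocabulary": {"type": "empty"},     # skip building a vocabulary entirely
}
```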
This fixes a couple of bugs related to the datasets_for_vocab_creation config parameter.
The main bug is that datasets_for_vocab_creation: [] ends up being treated like datasets_for_vocab_creation: null, due to a statement in commands.train.TrainModel.from_partial_objects that relies on truthiness (an empty list is falsy) when it should explicitly check whether the value is None.
The second bug occurs when training.util.vocab_from_params(...) is called. Even if datasets_for_vocab_creation is set to [], all of the datasets will be initialized. This means that with non-lazy readers, you read all of the data unnecessarily.
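To make the falsiness pitfall concrete, here is a minimal, self-contained illustration. The variable names are illustrative, not the actual code in TrainModel.from_partial_objects: with a truthiness-based fallback an empty list silently becomes "all datasets", whereas an explicit is None check preserves the user's choice.

```python
# Minimal illustration of the falsiness bug described above.
all_datasets = {"train": "train-data", "validation": "validation-data"}

datasets_for_vocab_creation = []  # the user explicitly asked for no datasets

# Buggy pattern: [] is falsy, so it silently falls back to all datasets.
buggy = datasets_for_vocab_creation or list(all_datasets)
assert buggy == ["train", "validation"]

# Fixed pattern: only fall back when the parameter was actually omitted (None).
fixed = (
    list(all_datasets)
    if datasets_for_vocab_creation is None
    else datasets_for_vocab_creation
)
assert fixed == []
```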