Overall this looks totally reasonable to me. The `Dataset` stuff looks so similar to what we already have that we could maybe get away with not requiring any user changes to the data readers at all. One way that might be possible is to just change the base `DatasetReader.read()` method to return a `Dataset`, which is an object that implements `__getitem__` on the list of instances returned by `DatasetReader._read()`. This has the benefit of maintaining the existing caching and lazy options, and if the dataset is lazy, we just return an `IterableDataset` that requires a different `Sampler`. Does this make sense?
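A minimal sketch of that idea (illustrative only, not the actual AllenNLP code; `InstanceListDataset` is a made-up name):

```python
from typing import List

from torch.utils.data import Dataset


class InstanceListDataset(Dataset):
    """Wraps the list of instances produced by `DatasetReader._read()`."""

    def __init__(self, instances: List) -> None:
        self._instances = instances

    def __getitem__(self, index: int):
        return self._instances[index]

    def __len__(self) -> int:
        return len(self._instances)


# In the base class, `read()` could then do something like:
# return InstanceListDataset(list(self._read(file_path)))
```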
torch_datasets.py (outdated)

    """
    Here we have two SNLI readers in both of the different styles.
This is old? I only see one here.
torch_datasets.py (outdated)

    class SnliDataset(Dataset):
        def __init__(
            self, file_path: str, token_indexers: Dict[str, TokenIndexer] = None, lazy: bool = False
As long as there's some other object that separates the `file_path` argument from the `token_indexers` argument, I'm ok with this. The thing that's separated should be the one that's `Registrable`, though.
torch_datasets.py (outdated)

        # These were cases where the annotators disagreed; we'll just skip them. It's
        # like 800 out of 500k examples in the training data.
        continue
    self.examples.append(example)
Probably better to have `self.examples` actually be `self.instances`, with actual `Instance` objects, because then we can cache them very easily.
torch_datasets.py (outdated)

        raise NotImplementedError

    def __getitem__(self) -> Instance:
I'd vote for having a default implementation here, like:

    def __getitem__(self, index: int) -> Instance:
        if not self._instances:
            self.load_instances()  # or something
        return self._instances[index]
torch_datasets.py (outdated)

    class BatchInstanceSampler(BatchSampler):

        def __init__(self, data, batch_size: int, sorting_keys: List[Tuple[str, str]] = None, padding_noise: float = 0.1):
Also better here if there's something that separates the configuration from the data, so we can easily use the same configuration on multiple datasets (e.g., train and dev).
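A sketch of one way to do that separation (hypothetical names, just to illustrate the suggestion; the real API may end up looking different):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BucketingConfig:
    """Holds the sampler configuration, independent of any particular dataset."""

    batch_size: int
    sorting_keys: Optional[List[Tuple[str, str]]] = None
    padding_noise: float = 0.1


# The same config object can then be applied to several datasets:
config = BucketingConfig(batch_size=64)
# train_sampler = BatchInstanceSampler(train_data, config)
# dev_sampler = BatchInstanceSampler(dev_data, config)
```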
torch_datasets.py (outdated)

        self._batch_size = batch_size
        self.data = data

    def _argsort_by_padding(self, instances: List[Instance]) -> List[int]:
Not sure why it's `argsort` instead of just `sort`.
The torch sampler and batch sampler classes return indices into your dataset, so this method is returning the positions of instances in the original dataset such that they would be nicely bucketed together.
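For example, this is how the stock pytorch classes already behave - a batch sampler yields lists of indices, and the `DataLoader` then fetches the corresponding items:

```python
from torch.utils.data import BatchSampler, SequentialSampler

data = ["a", "bb", "ccc", "dddd", "ee", "f"]
batch_sampler = BatchSampler(SequentialSampler(data), batch_size=2, drop_last=False)

for indices in batch_sampler:
    print(indices, [data[i] for i in indices])
# [0, 1] ['a', 'bb']
# [2, 3] ['ccc', 'dddd']
# [4, 5] ['ee', 'f']
```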
torch_datasets.py (outdated)

    batch_sampler = BatchInstanceSampler(data, 4)


    def allennlp_collocate(batch):
`collate`, not `collocate`, if you want to match the pytorch name.
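For reference, such a collate function would presumably look roughly like this (a sketch; the exact `Batch` import path and methods may differ from what's in this PR):

```python
from typing import List

from allennlp.data import Batch, Instance


def allennlp_collate(instances: List[Instance]):
    """Batch a list of already-indexed instances into a padded tensor dict."""
    batch = Batch(instances)
    return batch.as_tensor_dict(batch.get_padding_lengths())
```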
I think that does make sense, apart from the fact that pytorch really restricts the samplers you can use with an `IterableDataset`, which would make doing anything but the most naive dataset batching with an iterable dataset very difficult. Maybe this just means that for iterable datasets, you need to do this kind of thing in the dataset reader itself for now (for example, when reading a shard of a language modelling dataset, sorting the sentences by length and yielding them will give you a reasonable approximation to bucketing, whilst not requiring a sampler). Related issues: pytorch/pytorch#28743, pytorch/pytorch#26547
Yeah, if we have a mechanism to support laziness, even if it's a bit more cumbersome than it currently is, I think it's ok. I'm pretty sure laziness is much less common than being able to store all of the data in memory, so we should optimize for the common use case, as long as there are workarounds available for the less common case. The practical language modeling dataset readers that I've seen in allennlp already basically bypass our data iterators, anyway. So if that's the typical use case for laziness, we definitely should not be designing our iterators around them - they need something more than we can reasonably provide, anyway.
Did you update all those configs via regex or something?
    num_workers=num_workers,
    # NOTE: This default is different from the normal `None`.
    # We assume that if you are using this class you are using an
    # allennlp dataset of instances, which would require this.
What is different from the normal `None`?
Different from the normal, which is `None`. More clear?
    def __iter__(self) -> Iterable[List[int]]:

        raise NotImplementedError
Why don't these need to be implemented?
I was expecting to see the subclasses implement this. But the point is that the subclasses will get it from their pytorch superclasses instead of this one, and this exists for mypy?
Because they're abstract base classes, I guess? Should I change it to something else?
Oh I see, you're asking why none of the subclasses implement them - it's because they all also inherit from the pytorch ones which do implement `__iter__`.
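A stripped-down illustration of that pattern (assumed structure, not the actual file): the allennlp-side base class only declares `__iter__` for typing/registration, while the concrete class inherits the real implementation from pytorch.

```python
from typing import Iterator

from torch.utils import data


class Sampler:
    """Stand-in for the Registrable base class; declares the interface only."""

    def __iter__(self) -> Iterator[int]:
        raise NotImplementedError


class MySequentialSampler(data.SequentialSampler, Sampler):
    # `__iter__` is inherited from pytorch's SequentialSampler, so there is
    # nothing to implement here.
    pass


print(list(MySequentialSampler(range(5))))  # [0, 1, 2, 3, 4]
```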
allennlp/training/trainer.py (outdated)

    num_validation_batches = val_iterator.get_num_batches(self._validation_data)
    val_generator_tqdm = Tqdm.tqdm(val_generator, total=num_validation_batches)
    val_generator_tqdm = Tqdm.tqdm(
        iter(validation_data_loader), total=len(validation_data_loader)
Does it not work to just say `Tqdm.tqdm(validation_data_loader)`, and then it will also work if `__len__` is not available?
oh yep, nice 👍
Awesome!
    which fields need what type of padding, and in what order.

    Specifying the right keys for this is a bit cryptic, so if this is not given we try to
    auto-detect the right keys by iterating once through the data up front, reading all of the
"auto-detect the right keys by iterating through a few instances up front" ?
    Specifying the right keys for this is a bit cryptic, so if this is not given we try to
    auto-detect the right keys by iterating once through the data up front, reading all of the
    padding keys and seeing which one has the longest length. We use that one for padding.
    This should give reasonable results in most cases.
Is it worth giving some example cases where this isn't a reasonable default? "Some cases where it might not be the right thing to do are when you have a `ListField[TextField]`, or when you have a really long, constant length `ArrayField`."
I can add that in if you say it's true, but I haven't thought about this deeply 😄
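For what it's worth, the heuristic described in the docstring boils down to something like this (assumed logic, not the code in this PR; it relies on `Instance.get_padding_lengths()` returning a `field_name -> {padding_key: length}` mapping):

```python
import itertools
from collections import defaultdict
from typing import Dict, Tuple


def guess_sorting_key(instances, num_instances: int = 10) -> Tuple[str, str]:
    """Pick the (field_name, padding_key) pair with the longest observed length."""
    max_lengths: Dict[Tuple[str, str], int] = defaultdict(int)
    for instance in itertools.islice(instances, num_instances):
        for field_name, padding_lengths in instance.get_padding_lengths().items():
            for padding_key, length in padding_lengths.items():
                key = (field_name, padding_key)
                max_lengths[key] = max(max_lengths[key], length)
    return max(max_lengths, key=max_lengths.get)
```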
    When sorting by padding length, we add a bit of noise to the lengths, so that the sorting
    isn't deterministic. This parameter determines how much noise we add, as a percentage of
    the actual padding value for each instance.
    drop_last : `bool`
Give the default here.
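For context, the "noise as a percentage of the padding value" behaviour described in the docstring above presumably means something along these lines (sketch):

```python
import random


def add_noise_to_value(value: int, noise_param: float) -> float:
    """Jitter a length by up to +/- (value * noise_param) before sorting."""
    noise = value * noise_param
    return value + random.uniform(-noise, noise)


print(add_noise_to_value(20, 0.1))  # somewhere between 18.0 and 22.0
```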
    When you need to specify this yourself, you can create an instance from your dataset and
    call `Instance.get_padding_lengths()` to see a list of all keys used in your data. You
    should give one or more of those as the sorting keys here.
    batch_size : int, required.
Move this up one, so it's in order?
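As a usage note for the docstring above, the inspection step might look like this (assumes `dataset` is an already-indexed dataset of `Instance`s; the exact padding keys depend on your fields and indexers):

```python
instance = dataset[0]
print(instance.get_padding_lengths())
# e.g. {'premise': {'num_tokens': 17}, 'hypothesis': {'num_tokens': 9}}
# so ("premise", "num_tokens") would be a sensible sorting key here.
```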
allennlp/data/samplers/samplers.py (outdated)

    @Sampler.register("sequential")
    class SequentialSampler(Sampler, data.SequentialSampler):
        """
        A registerable version of pytorch's
s/registerable/registrable/ on all of these.
@@ -0,0 +1,137 @@
    from typing import List, Iterable
Might be worth somewhere in here saying that you can just use the pytorch classes directly without issue if you aren't using `FromParams`.
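For example, something like this should work if you wire things up in Python rather than through config files (sketch; `my_dataset` and `allennlp_collate` are assumed to already exist):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    my_dataset,                   # a map-style dataset of indexed Instances
    batch_size=32,
    shuffle=True,
    collate_fn=allennlp_collate,  # whatever function turns Instances into tensors
)

for batch in loader:
    ...  # each batch is a tensor dict ready to pass to the model
```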
    from allennlp.data import DataLoader

    from allennlp.data.iterators.data_iterator import TensorDict
Do you need to move this to somewhere else? Do we still need this type?
I will, in the next PR which removes the iterator stuff.
@@ -881,25 +877,27 @@ def from_partial_objects(
    if not optimizer_:
        optimizer_ = Optimizer.default(parameters)

    batches_per_epoch = iterator.get_num_batches(train_data)
    if batches_per_epoch == 1:  # get_num_batches returns 1 when it can't determine the answer
    try:
I thought you gave an implementation. Oh, this is the data loader, not the dataset... Ok. But if we never call `len()` on the dataset itself, we don't need a default implementation there anymore, do we? Or does one of the samplers call `len()` on the dataset? (EDIT: this is thinking specifically of the lazy dataset, where `__len__` returns 1)
The dataloader unfortunately calls len (and warns if you call len on a lazy dataset, but still requires it to be implemented, somewhat bizarrely).
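So the trainer-side pattern presumably ends up being roughly this (sketch, with a hypothetical `data_loader` variable):

```python
try:
    batches_per_epoch = len(data_loader)
except TypeError:
    # A DataLoader over a lazy IterableDataset has no meaningful length.
    batches_per_epoch = None
```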
Ok, I'm going to merge this and follow up with two PRs, one removing the iterator code and one updating the training configs.
Is it possible to define batch size as a maximum number of tokens (instead of a fixed batch size) in this setup? If so, where does this magic happen (I am unable to locate it)?
@davidstap, no, we didn't implement that functionality in the new data code, as it didn't seem like it was widely used. We do have a `BucketBatchSampler`, which is significantly more efficient than other samplers, but we don't cap by number of tokens. If you would really like to see it added, please open a new issue for it.
Tobi already has this feature in his Bart work. If it's needed urgently, we can split it out of his pull request and get it in now.
That would be awesome @dirkgr
I just merged that change: /~https://github.com/allenai/allennlp/blob/master/allennlp/data/samplers/max_tokens_batch_sampler.py Thanks to @Tobias-Rohde for getting this done!
Attempt 2 at better data loading:

- `Trainer` now takes a `DataLoader` rather than the data itself and an iterator.
- `DatasetReaders` now return a subclass of pytorch `Dataset` which contains instances, and call `index` on them before returning them. Note that this means the responsibility of indexing the instances has moved from the `Iterator` to the `Dataset`, so correspondingly it now has a `def index_with(self, vocab: Vocabulary)` method. If the vocab is None, the instances are not indexed before returning.
- `DataLoader`, `Sampler` and `BatchSampler` objects can now be constructed `from_params`.
- Lazy (iterable) datasets currently support only limited `DataLoader` functionality. This may change in the future if the pytorch `DataLoader` accepts a `BucketSampler` with an `IterableDataset`; see Sampler for IterableDataset pytorch/pytorch#28743 and ChunkDataset API proposal pytorch/pytorch#26547.
- The `Trainer` is now essentially independent of how a user chooses to create their model inputs - a user doesn't have to use allennlp's data piece at all, because the trainer just accepts a data loader, which passes its batches to the model.

Before:
New config using a batch sampler which buckets instances by length:
Vanilla config which does not use a sampler - in this case the dataloader will just use a sequential sampler by default.
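(The config snippets themselves are not included here; purely as a rough illustration with hypothetical keys, shown as Python dicts, the two `data_loader` sections would be shaped something like this.)

```python
# Bucketing config: a batch sampler that groups instances of similar length.
bucketing_data_loader = {
    "batch_sampler": {
        "type": "bucket",
        "batch_size": 64,
        "sorting_keys": [["tokens", "num_tokens"]],
    },
}

# Vanilla config: no sampler given, so the DataLoader falls back to a sequential sampler.
vanilla_data_loader = {
    "batch_size": 64,
    "shuffle": False,
}
```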
Ok, I did some rough benchmarks on ESIM + SNLI and it seems that this method is a bit faster than previously, but I'm not seeing the speedup with the number of workers that I was expecting. My current hypothesis for that is that the data loading process is not slow for the ESIM config, as it doesn't do any fancy indexing etc. I will try out some slower data loading models, but perhaps we can do that after merging, as a bunch of other stuff is waiting on this PR.