Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Data V2 #3700

Merged
merged 59 commits into from
Feb 26, 2020
Merged

Data V2 #3700

merged 59 commits into from
Feb 26, 2020

Conversation

DeNeutoy
Copy link
Contributor

@DeNeutoy DeNeutoy commented Jan 30, 2020

Attempt 2 at better data loading:

  • Trainer now takes a DataLoader rather than the data itself and an iterator.
  • DatasetReaders now return a subclass of pytorch Dataset which contains instances, and calls index on them before returning them. Note that this means the responsibility of indexing the instances has moved from Iterator to the Dataset, so correspondingly it now has a def index_with(self, vocab: Vocabulary) method. If the vocab is None, the instances are not indexed before returning.
  • Adds some stubs so we can construct pytorch DataLoader, Sampler and BatchSampler objects from_params.
  • Lazy datasets no longer support bucketing. Instead, the dataset reader should order instances in a way such that they are naturally bucketed - e.g read from a shard, sort raw strings by length, and just use the basic DataLoader functionality. This may change in the future if the pytorch DataLoader accepts a BucketSampler with an IterableDataset, see Sampler for IterableDataset pytorch/pytorch#28743 and ChunkDataset API proposal pytorch/pytorch#26547
  • The Trainer is now essentially independent of how a user chooses to create their model inputs - a user doesn't have to use allennlp's data piece at all, because the trainer just accepts a dataloader, which passes it's batches to the model.

Before:

    "iterator": {
      "type": "bucket",
      "sorting_keys": [["tokens", "num_tokens"]],
      "batch_size" : 32
    },

New config using a batch sampler which buckets instances by length:

    "data_loader": {
	  "batch_sampler"{
      "type": "bucket",
      "sorting_keys": [["tokens", "num_tokens"]],
      "batch_size" : 32
    }
},

Vanilla config which does not use a sampler - in this case the dataloader will just use a sequential sampler by default.

    "data_loader": {
	  "num_workers": 5,
      "timeout": 4,
      "batch_size" : 32
},

Ok, I did some rough benchmarks on ESIM + SNLI and it seems that this method is a bit faster than previously, but i'm not seeing the speedup with the number of workers that I was expecting. My current hypothesis for that is that the data loading process is not slow for the ESIM config, as it doesn't do any fancy indexing etc. I will try out some slower data loading models, but perhaps we can do that after merging, as a bunch of other stuff is waiting on this PR.

Screen Shot 2020-02-25 at 3 31 58 PM

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks totally reasonable to me.

The Dataset stuff looks so similar to what we already have that we could maybe get away with not requiring any user changes to the data readers at all. One way that might be possible is to just change the base DatasetReader.read() method to return a Dataset, which is an object that implements __getitem__ on the list of instances returned by DatasetReader._read(). This has the benefit of maintaining the existing caching and lazy options, and if the dataset is lazy, we just return an IterableDataset that requires a different Sampler. Does this make sense?



"""
Here we have two SNLI readers in both of the different styles.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is old? I only see one here.


class SnliDataset(Dataset):
def __init__(
self, file_path: str, token_indexers: Dict[str, TokenIndexer] = None, lazy: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as there's some other object that separates the file_path argument from the token_indexers argument, I'm ok with this. The thing that's separated should be the one that's Registrable, though.

# These were cases where the annotators disagreed; we'll just skip them. It's
# like 800 out of 500k examples in the training data.
continue
self.examples.append(example)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to have self.examples actually be self.instances, with actual Instance objects, because then we can cache them very easily.

raise NotImplementedError

def __getitem__(self) -> Instance:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for having a default implementation here, like:

def __getitem__(self, index: int) -> Instance:
    if not self._instances:
        self.load_instances()  # or something
    return self._instances[index]


class BatchInstanceSampler(BatchSampler):

def __init__(self, data, batch_size: int, sorting_keys: List[Tuple[str, str]] = None, padding_noise: float = 0.1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also better here if there's something that separates the configuration from the data, so we can easily use the same configuration on multiple datasets (e.g., train and dev).

self._batch_size = batch_size
self.data = data

def _argsort_by_padding(self, instances: List[Instance]) -> List[int]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why it's argsort instead of just sort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The torch sampler and batchsampler classes return indices into your dataset. So this method is returning the positions of indices in the original dataset such that they would be nicely bucketed together.

batch_sampler = BatchInstanceSampler(data, 4)


def allennlp_collocate(batch):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collate, not collocate, if you want to match the pytorch name.

@DeNeutoy
Copy link
Contributor Author

I think that does make sense, apart from pytorch really restricts the DataLoader when you are using a IterableDataset

which would make doing anything but the most naive dataset batching with an iterable dataset very difficult. Maybe this just means that for iterable datasets, you need to do this kind of thing in the dataset reader itself for now (for example, reading a shard of language modelling dataset, sort the sentences by length and yielding them will give you a reasonable approximation to bucketing, whilst not requiring a sampler).

Related issues:

pytorch/pytorch#28743

pytorch/pytorch#24915

pytorch/pytorch#26547

@matt-gardner
Copy link
Contributor

Yeah, if we have a mechanism to support laziness, even if it's a bit more cumbersome than it currently is, I think it's ok. I'm pretty sure laziness is much less common than being able to store all of the data in memory, so we should optimize for the common use case, as long as there are workarounds available for the less common case.

The practical language modeling dataset readers that I've seen in allennlp already basically bypass our data iterators, anyway. So if that's the typical use case for laziness, we definitely should not be designing our iterators around them - they need something more than we can reasonably provide, anyway.

@DeNeutoy DeNeutoy changed the title [WIP] Data V2 Data V2 Feb 25, 2020
Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you update all those configs via regex or something?

num_workers=num_workers,
# NOTE: This default is different from the normal `None`.
# We assume that if you are using this class you are using an
# allennlp dataset of instances, which would require this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is different from the normal None?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different from the normal, which is None. More clear?


def __iter__(self) -> Iterable[List[int]]:

raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't these need to be implemented?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting to see the subclasses implement this. But the point is that the subclasses will get it from their pytorch superclasses instead of this one, and this exists for mypy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they're abstract base classes, I guess? Should I change it to something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, you're asking why none of the subclasses implement them - it's because they all also inherit from the pytorch ones which do implement __iter__ .

num_validation_batches = val_iterator.get_num_batches(self._validation_data)
val_generator_tqdm = Tqdm.tqdm(val_generator, total=num_validation_batches)
val_generator_tqdm = Tqdm.tqdm(
iter(validation_data_loader), total=len(validation_data_loader)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it not work to just say Tqdm.tqdm(validation_data_loader), and then it will also work if __len__ is not available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yep, nice 👍

Copy link
Contributor

@matt-gardner matt-gardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

num_workers=num_workers,
# NOTE: This default is different from the normal `None`.
# We assume that if you are using this class you are using an
# allennlp dataset of instances, which would require this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different from the normal, which is None. More clear?

which fields need what type of padding, and in what order.

Specifying the right keys for this is a bit cryptic, so if this is not given we try to
auto-detect the right keys by iterating once through the data up front, reading all of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"auto-detect the right keys by iterating through a few instances up front" ?

Specifying the right keys for this is a bit cryptic, so if this is not given we try to
auto-detect the right keys by iterating once through the data up front, reading all of the
padding keys and seeing which one has the longest length. We use that one for padding.
This should give reasonable results in most cases.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth giving some example cases where this isn't a reasonable default? "Some cases where it might not be the right thing to do are when you have a ListField[TextField], or when you have a really long, constant length ArrayField."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add that in if you say it's true, but I haven't thought about this deeply 😄

When sorting by padding length, we add a bit of noise to the lengths, so that the sorting
isn't deterministic. This parameter determines how much noise we add, as a percentage of
the actual padding value for each instance.
drop_last : `bool`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give the default here.

When you need to specify this yourself, you can create an instance from your dataset and
call `Instance.get_padding_lengths()` to see a list of all keys used in your data. You
should give one or more of those as the sorting keys here.
batch_size : int, required.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this up one, so it's in order?


def __iter__(self) -> Iterable[List[int]]:

raise NotImplementedError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting to see the subclasses implement this. But the point is that the subclasses will get it from their pytorch superclasses instead of this one, and this exists for mypy?

@Sampler.register("sequential")
class SequentialSampler(Sampler, data.SequentialSampler):
"""
A registerable version of pytorch's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/registerable/registrable/ on all of these.

@@ -0,0 +1,137 @@
from typing import List, Iterable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth somewhere in here saying that you can just use the pytorch classes directly without issue if you aren't using FromParams.


from allennlp.data import DataLoader

from allennlp.data.iterators.data_iterator import TensorDict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to move this to somewhere else? Do we still need this type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will, in the next PR which removes the iterator stuff.

@@ -881,25 +877,27 @@ def from_partial_objects(
if not optimizer_:
optimizer_ = Optimizer.default(parameters)

batches_per_epoch = iterator.get_num_batches(train_data)
if batches_per_epoch == 1: # get_num_batches returns 1 when it can't determine the answer
try:
Copy link
Contributor

@matt-gardner matt-gardner Feb 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you gave an implementation. Oh, this is the data loader, not the dataset... Ok. But if we never call len() on the dataset itself, we don't need a default implementation there anymore, do we? Or does one of the samplers call len() on the dataset? (EDIT: this is thinking specifically of the lazy dataset, where __len__ returns 1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dataloader unfortunately calls len (and warns if you call len on a lazy dataset, but still requires it to be implemented, somewhat bizarrely).

@DeNeutoy
Copy link
Contributor Author

Ok, i'm going to merge this and follow up with two PRs, one removing the iterator code and one updating the training configs.

@davidstap
Copy link

Is is possible to define batch size with maximum number of tokens (instead of a fixed batch size) in this setup? If so, where does this magic happen (I am unable to locate it)?

@matt-gardner
Copy link
Contributor

@davidstap, no, we didn't implement that functionality in the new data code, as it didn't seem like it was widely used. We do have a BucketBatchSampler, which is significantly more efficient than other samplers, but we don't cap by number of tokens. If you would really like to see it added, please open a new issue for it.

@dirkgr
Copy link
Member

dirkgr commented May 19, 2020 via email

@davidstap
Copy link

That would be awesome @dirkgr

@dirkgr
Copy link
Member

dirkgr commented May 29, 2020

I just merged that change: /~https://github.com/allenai/allennlp/blob/master/allennlp/data/samplers/max_tokens_batch_sampler.py

Thanks to @Tobias-Rohde for getting this done!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants