Overall this looks totally reasonable to me. The `Dataset` stuff looks so similar to what we already have that we could maybe get away with not requiring any user changes to the data readers at all. One way that might be possible is to just change the base `DatasetReader.read()` method to return a `Dataset`, which is an object that implements `__getitem__` on the list of instances returned by `DatasetReader._read()`. This has the benefit of maintaining the existing caching and lazy options, and if the dataset is lazy, we just return an `IterableDataset` that requires a different `Sampler`. Does this make sense?
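A minimal sketch of that idea (illustrative only, not the actual AllenNLP code; `InstanceListDataset` is a made-up name):

```python
from typing import List

from torch.utils.data import Dataset


class InstanceListDataset(Dataset):
    """Wraps the list of instances produced by `DatasetReader._read()`."""

    def __init__(self, instances: List) -> None:
        self._instances = instances

    def __getitem__(self, index: int):
        return self._instances[index]

    def __len__(self) -> int:
        return len(self._instances)


# In the base class, `read()` could then do something like:
# return InstanceListDataset(list(self._read(file_path)))
```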
torch_datasets.py (outdated)

    """
    Here we have two SNLI readers in both of the different styles.
This is old? I only see one here.
torch_datasets.py (outdated)

    class SnliDataset(Dataset):
        def __init__(
            self, file_path: str, token_indexers: Dict[str, TokenIndexer] = None, lazy: bool = False
As long as there's some other object that separates the `file_path` argument from the `token_indexers` argument, I'm ok with this. The thing that's separated should be the one that's `Registrable`, though.
torch_datasets.py (outdated)

        # These were cases where the annotators disagreed; we'll just skip them. It's
        # like 800 out of 500k examples in the training data.
        continue
    self.examples.append(example)
Probably better to have `self.examples` actually be `self.instances`, with actual `Instance` objects, because then we can cache them very easily.
torch_datasets.py (outdated)

        raise NotImplementedError

    def __getitem__(self) -> Instance:
I'd vote for having a default implementation here, like:

    def __getitem__(self, index: int) -> Instance:
        if not self._instances:
            self.load_instances()  # or something
        return self._instances[index]
torch_datasets.py (outdated)

    class BatchInstanceSampler(BatchSampler):

        def __init__(self, data, batch_size: int, sorting_keys: List[Tuple[str, str]] = None, padding_noise: float = 0.1):
Also better here if there's something that separates the configuration from the data, so we can easily use the same configuration on multiple datasets (e.g., train and dev).
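A sketch of one way to do that separation (hypothetical names, just to illustrate the suggestion; the real API may end up looking different):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BucketingConfig:
    """Holds the sampler configuration, independent of any particular dataset."""

    batch_size: int
    sorting_keys: Optional[List[Tuple[str, str]]] = None
    padding_noise: float = 0.1


# The same config object can then be applied to several datasets:
config = BucketingConfig(batch_size=64)
# train_sampler = BatchInstanceSampler(train_data, config)
# dev_sampler = BatchInstanceSampler(dev_data, config)
```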
torch_datasets.py (outdated)

        self._batch_size = batch_size
        self.data = data

    def _argsort_by_padding(self, instances: List[Instance]) -> List[int]:
Not sure why it's `argsort` instead of just `sort`.
The torch sampler and batch sampler classes return indices into your dataset, so this method is returning the positions of instances in the original dataset such that they would be nicely bucketed together.
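For example, this is how the stock pytorch classes already behave - a batch sampler yields lists of indices, and the `DataLoader` then fetches the corresponding items:

```python
from torch.utils.data import BatchSampler, SequentialSampler

data = ["a", "bb", "ccc", "dddd", "ee", "f"]
batch_sampler = BatchSampler(SequentialSampler(data), batch_size=2, drop_last=False)

for indices in batch_sampler:
    print(indices, [data[i] for i in indices])
# [0, 1] ['a', 'bb']
# [2, 3] ['ccc', 'dddd']
# [4, 5] ['ee', 'f']
```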
torch_datasets.py (outdated)

    batch_sampler = BatchInstanceSampler(data, 4)


    def allennlp_collocate(batch):
`collate`, not `collocate`, if you want to match the pytorch name.
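For reference, such a collate function would presumably look roughly like this (a sketch; the exact `Batch` import path and methods may differ from what's in this PR):

```python
from typing import List

from allennlp.data import Batch, Instance


def allennlp_collate(instances: List[Instance]):
    """Batch a list of already-indexed instances into a padded tensor dict."""
    batch = Batch(instances)
    return batch.as_tensor_dict(batch.get_padding_lengths())
```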
I think that does make sense, apart from the fact that pytorch really restricts the samplers you can use with an `IterableDataset`, which would make doing anything but the most naive dataset batching with an iterable dataset very difficult. Maybe this just means that for iterable datasets, you need to do this kind of thing in the dataset reader itself for now (for example, when reading a shard of a language modelling dataset, sorting the sentences by length and yielding them will give you a reasonable approximation to bucketing, whilst not requiring a sampler). Related issues: pytorch/pytorch#28743, pytorch/pytorch#26547
Yeah, if we have a mechanism to support laziness, even if it's a bit more cumbersome than it currently is, I think it's ok. I'm pretty sure laziness is much less common than being able to store all of the data in memory, so we should optimize for the common use case, as long as there are workarounds available for the less common case. The practical language modeling dataset readers that I've seen in allennlp already basically bypass our data iterators, anyway. So if that's the typical use case for laziness, we definitely should not be designing our iterators around them - they need something more than we can reasonably provide, anyway.
Did you update all those configs via regex or something?
    num_workers=num_workers,
    # NOTE: This default is different from the normal `None`.
    # We assume that if you are using this class you are using an
    # allennlp dataset of instances, which would require this.
What is different from the normal `None`?
Different from the normal, which is `None`. More clear?
    def __iter__(self) -> Iterable[List[int]]:

        raise NotImplementedError
Why don't these need to be implemented?
I was expecting to see the subclasses implement this. But the point is that the subclasses will get it from their pytorch superclasses instead of this one, and this exists for mypy?
Because they're abstract base classes, I guess? Should I change it to something else?
Oh I see, you're asking why none of the subclasses implement them - it's because they all also inherit from the pytorch ones which do implement `__iter__`.
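A stripped-down illustration of that pattern (assumed structure, not the actual file): the allennlp-side base class only declares `__iter__` for typing/registration, while the concrete class inherits the real implementation from pytorch.

```python
from typing import Iterator

from torch.utils import data


class Sampler:
    """Stand-in for the Registrable base class; declares the interface only."""

    def __iter__(self) -> Iterator[int]:
        raise NotImplementedError


class MySequentialSampler(data.SequentialSampler, Sampler):
    # `__iter__` is inherited from pytorch's SequentialSampler, so there is
    # nothing to implement here.
    pass


print(list(MySequentialSampler(range(5))))  # [0, 1, 2, 3, 4]
```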
allennlp/training/trainer.py (outdated)

    num_validation_batches = val_iterator.get_num_batches(self._validation_data)
    val_generator_tqdm = Tqdm.tqdm(val_generator, total=num_validation_batches)
    val_generator_tqdm = Tqdm.tqdm(
        iter(validation_data_loader), total=len(validation_data_loader)
Does it not work to just say `Tqdm.tqdm(validation_data_loader)`, and then it will also work if `__len__` is not available?
oh yep, nice 👍
Awesome!
    which fields need what type of padding, and in what order.

    Specifying the right keys for this is a bit cryptic, so if this is not given we try to
    auto-detect the right keys by iterating once through the data up front, reading all of the
"auto-detect the right keys by iterating through a few instances up front" ?
    Specifying the right keys for this is a bit cryptic, so if this is not given we try to
    auto-detect the right keys by iterating once through the data up front, reading all of the
    padding keys and seeing which one has the longest length. We use that one for padding.
    This should give reasonable results in most cases.
Is it worth giving some example cases where this isn't a reasonable default? "Some cases where it might not be the right thing to do are when you have a `ListField[TextField]`, or when you have a really long, constant length `ArrayField`."
I can add that in if you say it's true, but I haven't thought about this deeply 😄
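For what it's worth, the heuristic described in the docstring boils down to something like this (assumed logic, not the code in this PR; it relies on `Instance.get_padding_lengths()` returning a `field_name -> {padding_key: length}` mapping):

```python
import itertools
from collections import defaultdict
from typing import Dict, Tuple


def guess_sorting_key(instances, num_instances: int = 10) -> Tuple[str, str]:
    """Pick the (field_name, padding_key) pair with the longest observed length."""
    max_lengths: Dict[Tuple[str, str], int] = defaultdict(int)
    for instance in itertools.islice(instances, num_instances):
        for field_name, padding_lengths in instance.get_padding_lengths().items():
            for padding_key, length in padding_lengths.items():
                key = (field_name, padding_key)
                max_lengths[key] = max(max_lengths[key], length)
    return max(max_lengths, key=max_lengths.get)
```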
    When sorting by padding length, we add a bit of noise to the lengths, so that the sorting
    isn't deterministic. This parameter determines how much noise we add, as a percentage of
    the actual padding value for each instance.
    drop_last : `bool`
Give the default here.
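For context, the "noise as a percentage of the padding value" behaviour described in the docstring above presumably means something along these lines (sketch):

```python
import random


def add_noise_to_value(value: int, noise_param: float) -> float:
    """Jitter a length by up to +/- (value * noise_param) before sorting."""
    noise = value * noise_param
    return value + random.uniform(-noise, noise)


print(add_noise_to_value(20, 0.1))  # somewhere between 18.0 and 22.0
```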
    When you need to specify this yourself, you can create an instance from your dataset and
    call `Instance.get_padding_lengths()` to see a list of all keys used in your data. You
    should give one or more of those as the sorting keys here.
    batch_size : int, required.
Move this up one, so it's in order?
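As a usage note for the docstring above, the inspection step might look like this (assumes `dataset` is an already-indexed dataset of `Instance`s; the exact padding keys depend on your fields and indexers):

```python
instance = dataset[0]
print(instance.get_padding_lengths())
# e.g. {'premise': {'num_tokens': 17}, 'hypothesis': {'num_tokens': 9}}
# so ("premise", "num_tokens") would be a sensible sorting key here.
```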
allennlp/data/samplers/samplers.py (outdated)

    @Sampler.register("sequential")
    class SequentialSampler(Sampler, data.SequentialSampler):
        """
        A registerable version of pytorch's
s/registerable/registrable/ on all of these.
@@ -0,0 +1,137 @@
    from typing import List, Iterable
Might be worth somewhere in here saying that you can just use the pytorch classes directly without issue if you aren't using `FromParams`.
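For example, something like this should work if you wire things up in Python rather than through config files (sketch; `my_dataset` and `allennlp_collate` are assumed to already exist):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    my_dataset,                   # a map-style dataset of indexed Instances
    batch_size=32,
    shuffle=True,
    collate_fn=allennlp_collate,  # whatever function turns Instances into tensors
)

for batch in loader:
    ...  # each batch is a tensor dict ready to pass to the model
```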
    from allennlp.data import DataLoader

    from allennlp.data.iterators.data_iterator import TensorDict
Do you need to move this to somewhere else? Do we still need this type?
I will, in the next PR which removes the iterator stuff.
@@ -881,25 +877,27 @@ def from_partial_objects(
    if not optimizer_:
        optimizer_ = Optimizer.default(parameters)

    batches_per_epoch = iterator.get_num_batches(train_data)
    if batches_per_epoch == 1:  # get_num_batches returns 1 when it can't determine the answer
    try:
I thought you gave an implementation. Oh, this is the data loader, not the dataset... Ok. But if we never call `len()` on the dataset itself, we don't need a default implementation there anymore, do we? Or does one of the samplers call `len()` on the dataset? (EDIT: this is thinking specifically of the lazy dataset, where `__len__` returns 1)
The dataloader unfortunately calls len (and warns if you call len on a lazy dataset, but still requires it to be implemented, somewhat bizarrely).
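So the trainer-side pattern presumably ends up being roughly this (sketch, with a hypothetical `data_loader` variable):

```python
try:
    batches_per_epoch = len(data_loader)
except TypeError:
    # A DataLoader over a lazy IterableDataset has no meaningful length.
    batches_per_epoch = None
```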
Ok, I'm going to merge this and follow up with two PRs, one removing the iterator code and one updating the training configs.
Is it possible to define batch size as a maximum number of tokens (instead of a fixed batch size) in this setup? If so, where does this magic happen (I am unable to locate it)?
@davidstap, no, we didn't implement that functionality in the new data code, as it didn't seem like it was widely used. We do have a `BucketBatchSampler`, which is significantly more efficient than other samplers, but we don't cap by number of tokens. If you would really like to see it added, please open a new issue for it.
Tobi already has this feature in his Bart work. If it's needed urgently, we can split it out of his pull request and get it in now.
That would be awesome @dirkgr
I just merged that change: /~https://github.com/allenai/allennlp/blob/master/allennlp/data/samplers/max_tokens_batch_sampler.py Thanks to @Tobias-Rohde for getting this done!
Attempt 2 at better data loading:

- `Trainer` now takes a `DataLoader` rather than the data itself and an iterator.
- `DatasetReaders` now return a subclass of pytorch `Dataset` which contains instances, and call `index` on them before returning them. Note that this means the responsibility of indexing the instances has moved from the `Iterator` to the `Dataset`, so correspondingly it now has a `def index_with(self, vocab: Vocabulary)` method. If the vocab is None, the instances are not indexed before returning.
- `DataLoader`, `Sampler` and `BatchSampler` objects can now be constructed `from_params`.
- Lazy (iterable) datasets currently support only limited `DataLoader` functionality. This may change in the future if the pytorch `DataLoader` accepts a `BucketSampler` with an `IterableDataset`; see Sampler for IterableDataset pytorch/pytorch#28743 and ChunkDataset API proposal pytorch/pytorch#26547.
- The `Trainer` is now essentially independent of how a user chooses to create their model inputs - a user doesn't have to use allennlp's data piece at all, because the trainer just accepts a data loader, which passes its batches to the model.

Before:
New config using a batch sampler which buckets instances by length:
Vanilla config which does not use a sampler - in this case the dataloader will just use a sequential sampler by default.
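(The config snippets themselves are not included here; purely as a rough illustration with hypothetical keys, shown as Python dicts, the two `data_loader` sections would be shaped something like this.)

```python
# Bucketing config: a batch sampler that groups instances of similar length.
bucketing_data_loader = {
    "batch_sampler": {
        "type": "bucket",
        "batch_size": 64,
        "sorting_keys": [["tokens", "num_tokens"]],
    },
}

# Vanilla config: no sampler given, so the DataLoader falls back to a sequential sampler.
vanilla_data_loader = {
    "batch_size": 64,
    "shuffle": False,
}
```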
Ok, I did some rough benchmarks on ESIM + SNLI and it seems that this method is a bit faster than previously, but I'm not seeing the speedup with the number of workers that I was expecting. My current hypothesis for that is that the data loading process is not slow for the ESIM config, as it doesn't do any fancy indexing etc. I will try out some slower data loading models, but perhaps we can do that after merging, as a bunch of other stuff is waiting on this PR.