pieces for multitask learning #2369
Conversation
LGTM
from allennlp.data.instance import Instance
from allennlp.data.iterators.data_iterator import DataIterator


@DataIterator.register("homogeneous-batch")
By convention this should be registered as `homogeneous_batch`.
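That is, something like this (a sketch of just the decorator change; the class body is elided):

```python
from allennlp.data.iterators.data_iterator import DataIterator


@DataIterator.register("homogeneous_batch")  # underscore, per convention
class HomogeneousBatchIterator(DataIterator):
    ...
```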
    If false, it will do the tensorization anew each iteration.
track_epoch : ``bool``, optional, (default = False)
    If true, each instance will get a ``MetadataField`` containing the epoch number.
partition_key : ``str``, optional, (default = "dataset")
This is a bit gross; I wonder if it's better to allow setting an "origin" attribute on an `Instance` or something. Maybe not for this PR.
It needs to make it into the model, so if you did that you'd have to change all the "batch to tensor" logic to account for it, and then make sure it doesn't somehow collide with other tensors, and so on.
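For context, a rough sketch of how the iterator and its `partition_key` would be constructed, assuming it ends up exported from `allennlp.data.iterators` like the other iterators and inherits `batch_size` from the base class:

```python
from allennlp.data.iterators import HomogeneousBatchIterator

# Group instances by the value of their "dataset" MetadataField, so that
# every batch contains instances from a single wrapped dataset reader.
iterator = HomogeneousBatchIterator(batch_size=32, partition_key="dataset")
```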
As discussed at happy hour, this includes an `InterleavingDatasetReader`, which wraps multiple other dataset readers and interleaves their instances (adding a `MetadataField` indicating which dataset each instance came from), and a `HomogeneousBatchIterator`, which assumes such a `MetadataField` exists and constructs batches that are homogeneous with respect to its value.

The only "weird" thing is that the `file_path` passed to `InterleavingDatasetReader.read()` needs to be a JSON-serialized dict `{wrapped_reader_key -> file_path}`. We discussed alternative designs, like passing in a directory and requiring each wrapped reader to know to look for a specific file under the provided directory; I felt that seemed a little too prescriptive about data layout and harder to configure.

I believe that with these pieces, most of the multitask things that S2 research wants to do should be relatively easy. (Notably, with the file-path-as-JSON-dict innovation, we can just use the usual `Trainer` 😬)

FYI @amandalynne
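To make the `file_path`-as-JSON-dict part concrete, here is a rough usage sketch. The reader keys and file paths are made up, and I'm assuming the constructor takes a dict of wrapped readers keyed by the same names used in the JSON:

```python
import json

from allennlp.data.dataset_readers import (
    InterleavingDatasetReader,
    SequenceTaggingDatasetReader,
    SnliReader,
)

# Two wrapped readers, keyed by name (the keys here are illustrative).
reader = InterleavingDatasetReader(readers={
    "snli": SnliReader(),
    "tagging": SequenceTaggingDatasetReader(),
})

# The file_path is a JSON-serialized dict {wrapped_reader_key -> file_path};
# the paths below are placeholders.
file_path = json.dumps({
    "snli": "/path/to/snli_train.jsonl",
    "tagging": "/path/to/tags_train.tsv",
})

# Each yielded instance carries a MetadataField (under "dataset" by default)
# naming the wrapped reader it came from.
instances = reader.read(file_path)
```

A `HomogeneousBatchIterator` with a matching `partition_key` can then batch these instances without mixing datasets.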