Tango #5162
Conversation
This is untested.
Thanks for addressing all my comments! LGTM ✅
allennlp/common/file_utils.py
    readonly=read_only,
    lock=use_lock,
)
_active_tensor_caches[self.lmdb_env.path()] = self
Where do we remove entries from `_active_tensor_caches`? I see it in an older commit, but not in the latest one.
`_active_tensor_caches` is a `WeakValueDict`, which removes entries automatically when the values are GC'd.
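For reference, the behavior in question looks roughly like this (a sketch, not the PR code):

```python
import weakref


class TensorCache:  # stand-in for the real class
    pass


_active_tensor_caches = weakref.WeakValueDictionary()

cache = TensorCache()
_active_tensor_caches["/tmp/example.lmdb"] = cache
assert "/tmp/example.lmdb" in _active_tensor_caches

del cache  # drop the last strong reference
# On CPython the entry disappears immediately; in general it goes away once
# the value is garbage-collected.
assert "/tmp/example.lmdb" not in _active_tensor_caches
```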
def __new__(cls, filename: Union[str, PathLike], *, read_only: bool = False, **kwargs):
    # This mechanism makes sure we re-use open lmdb file handles. Lmdb has a problem when the same file is
    # opened by the same process multiple times. This is our workaround.
    filename = str(filename)
We should probably normalize `filename` to an absolute path here, right?
I'll do you one better and do it by inode: 153bade
That way even symlinks and hard links work correctly.
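For context, keying by inode instead of by path could look roughly like this (a sketch; not the actual change in 153bade):

```python
import os


def _file_key(filename: str):
    # Paths that point at the same underlying file (via symlinks or hard links)
    # share the same (device, inode) pair, so they get the same cache entry.
    # Note: the file has to exist before os.stat() can be called on it.
    stat = os.stat(filename)
    return (stat.st_dev, stat.st_ino)
```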
💯
1700 lines is pretty uncomfortable to review, so let me attempt a guide:
Steps
The most important thing is the `Step` class. It defines one step in a workflow. Users are expected to just write a `run()` method. The `run()` method must have parameters with type hints. `from_params()` reads those type hints to construct `Step`s. If the `run()` method takes a parameter of type `T`, then `from_params()` assumes the constructor of that step takes a `Union[T, Step[T]]`. In other words, you can provide the `T` directly, or you can put in a `Step` that outputs a `T`. Making `Step`s the input is how you define a DAG of tasks. The `Step` code makes sure to replace inputs of type `Step` with the `Step`'s results before the `run()` method runs.

Hopefully most of this magic will remain hidden from users, but as a reviewer you have to understand it.
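To make the mechanics concrete, here is a minimal sketch of two steps forming a tiny DAG. The import path, the registration names, and the direct-construction style are inferred from the description above, so treat them as assumptions rather than exact PR code:

```python
from typing import List

# Assumed import path for the Step class described above.
from allennlp.tango.step import Step


@Step.register("make_numbers")
class MakeNumbers(Step):
    def run(self, limit: int) -> List[int]:
        return list(range(limit))


@Step.register("sum_numbers")
class SumNumbers(Step):
    def run(self, numbers: List[int]) -> int:
        return sum(numbers)


# Because run() declares `numbers: List[int]`, from_params() accepts either a
# List[int] directly or a Step that produces one (Union[List[int], Step[List[int]]]).
# Passing the upstream step builds a two-node DAG; the Step machinery swaps in
# the upstream step's result before SumNumbers.run() is called.
numbers = MakeNumbers(limit=10)
total = SumNumbers(numbers=numbers)
```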
`Step`s also store some settings:

- `DETERMINISTIC` describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is `False`, the step can't be cached, and neither can any step that depends on it.
- `CACHEABLE` provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache.
- `VERSION` is optional, but recommended. This gives the user a way to tell Tango that a step has changed during development and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
- `FORMAT`: see below.

Those settings above are per `Step` class. Every instance of `Step` has some more settings. You can override the format and whether to cache the results, and there are two more:

- `step_name` allows you to give your step a useful name that you can use to refer to it. This name will be used in many places, like names of directories for results, log messages, etc. If this is not given, the step's unique id stands in.
- `produce_results` specifies whether this is a step whose results we care about. For example, you might build a long pipeline of steps, but you really only want to look at the evaluation at the end. In that case, the evaluation step is the only step where you would set `produce_results` to `True`. If none of your steps have `produce_results` set to `True`, Tango will do nothing. It will only run steps that are necessary to produce the results you need.

Every step has a unique id of the form `f"{self.__class__.__name__}-{self.VERSION}-{hash of input}"`.
RefStep
`_RefStep` is a fake stand-in for a real step. This is used when parsing a DAG of steps. Every step gets parsed on its own, and references to other steps are parsed as a `_RefStep`. Then later, when all the steps are known, `_RefStep`s get resolved to real steps. A final DAG should never contain a `_RefStep`.
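As an illustration, a params dict with a cross-step reference might look like the sketch below. The `"type": "ref"` syntax and the step type names are my assumptions about how references are written before resolution, not necessarily the exact keys in this PR:

```python
# Hypothetical params dict; the "ref" syntax is assumed for illustration.
params = {
    "steps": {
        "dataset": {"type": "hf_dataset", "dataset_name": "piqa"},
        "tokenized": {
            "type": "hf_tokenize",
            # Parsed as a _RefStep first, then resolved to the real "dataset"
            # step once the whole DAG is known.
            "dataset": {"type": "ref", "ref": "dataset"},
        },
    }
}
```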
Formats
The `Format` classes are basically mappings from Python objects to disk and back. They have a `read()` and a `write()` method. They read and write directories, not files. At the end of the `write()` method, the object must be completely written, but the `read()` method can return an object that will do the actual reading lazily. We might use this for reading datasets, where a `read()` method can return immediately, and return an object that reads actual instances later.

Every `Step` defines the `Format` it wants to use to serialize and deserialize its result. `DillFormat` is the default format for all steps. It uses `dill` (a better version of `pickle`) to serialize and deserialize objects. It is surprisingly flexible and can handle almost all Python objects (including (some) functions).
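To make the contract concrete, here is a minimal sketch of a dill-based format, assuming only what's described above (`read()` and `write()` operating on a directory); it is not the actual `DillFormat` implementation:

```python
from pathlib import Path
from typing import Any

import dill


class SketchDillFormat:
    """Illustrative only; mirrors the read()/write() contract described above."""

    def write(self, artifact: Any, directory: Path) -> None:
        # By the time write() returns, the object must be fully on disk.
        with open(directory / "data.dill", "wb") as f:
            dill.dump(artifact, f)

    def read(self, directory: Path) -> Any:
        # This version reads eagerly, but read() is allowed to return an
        # object that loads the actual contents lazily.
        with open(directory / "data.dill", "rb") as f:
            return dill.load(f)
```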
StepCache

`StepCache` is a mapping from instances of `Step` to the results of that step. There are two implementations, `MemoryStepCache` and `DirectoryStepCache`. `DirectoryStepCache` is the component that uses the `Format` classes to cache outputs in a directory. When you run `allennlp tango -s serialization_dir`, the step cache ends up in `serialization_dir/step_cache`.
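Since a step cache is a mapping from steps to results, usage is roughly dictionary-like. This is a sketch; the constructor argument and the flow below are assumptions based on the description above:

```python
# Hypothetical usage; in practice the `allennlp tango` command drives this.
cache = DirectoryStepCache("serialization_dir/step_cache")

step = SumNumbers(numbers=[1, 2, 3])
if step in cache:
    result = cache[step]      # deserialized with the step's Format
else:
    result = 6                # ... compute the step's result somehow ...
    cache[step] = result      # serialized into a per-step directory
```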
Dataset
Tango datasets are a bit different, in that they contain all splits simultaneously, as well as a vocabulary. The definition of `AllenNlpDataset` is sketched after this section.

For bigger datasets, the idea is that we write something that returns an object which can read instances lazily. For example, you could imagine that the `Sequence[Any]` that contains the instances refers to a directory on disk somewhere, and reads one file per instance. But that's all future stuff. In this PR, all the `Sequence`s are `List`s.

One important thing is this: in Tango, all datasets are map-style datasets (to use PyTorch terminology). Iterator-style datasets ("lazy datasets", as we call them in AllenNLP) are a pain, and are not necessary given the right tooling.
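Here is a sketch of that definition, reconstructed from the description above (all splits at once plus a shared vocabulary); treat the exact field names as approximate:

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Sequence

from allennlp.data import Vocabulary


@dataclass
class AllenNlpDataset:
    # Every split at once, e.g. {"train": [...], "validation": [...]}.
    splits: Mapping[str, Sequence[Any]]
    # One vocabulary shared across all splits.
    vocab: Optional[Vocabulary] = None
    # Free-form extra information about the dataset.
    metadata: Mapping[str, Any] = field(default_factory=dict)
```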
Dataloader
`TangoDataLoader` is simpler than the original AllenNLP `DataLoader`. Original data loaders are responsible for a ton of stuff, and have a bit of a messy API owing to the history of their development. Tango data loaders can be simpler because they don't have to deal with lazy datasets or multiprocessing. All they do is make batches out of `Sequence[Instance]`.

The old `MultiprocessingDataLoader` handles a lot of scenarios:

I have split these out into separate data loader classes, which are composable (i.e., one data loader feeds into another). If you think composing data loaders is too complicated for researchers who'd rather not think about it, I'm open to that argument. I think I'd want to keep this API internal because it makes for small classes that are easy to understand, but maybe we write an interface that makes it easier to configure these.
TrainingStep
This step takes basically all the inputs that
GradientDescentTrainer
needs, plus a dataset and split name, and trains a model on it. It adapters toGradientDescentTrainer
. In another iteration, I want to switch to getting rid of the adapter and having the training code directly in the step, since I am pretty unhappy with the inconsistent trainer API we have.EvaluationStep
This step takes a model, a dataset, and the name of a split, and produces an evaluation result. It's roughly equivalent to `allennlp evaluate`.

HuggingfaceDataset
This step loads a HuggingFace dataset and puts it into AllenNLP dataset format. It's a one-liner of a step.
HuggingfaceTokenize
This step uses the HuggingFace tokenizers to turn every `str` in a given dataset into a `TransformerTextField`. That sounds like it would be super useful, but I actually ended up not using it for PIQA. So far I have found that I always need some more specialized treatment of the dataset before turning it into the input to a model.
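Tying the example steps together, a Tango experiment is a DAG of such steps where later steps reference earlier ones. The params below are a sketch written as a Python dict; the registered type names, the reference syntax, and the placement of `produce_results` are assumptions, not the exact keys from this PR:

```python
# Illustrative pipeline: load a dataset, train on it, then evaluate.
params = {
    "steps": {
        "dataset": {"type": "hf_dataset", "dataset_name": "piqa"},
        "trained_model": {
            "type": "training",
            "dataset": {"type": "ref", "ref": "dataset"},
            "training_split": "train",
            # ... model, optimizer, and other trainer inputs ...
        },
        "evaluation": {
            "type": "evaluation",
            "model": {"type": "ref", "ref": "trained_model"},
            "dataset": {"type": "ref", "ref": "dataset"},
            "split": "validation",
            # The only step whose results we ask Tango to keep.
            "produce_results": True,
        },
    }
}
```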