Tango #5162
Conversation
This is untested.
Thanks for addressing all my comments! LGTM ✅
allennlp/common/file_utils.py
    readonly=read_only,
    lock=use_lock,
)
_active_tensor_caches[self.lmdb_env.path()] = self
Where do we remove entries from `_active_tensor_caches`? I see it in an older commit, but not in the latest one.
`_active_tensor_caches` is a `WeakValueDict`, which removes entries automatically when the values are GC'd.
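For reference, the behavior in question looks roughly like this (a sketch, not the PR code):

```python
import weakref


class TensorCache:  # stand-in for the real class
    pass


_active_tensor_caches = weakref.WeakValueDictionary()

cache = TensorCache()
_active_tensor_caches["/tmp/example.lmdb"] = cache
assert "/tmp/example.lmdb" in _active_tensor_caches

del cache  # drop the last strong reference
# On CPython the entry disappears immediately; in general it goes away once
# the value is garbage-collected.
assert "/tmp/example.lmdb" not in _active_tensor_caches
```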
def __new__(cls, filename: Union[str, PathLike], *, read_only: bool = False, **kwargs):
    # This mechanism makes sure we re-use open lmdb file handles. Lmdb has a problem when the same file is
    # opened by the same process multiple times. This is our workaround.
    filename = str(filename)
We should probably normalize `filename` to an absolute path here, right?
I'll do you one better and do it by inode: 153bade
That way even symlinks and hard links work correctly.
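For context, keying by inode instead of by path could look roughly like this (a sketch; not the actual change in 153bade):

```python
import os


def _file_key(filename: str):
    # Paths that point at the same underlying file (via symlinks or hard links)
    # share the same (device, inode) pair, so they get the same cache entry.
    # Note: the file has to exist before os.stat() can be called on it.
    stat = os.stat(filename)
    return (stat.st_dev, stat.st_ino)
```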
💯
1700 lines is pretty uncomfortable to review, so let me attempt a guide:
Steps
The most important thing is the `Step` class. It defines one step in a workflow. Users are expected to just write a `run()` method. The `run()` method must have parameters with type hints. `from_params()` reads those type hints to construct `Step`s. If the `run()` method takes a parameter of type `T`, then `from_params()` assumes the constructor of that step takes a `Union[T, Step[T]]`. In other words, you can provide the `T` directly, or you can put in a `Step` that outputs a `T`. Making `Step`s the input is how you define a DAG of tasks. The `Step` code makes sure to replace inputs of type `Step` with the `Step`'s results before the `run()` method runs.

Hopefully most of this magic will remain hidden from users, but as a reviewer you have to understand it.
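To make the mechanics concrete, here is a minimal sketch of two steps forming a tiny DAG. The import path, the registration names, and the direct-construction style are inferred from the description above, so treat them as assumptions rather than exact PR code:

```python
from typing import List

# Assumed import path for the Step class described above.
from allennlp.tango.step import Step


@Step.register("make_numbers")
class MakeNumbers(Step):
    def run(self, limit: int) -> List[int]:
        return list(range(limit))


@Step.register("sum_numbers")
class SumNumbers(Step):
    def run(self, numbers: List[int]) -> int:
        return sum(numbers)


# Because run() declares `numbers: List[int]`, from_params() accepts either a
# List[int] directly or a Step that produces one (Union[List[int], Step[List[int]]]).
# Passing the upstream step builds a two-node DAG; the Step machinery swaps in
# the upstream step's result before SumNumbers.run() is called.
numbers = MakeNumbers(limit=10)
total = SumNumbers(numbers=numbers)
```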
`Step`s also store some settings:

- `DETERMINISTIC` describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is `False`, the step can't be cached, and neither can any step that depends on it.
- `CACHEABLE` provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache.
- `VERSION` is optional, but recommended. This gives the user a way to tell Tango that a step has changed during development and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
- `FORMAT`: see below.

Those settings above are per `Step` class. Every instance of `Step` has some more settings. You can override the format and whether to cache the results, and there are two more:

- `step_name` allows you to give your step a useful name that you can use to refer to it. This name will be used in many places, like names of directories for results, log messages, etc. If this is not given, the step's unique id stands in.
- `produce_results` specifies whether this is a step whose results we care about. For example, you might build a long pipeline of steps, but you really only want to look at the evaluation at the end. In that case, the evaluation step is the only step where you would set `produce_results` to `True`. If none of your steps have `produce_results` set to `True`, Tango will do nothing. It will only run steps that are necessary to produce the results you need.

Every step has a unique id of the form `f"{self.__class__.__name__}-{self.VERSION}-{hash of input}"`.
RefStep
`_RefStep` is a fake stand-in for a real step. This is used when parsing a DAG of steps. Every step gets parsed on its own, and references to other steps are parsed as a `_RefStep`. Then later, when all the steps are known, `_RefStep`s get resolved to real steps. A final DAG should never contain a `_RefStep`.
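As an illustration, a params dict with a cross-step reference might look like the sketch below. The `"type": "ref"` syntax and the step type names are my assumptions about how references are written before resolution, not necessarily the exact keys in this PR:

```python
# Hypothetical params dict; the "ref" syntax is assumed for illustration.
params = {
    "steps": {
        "dataset": {"type": "hf_dataset", "dataset_name": "piqa"},
        "tokenized": {
            "type": "hf_tokenize",
            # Parsed as a _RefStep first, then resolved to the real "dataset"
            # step once the whole DAG is known.
            "dataset": {"type": "ref", "ref": "dataset"},
        },
    }
}
```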
Formats
The `Format` classes are basically mappings from Python objects to disk and back. They have a `read()` and a `write()` method. They read and write directories, not files. At the end of the `write()` method, the object must be completely written, but the `read()` method can return an object that will do the actual reading lazily. We might use this for reading datasets, where a `read()` method can return immediately, and return an object that reads actual instances later.

Every `Step` defines the `Format` it wants to use to serialize and deserialize its result. `DillFormat` is the default format for all steps. It uses `dill` (a better version of `pickle`) to serialize and deserialize objects. It is surprisingly flexible and can handle almost all Python objects (including (some) functions).
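To make the contract concrete, here is a minimal sketch of a dill-based format, assuming only what's described above (`read()` and `write()` operating on a directory); it is not the actual `DillFormat` implementation:

```python
from pathlib import Path
from typing import Any

import dill


class SketchDillFormat:
    """Illustrative only; mirrors the read()/write() contract described above."""

    def write(self, artifact: Any, directory: Path) -> None:
        # By the time write() returns, the object must be fully on disk.
        with open(directory / "data.dill", "wb") as f:
            dill.dump(artifact, f)

    def read(self, directory: Path) -> Any:
        # This version reads eagerly, but read() is allowed to return an
        # object that loads the actual contents lazily.
        with open(directory / "data.dill", "rb") as f:
            return dill.load(f)
```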
StepCache

`StepCache` is a mapping from instances of `Step` to the results of that step. There are two implementations, `MemoryStepCache` and `DirectoryStepCache`. `DirectoryStepCache` is the component that uses the `Format` classes to cache outputs in a directory. When you run `allennlp tango -s serialization_dir`, the step cache ends up in `serialization_dir/step_cache`.
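Since a step cache is a mapping from steps to results, usage is roughly dictionary-like. This is a sketch; the constructor argument and the flow below are assumptions based on the description above:

```python
# Hypothetical usage; in practice the `allennlp tango` command drives this.
cache = DirectoryStepCache("serialization_dir/step_cache")

step = SumNumbers(numbers=[1, 2, 3])
if step in cache:
    result = cache[step]      # deserialized with the step's Format
else:
    result = 6                # ... compute the step's result somehow ...
    cache[step] = result      # serialized into a per-step directory
```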
Dataset
Tango datasets are a bit different, in that they contain all splits simultaneously, as well as a vocabulary. The definition of `AllenNlpDataset` is sketched after this section.

For bigger datasets, the idea is that we write something that returns an object which can read instances lazily. For example, you could imagine that the `Sequence[Any]` that contains the instances refers to a directory on disk somewhere, and reads one file per instance. But that's all future stuff. In this PR, all the `Sequence`s are `List`s.

One important thing is this: in Tango, all datasets are map-style datasets (to use PyTorch terminology). Iterator-style datasets ("lazy datasets", as we call them in AllenNLP) are a pain, and are not necessary given the right tooling.
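Here is a sketch of that definition, reconstructed from the description above (all splits at once plus a shared vocabulary); treat the exact field names as approximate:

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Sequence

from allennlp.data import Vocabulary


@dataclass
class AllenNlpDataset:
    # Every split at once, e.g. {"train": [...], "validation": [...]}.
    splits: Mapping[str, Sequence[Any]]
    # One vocabulary shared across all splits.
    vocab: Optional[Vocabulary] = None
    # Free-form extra information about the dataset.
    metadata: Mapping[str, Any] = field(default_factory=dict)
```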
Dataloader
`TangoDataLoader` is simpler than the original AllenNLP `DataLoader`. Original data loaders are responsible for a ton of stuff, and have a bit of a messy API owing to the history of their development. Tango data loaders can be simpler because they don't have to deal with lazy datasets or multiprocessing. All they do is make batches out of `Sequence[Instance]`.

The old `MultiprocessingDataLoader` handles a lot of scenarios:

I have split these out into separate data loader classes, which are composable (i.e., one data loader feeds into another). If you think composing data loaders is too complicated for researchers who'd rather not think about it, I'm open to that argument. I think I'd want to keep this API internal because it makes for small classes that are easy to understand, but maybe we write an interface that makes it easier to configure these.
TrainingStep
This step takes basically all the inputs that
GradientDescentTrainer
needs, plus a dataset and split name, and trains a model on it. It adapters toGradientDescentTrainer
. In another iteration, I want to switch to getting rid of the adapter and having the training code directly in the step, since I am pretty unhappy with the inconsistent trainer API we have.EvaluationStep
This step takes a model, a dataset, and the name of a split, and produces an evaluation result. It's roughly equivalent to `allennlp evaluate`.

HuggingfaceDataset
This step loads a HuggingFace dataset and puts it into AllenNLP dataset format. It's a one-liner of a step.
HuggingfaceTokenize
This step uses the HuggingFace tokenizers to turn every `str` in a given dataset into a `TransformerTextField`. That sounds like it would be super useful, but I actually ended up not using it for PIQA. So far I have found that I always need some more specialized treatment of the dataset before turning it into the input to a model.
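Tying the example steps together, a Tango experiment is a DAG of such steps where later steps reference earlier ones. The params below are a sketch written as a Python dict; the registered type names, the reference syntax, and the placement of `produce_results` are assumptions, not the exact keys from this PR:

```python
# Illustrative pipeline: load a dataset, train on it, then evaluate.
params = {
    "steps": {
        "dataset": {"type": "hf_dataset", "dataset_name": "piqa"},
        "trained_model": {
            "type": "training",
            "dataset": {"type": "ref", "ref": "dataset"},
            "training_split": "train",
            # ... model, optimizer, and other trainer inputs ...
        },
        "evaluation": {
            "type": "evaluation",
            "model": {"type": "ref", "ref": "trained_model"},
            "dataset": {"type": "ref", "ref": "dataset"},
            "split": "validation",
            # The only step whose results we ask Tango to keep.
            "produce_results": True,
        },
    }
}
```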