Support for bart in allennlp-models (#4169)
* Support for bart in allennlp-models

- Added an option to `PretrainedTransformerEmbedder` to allow usage with
encoder-decoder models, and created a unit test.
- Added the ROUGE-N metric. ROUGE-L will follow soon.
- Added an `indices_to_tokens` abstract method to `TokenIndexer` and implemented
it for `PretrainedTransformerIndexer`. This is useful for turning decoded
sequences in seq2seq models into text.
- Added a `timestep` parameter to the step function in beam search.
- Other minor changes.

* Implemented ROUGE-L, updated ROUGE-N, new tests
- Implemented the ROUGE-L metric (F1 score).
- Implemented ROUGE-N recall, precision, and F1 as metrics that can be
accessed separately.
- Now computing the overall ROUGE-N/L as the average over the scores of each
sequence pair, rather than summing counts across all pairs and then
computing the metric (see the sketch after this list).
- Added tests for the new padding behavior in `get_text_field_mask`.
- Added a test for ROUGE-N/L.
- Stylistic improvements.
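
A minimal sketch of the per-pair averaging described above, assuming unigram overlap (ROUGE-1, F1); the function names are illustrative and do not mirror the actual metric implementation added in this commit:

```python
from collections import Counter
from typing import List, Tuple


def rouge_n_f1(prediction: List[str], reference: List[str], n: int = 1) -> float:
    # Overlapping n-gram counts for a single prediction/reference pair.
    pred_ngrams = Counter(tuple(prediction[i : i + n]) for i in range(len(prediction) - n + 1))
    ref_ngrams = Counter(tuple(reference[i : i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((pred_ngrams & ref_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)


# Overall score: average the per-pair F1 scores, rather than pooling n-gram
# counts across all pairs first.
pairs: List[Tuple[List[str], List[str]]] = [
    (["the", "cat", "sat"], ["a", "cat", "sat"]),
    (["hello"], ["hello", "world"]),
]
overall_rouge_1 = sum(rouge_n_f1(p, r) for p, r in pairs) / len(pairs)
```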

* Polynomial lr scheduling, max tokens batch sampling, other small changes

- Implemented polynomial learning rate decay, which is used in
BART. The implementation is based on the Fairseq and TensorFlow
implementations (see the sketch after this list).

- Implemented an option to specify the maximum number of tokens per
batch, rather than specifying a fixed batch size. This is also used for
fine-tuning BART. Added a unit test as well.

- For `indices_to_tokens`, removed the code that strips the cls/sep tokens
introduced by `max_length`. Added a test to reflect this.
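
A minimal sketch of the polynomial decay referenced above, in the Fairseq/TensorFlow style; the function and parameter names are illustrative and not the scheduler's actual API:

```python
def polynomial_decay_lr(
    step: int,
    total_steps: int,
    base_lr: float = 3e-5,
    end_lr: float = 0.0,
    power: float = 1.0,
    warmup_steps: int = 0,
) -> float:
    # Linear warmup from 0 up to base_lr, then polynomial decay down to end_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return (base_lr - end_lr) * remaining ** power + end_lr
```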

* Small stylistic changes

* Added documentation, separated max tokens sampler, fixed circular
import, memory tracking per batch, polynomial lr decay bug fix
- Added documentation for `lazy_groups_of_max_size`.
- Some stylistic changes.
- Made `MaxTokensBatchSampler` a subclass of `BucketBatchSampler`.
- Annotated beam search with no grad.
- Fixed a bug in polynomial decay related to the learning rate of the first batch.
- Fixed the circular import, finally.
- Added GPU/CPU memory tracking for TensorBoard for batches (previously
this was only possible for epochs).

* Fixed linting errors, fixed ROUGE test

- TODO: fix
`TestPretrainedTransformerEmbedder.test_encoder_decoder_model` and
`TestPretrainedTransformerIndexer.test_indices_to_tokens`. Both issues are
related to the new tokenizers.

* Fixed issues with new tokenizers
- Fixed an issue with RoBERTa-based tokenizers in
`pretrained_transformer_indexer`.
- Temporary fix for incorrect type ids when using max length for
`tokens_to_indices` in `PretrainedTransformerIndexer`.
- Fixed the indexer test to not compare `idx` and `idx_end` of `Token`s.

* Added max tokens batch sampler to __init__.py

* Fixed max tokens sampler to account for padding

* Fixed overly large batches caused by short source but long target
sequences in the max tokens batch sampler

* Formatting

* Filled in the changelog

* Tests have moved

* Fix docs

* Adds a test for the max tokens sampler

* Adds warning when a single instance is too big

* More docs changes

* Formatting

* Docs

* Fix old models

* Fixed linting and type checking errors

* Fix docs build

* Fix circular imports

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Tobias Rohde and dirkgr authored May 29, 2020
1 parent 25134f2 commit 5ad7a33
Showing 29 changed files with 980 additions and 166 deletions.
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -14,15 +14,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Our caching mechanism had the potential to introduce race conditions if multiple processes
were attempting to cache the same file at once. This was fixed by using a lock file tied to each
cached file.
- `get_text_field_mask()` now supports padding indices that are not `0`.

### Added

- A `duplicate()` method on `Instance`s and `Field`s, to be used instead of `copy.deepcopy()`.
- A `duplicate()` method on `Instance`s and `Field`s, to be used instead of `copy.deepcopy()`
- A batch sampler that makes sure each batch contains approximately the same number of tokens (`MaxTokensBatchSampler`)
- Functions to turn a sequence of token indices back into tokens
- The ability to use Huggingface encoder/decoder models as token embedders
- Improvements to beam search
- ROUGE metric
- Polynomial decay learning rate scheduler

### Changed

- Similar to our caching mechanism, we introduced a lock file to the vocab to avoid race
conditions when saving/loading the vocab from/to the same serialization directory in different processes.
- The trainer now logs CPU and GPU memory usage to tensorboard.

## [v1.0.0rc5](/~https://github.com/allenai/allennlp/releases/tag/v1.0.0rc5) - 2020-05-26

1 change: 1 addition & 0 deletions allennlp/data/samplers/__init__.py
@@ -8,3 +8,4 @@
BasicBatchSampler,
)
from allennlp.data.samplers.bucket_batch_sampler import BucketBatchSampler
from allennlp.data.samplers.max_tokens_batch_sampler import MaxTokensBatchSampler
23 changes: 14 additions & 9 deletions allennlp/data/samplers/bucket_batch_sampler.py
@@ -1,5 +1,5 @@
import logging
from typing import List, Iterable
from typing import List, Iterable, Tuple
import random
import math

@@ -81,7 +81,9 @@ def __init__(
self.data_source = data_source
self.drop_last = drop_last

def _argsort_by_padding(self, instances: Iterable[Instance]) -> List[int]:
def _argsort_by_padding(
self, instances: Iterable[Instance]
) -> Tuple[List[int], List[List[int]]]:
"""
Argsorts the instances by their padding lengths, using the keys in
`sorting_keys` (in the order in which they are provided). `sorting_keys`
@@ -95,23 +97,26 @@ def _argsort_by_padding(self, instances: Iterable[Instance]) -> List[int]:
for instance in instances:
# Make sure instance is indexed before calling .get_padding
lengths = []
noisy_lengths = []
for field_name in self.sorting_keys:
if field_name not in instance.fields:
raise ConfigurationError(
f'Sorting key "{field_name}" is not a field in instance. '
f"Available fields/keys are {list(instance.fields.keys())}."
)
lengths.append(
add_noise_to_value(len(instance.fields[field_name]), self.padding_noise)
)
instances_with_lengths.append((lengths, instance))
lengths.append(len(instance.fields[field_name]))

noisy_lengths.append(add_noise_to_value(lengths[-1], self.padding_noise))
instances_with_lengths.append((noisy_lengths, lengths, instance))
with_indices = [(x, i) for i, x in enumerate(instances_with_lengths)]
with_indices.sort(key=lambda x: x[0][0])
return [instance_with_index[-1] for instance_with_index in with_indices]
return (
[instance_with_index[-1] for instance_with_index in with_indices],
[instance_with_index[0][1] for instance_with_index in with_indices],
)

def __iter__(self) -> Iterable[List[int]]:

indices = self._argsort_by_padding(self.data_source)
indices, _ = self._argsort_by_padding(self.data_source)
batches = []
for group in lazy_groups_of(indices, self.batch_size):
batch_indices = list(group)
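
A standalone sketch of what the modified `_argsort_by_padding` now returns (sorted instance indices plus the corresponding un-noised lengths, which the new max-tokens sampler consumes); the helper below works on plain length lists and is a simplification, not the actual implementation:

```python
import random
from typing import List, Tuple


def argsort_by_padding(
    lengths: List[List[int]], padding_noise: float = 0.1
) -> Tuple[List[int], List[List[int]]]:
    # `lengths` holds one [len(field) for field in sorting_keys] list per instance.
    noisy = [
        [length * (1 + random.uniform(-padding_noise, padding_noise)) for length in instance_lengths]
        for instance_lengths in lengths
    ]
    # Sort by the (noisy) length of the first sorting key, but return the
    # un-noised lengths alongside the sorted indices.
    order = sorted(range(len(lengths)), key=lambda i: noisy[i][0])
    return order, [lengths[i] for i in order]


indices, sorted_lengths = argsort_by_padding([[12, 7], [3, 4], [25, 9]])
# e.g. indices == [1, 0, 2], sorted_lengths == [[3, 4], [12, 7], [25, 9]]
```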
112 changes: 112 additions & 0 deletions allennlp/data/samplers/max_tokens_batch_sampler.py
@@ -0,0 +1,112 @@
import logging
import random
from typing import List, Iterable, Optional, Iterator, TypeVar

from allennlp.data.samplers import BatchSampler, BucketBatchSampler
from torch.utils import data

logger = logging.getLogger(__name__)


A = TypeVar("A")


@BatchSampler.register("max_tokens_sampler")
class MaxTokensBatchSampler(BucketBatchSampler):
"""
A sampler which, by default, argsorts instances with respect to the maximum input lengths `per
batch`. Batches are then created such that the number of tokens in a batch does not exceed the given
maximum number of tokens. You can provide a list of field names and padding keys (or pass none, in which case
they will be inferred) which the dataset will be sorted by before doing this batching, causing inputs
with similar length to be batched together, making computation more efficient (as less time is
wasted on padded elements of the batch).
# Parameters
data_source: `data.Dataset`
The pytorch `Dataset` of allennlp Instances to bucket.
max_tokens : `int`
The maximum number of tokens to include in a batch.
sorting_keys : `List[str]`, optional
To bucket inputs into batches, we want to group the instances by padding length, so that we
minimize the amount of padding necessary per batch. In order to do this, we need to know
which fields need what type of padding, and in what order.
Specifying the right keys for this is a bit cryptic, so if this is not given we try to
auto-detect the right keys by iterating through a few instances upfront, reading all of the
padding keys and seeing which one has the longest length. We use that one for padding.
This should give reasonable results in most cases. Some cases where it might not be the
right thing to do are when you have a `ListField[TextField]`, or when you have a really
long, constant length `ArrayField`.
When you need to specify this yourself, you can create an instance from your dataset and
call `Instance.get_padding_lengths()` to see a list of all keys used in your data. You
should give one or more of those as the sorting keys here.
padding_noise : `float`, optional (default = `0.1`)
When sorting by padding length, we add a bit of noise to the lengths, so that the sorting
isn't deterministic. This parameter determines how much noise we add, as a percentage of
the actual padding value for each instance.
"""

def __init__(
self,
data_source: data.Dataset,
max_tokens: Optional[int] = None,
sorting_keys: List[str] = None,
padding_noise: float = 0.1,
):
super().__init__(data_source, -1, sorting_keys, padding_noise, False)

self.max_tokens = max_tokens

def _lazy_groups_of_max_size(
self, iterable: Iterable[A], sizes: Iterable[int],
) -> Iterator[List[A]]:
"""
Takes an `iterable` of data and an iterable `sizes` of the same length which represents the sizes of each
corresponding item in `iterable`. The instances from `iterable` are batched such that the padded size
of the batch (the largest size in the batch times the number of items) does not exceed `self.max_tokens`.
"""
cur_max_size = 0
group: List[A] = []

iterator = iter(iterable)
size_iter = iter(sizes)

for item, size in zip(iterator, size_iter):
if size > self.max_tokens:
logger.warning(
"Found instance of size %d, which is bigger than the expected size for a batch (%d)",
size,
self.max_tokens,
)
group_size = max(size, cur_max_size) * (len(group) + 1)

if group_size > self.max_tokens:
yield group
cur_max_size = 0
group = []

group.append(item)
cur_max_size = max(cur_max_size, size)

if len(group) != 0:
yield group

def __iter__(self) -> Iterable[List[int]]:
indices, lengths = self._argsort_by_padding(self.data_source)

max_lengths = [max(length) for length in lengths]
group_iterator = self._lazy_groups_of_max_size(indices, max_lengths)

batches = [list(group) for group in group_iterator]
random.shuffle(batches)
for batch in batches:
yield batch

def __len__(self):
# There is no easy way to count the number of batches, so we need to iterate and count.
return sum(1 for _ in self)
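
A hedged usage sketch of the sampler shown above; the `dataset` variable and field name are hypothetical, and the comments describe the grouping behaviour as implemented in `_lazy_groups_of_max_size`:

```python
from allennlp.data.samplers import MaxTokensBatchSampler

# `dataset` is assumed to be an AllenNLP dataset of Instances with a
# "source_tokens" TextField. Instances are bucketed by (noisy) length and then
# grouped so that max_length_in_batch * batch_size stays at or below max_tokens;
# an instance longer than max_tokens is logged with a warning and ends up in a
# batch by itself.
sampler = MaxTokensBatchSampler(dataset, max_tokens=1024, sorting_keys=["source_tokens"])

for batch_indices in sampler:
    # Each element is a list of instance indices making up one batch.
    ...
```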
21 changes: 17 additions & 4 deletions allennlp/data/token_indexers/pretrained_transformer_indexer.py
@@ -102,6 +102,22 @@ def tokens_to_indices(self, tokens: List[Token], vocabulary: Vocabulary) -> Inde

return self._postprocess_output(output)

@overrides
def indices_to_tokens(
self, indexed_tokens: IndexedTokenList, vocabulary: Vocabulary
) -> List[Token]:
token_ids = indexed_tokens["token_ids"]
type_ids = indexed_tokens.get("type_ids")

return [
Token(
text=vocabulary.get_token_from_index(token_ids[i], self._namespace),
text_id=token_ids[i],
type_id=type_ids[i] if type_ids is not None else None,
)
for i in range(len(token_ids))
]

def _extract_token_and_type_ids(
self, tokens: List[Token]
) -> Tuple[List[int], Optional[List[int]]]:
@@ -162,10 +178,7 @@ def _postprocess_output(self, output: IndexedTokenList) -> IndexedTokenList:
indices = [i for segment in folded_indices for i in segment]

output["token_ids"] = indices
# `create_token_type_ids_from_sequences()` inserts special tokens
output["type_ids"] = self._tokenizer.create_token_type_ids_from_sequences(
indices[self._num_added_start_tokens : -self._num_added_end_tokens]
)
output["type_ids"] = [0] * len(indices)
output["segment_concat_mask"] = [True] * len(indices)

return output
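
A hedged round-trip sketch of the new `indices_to_tokens` method; the model name is an example, and passing a fresh `Vocabulary` relies on the indexer registering the transformer's vocabulary during `tokens_to_indices` (an assumption, not shown in this diff):

```python
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

model_name = "bert-base-uncased"  # example model, not taken from this diff
tokenizer = PretrainedTransformerTokenizer(model_name)
indexer = PretrainedTransformerIndexer(model_name)
vocab = Vocabulary()

tokens = tokenizer.tokenize("hello world")
indexed = indexer.tokens_to_indices(tokens, vocab)
# Turn ids (e.g. the decoded output of a seq2seq model) back into Token objects.
recovered = indexer.indices_to_tokens(indexed, vocab)
print([t.text for t in recovered])
```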
9 changes: 9 additions & 0 deletions allennlp/data/token_indexers/token_indexer.py
@@ -64,6 +64,15 @@ def tokens_to_indices(self, tokens: List[Token], vocabulary: Vocabulary) -> Inde
"""
raise NotImplementedError

def indices_to_tokens(
self, indexed_tokens: IndexedTokenList, vocabulary: Vocabulary
) -> List[Token]:
"""
The inverse operation of `tokens_to_indices`. Takes an `IndexedTokenList` and converts it back
into a list of tokens.
"""
raise NotImplementedError

def get_empty_token_list(self) -> IndexedTokenList:
"""
Returns an `already indexed` version of an empty token list. This is typically just an
5 changes: 3 additions & 2 deletions allennlp/data/tokenizers/pretrained_transformer_tokenizer.py
@@ -124,9 +124,9 @@ def _reverse_engineer_special_tokens(
return_token_type_ids=True,
return_attention_mask=False,
)
dummy_a = self.tokenizer.encode(token_a, add_special_tokens=False)[0]
dummy_a = self.tokenizer.encode(token_a, add_special_tokens=False, add_prefix_space=True)[0]
assert dummy_a in dummy_output["input_ids"]
dummy_b = self.tokenizer.encode(token_b, add_special_tokens=False)[0]
dummy_b = self.tokenizer.encode(token_b, add_special_tokens=False, add_prefix_space=True)[0]
assert dummy_b in dummy_output["input_ids"]
assert dummy_a != dummy_b

@@ -181,6 +181,7 @@ def _reverse_engineer_special_tokens(
add_special_tokens=True,
return_token_type_ids=True,
return_attention_mask=False,
add_prefix_space=True,
)

seen_dummy_a = False
allennlp/modules/token_embedders/pretrained_transformer_embedder.py
@@ -30,15 +30,23 @@ class PretrainedTransformerEmbedder(TokenEmbedder):
through the transformer model independently, and concatenate the final representations.
Should be set to the same value as the `max_length` option on the
`PretrainedTransformerIndexer`.
sub_module: `str`, optional (default = `None`)
The name of a submodule of the transformer to be used as the embedder. Some transformers naturally act
as embedders, such as BERT. However, other models consist of an encoder and a decoder, in which case we
just want to use the encoder.
"""

def __init__(self, model_name: str, max_length: int = None) -> None:
def __init__(self, model_name: str, max_length: int = None, sub_module: str = None) -> None:
super().__init__()
self.transformer_model = AutoModel.from_pretrained(model_name)
self.config = self.transformer_model.config
if sub_module:
assert hasattr(self.transformer_model, sub_module)
self.transformer_model = getattr(self.transformer_model, sub_module)
self._max_length = max_length
# I'm not sure if this works for all models; open an issue on github if you find a case
# where it doesn't work.
self.output_dim = self.transformer_model.config.hidden_size
self.output_dim = self.config.hidden_size

tokenizer = PretrainedTransformerTokenizer(model_name)
self._num_added_start_tokens = len(tokenizer.single_sequence_start_tokens)
@@ -50,11 +58,10 @@ def get_output_dim(self):
return self.output_dim

def _number_of_token_type_embeddings(self):
config = self.transformer_model.config
if isinstance(config, XLNetConfig):
if isinstance(self.config, XLNetConfig):
return 3 # XLNet has 3 type ids
elif hasattr(config, "type_vocab_size"):
return config.type_vocab_size
elif hasattr(self.config, "type_vocab_size"):
return self.config.type_vocab_size
else:
return 0

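
A hedged sketch of the new `sub_module` option, using only the encoder stack of an encoder-decoder model as the embedder; the model name is an example, and downloading the weights is assumed to be acceptable:

```python
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder

# Use only the encoder of a seq2seq model (e.g. BART) as a token embedder.
embedder = PretrainedTransformerEmbedder("facebook/bart-large", sub_module="encoder")
print(embedder.get_output_dim())  # hidden size read from the model config
```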