Support for bart in allennlp-models (#4169)
* Support for bart in allennlp-models

- Added an option to `PretrainedTransformerEmbedder` to allow usage with
encoder-decoder models, and created a unit test.
- Added the ROUGE-N metric. ROUGE-L will follow soon.
- Added an `indices_to_tokens` abstract method to `TokenIndexer` and implemented
it for `PretrainedTransformerIndexer`. This is useful for turning decoded
sequences in seq2seq models into text.
- Added a `timestep` parameter to the step function in beam search.
- Other minor changes.

* Implemented ROUGE-L, updated ROUGE-N, new tests
- Implemented the ROUGE-L metric (F1 score).
- Implemented ROUGE-N recall, precision, and F1 as metrics that can be
accessed separately.
- Now computing the overall ROUGE-N/L as the average over the scores of each
sequence pair, rather than summing counts across all pairs and then
computing the metric (see the sketch after this list).
- Added tests for the new padding behavior in `get_text_field_mask`.
- Added a test for ROUGE-N/L.
- Stylistic improvements.
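
A minimal sketch of the per-pair averaging described above, assuming unigram overlap (ROUGE-1, F1); the function names are illustrative and do not mirror the actual metric implementation added in this commit:

```python
from collections import Counter
from typing import List, Tuple


def rouge_n_f1(prediction: List[str], reference: List[str], n: int = 1) -> float:
    # Overlapping n-gram counts for a single prediction/reference pair.
    pred_ngrams = Counter(tuple(prediction[i : i + n]) for i in range(len(prediction) - n + 1))
    ref_ngrams = Counter(tuple(reference[i : i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((pred_ngrams & ref_ngrams).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(ref_ngrams.values())
    return 2 * precision * recall / (precision + recall)


# Overall score: average the per-pair F1 scores, rather than pooling n-gram
# counts across all pairs first.
pairs: List[Tuple[List[str], List[str]]] = [
    (["the", "cat", "sat"], ["a", "cat", "sat"]),
    (["hello"], ["hello", "world"]),
]
overall_rouge_1 = sum(rouge_n_f1(p, r) for p, r in pairs) / len(pairs)
```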

* Polynomial lr scheduling, max tokens batch sampling, other small changes

- Implemented polynomial learning rate decay, which is used in
BART. The implementation is based on the Fairseq and TensorFlow
implementations (see the sketch after this list).

- Implemented an option to specify the maximum number of tokens per
batch, rather than specifying a fixed batch size. This is also used for
fine-tuning BART. Added a unit test as well.

- For `indices_to_tokens`, removed the code that strips the cls/sep tokens
introduced by `max_length`. Added a test to reflect this.
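
A minimal sketch of the polynomial decay referenced above, in the Fairseq/TensorFlow style; the function and parameter names are illustrative and not the scheduler's actual API:

```python
def polynomial_decay_lr(
    step: int,
    total_steps: int,
    base_lr: float = 3e-5,
    end_lr: float = 0.0,
    power: float = 1.0,
    warmup_steps: int = 0,
) -> float:
    # Linear warmup from 0 up to base_lr, then polynomial decay down to end_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return (base_lr - end_lr) * remaining ** power + end_lr
```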

* Small stylistic changes

* Added documentation, separated max tokens sampler, fixed circular
import, memory tracking per batch, polynomial lr decay bug fix
- Added documentation for `lazy_groups_of_max_size`.
- Some stylistic changes.
- Made `MaxTokensBatchSampler` a subclass of `BucketBatchSampler`.
- Annotated beam search with no grad.
- Fixed a bug in polynomial decay related to the learning rate of the first batch.
- Fixed the circular import, finally.
- Added GPU/CPU memory tracking for TensorBoard for batches (previously
this was only possible for epochs).

* Fixed linting errors, fixed ROUGE test

- TODO: fix
`TestPretrainedTransformerEmbedder.test_encoder_decoder_model` and
`TestPretrainedTransformerIndexer.test_indices_to_tokens`. Both issues are
related to the new tokenizers.

* Fixed issues with new tokenizers
- Fixed an issue with RoBERTa-based tokenizers in
`pretrained_transformer_indexer`.
- Temporary fix for incorrect type ids when using max length for
`tokens_to_indices` in `PretrainedTransformerIndexer`.
- Fixed the indexer test to not compare `idx` and `idx_end` of `Token`s.

* Added max tokens batch sampler to __init__.py

* Fixed max tokens sampler to account for padding

* Fixed overly large batches caused by short source but long target
sequences in the max tokens batch sampler

* Formatting

* Filled in the changelog

* Tests have moved

* Fix docs

* Adds a test for the max tokens sampler

* Adds warning when a single instance is too big

* More docs changes

* Formatting

* Docs

* Fix old models

* Fixed linting and type checking errors

* Fix docs build

* Fix circular imports

Co-authored-by: Dirk Groeneveld <dirkg@allenai.org>
Tobias Rohde and dirkgr authored May 29, 2020
1 parent 25134f2 commit 5ad7a33
Showing 29 changed files with 980 additions and 166 deletions.
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -14,15 +14,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Our caching mechanism had the potential to introduce race conditions if multiple processes
were attempting to cache the same file at once. This was fixed by using a lock file tied to each
cached file.
- `get_text_field_mask()` now supports padding indices that are not `0`.

### Added

- A `duplicate()` method on `Instance`s and `Field`s, to be used instead of `copy.deepcopy()`.
- A `duplicate()` method on `Instance`s and `Field`s, to be used instead of `copy.deepcopy()`
- A batch sampler that makes sure each batch contains approximately the same number of tokens (`MaxTokensBatchSampler`)
- Functions to turn a sequence of token indices back into tokens
- The ability to use Huggingface encoder/decoder models as token embedders
- Improvements to beam search
- ROUGE metric
- Polynomial decay learning rate scheduler

### Changed

- Similar to our caching mechanism, we introduced a lock file to the vocab to avoid race
conditions when saving/loading the vocab from/to the same serialization directory in different processes.
- The trainer now logs CPU and GPU memory usage to tensorboard.

## [v1.0.0rc5](/~https://github.com/allenai/allennlp/releases/tag/v1.0.0rc5) - 2020-05-26

1 change: 1 addition & 0 deletions allennlp/data/samplers/__init__.py
@@ -8,3 +8,4 @@
BasicBatchSampler,
)
from allennlp.data.samplers.bucket_batch_sampler import BucketBatchSampler
from allennlp.data.samplers.max_tokens_batch_sampler import MaxTokensBatchSampler
23 changes: 14 additions & 9 deletions allennlp/data/samplers/bucket_batch_sampler.py
@@ -1,5 +1,5 @@
import logging
from typing import List, Iterable
from typing import List, Iterable, Tuple
import random
import math

@@ -81,7 +81,9 @@ def __init__(
self.data_source = data_source
self.drop_last = drop_last

def _argsort_by_padding(self, instances: Iterable[Instance]) -> List[int]:
def _argsort_by_padding(
self, instances: Iterable[Instance]
) -> Tuple[List[int], List[List[int]]]:
"""
Argsorts the instances by their padding lengths, using the keys in
`sorting_keys` (in the order in which they are provided). `sorting_keys`
@@ -95,23 +97,26 @@ def _argsort_by_padding(self, instances: Iterable[Instance]) -> List[int]:
for instance in instances:
# Make sure instance is indexed before calling .get_padding
lengths = []
noisy_lengths = []
for field_name in self.sorting_keys:
if field_name not in instance.fields:
raise ConfigurationError(
f'Sorting key "{field_name}" is not a field in instance. '
f"Available fields/keys are {list(instance.fields.keys())}."
)
lengths.append(
add_noise_to_value(len(instance.fields[field_name]), self.padding_noise)
)
instances_with_lengths.append((lengths, instance))
lengths.append(len(instance.fields[field_name]))

noisy_lengths.append(add_noise_to_value(lengths[-1], self.padding_noise))
instances_with_lengths.append((noisy_lengths, lengths, instance))
with_indices = [(x, i) for i, x in enumerate(instances_with_lengths)]
with_indices.sort(key=lambda x: x[0][0])
return [instance_with_index[-1] for instance_with_index in with_indices]
return (
[instance_with_index[-1] for instance_with_index in with_indices],
[instance_with_index[0][1] for instance_with_index in with_indices],
)

def __iter__(self) -> Iterable[List[int]]:

indices = self._argsort_by_padding(self.data_source)
indices, _ = self._argsort_by_padding(self.data_source)
batches = []
for group in lazy_groups_of(indices, self.batch_size):
batch_indices = list(group)
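
A standalone sketch of what the modified `_argsort_by_padding` now returns (sorted instance indices plus the corresponding un-noised lengths, which the new max-tokens sampler consumes); the helper below works on plain length lists and is a simplification, not the actual implementation:

```python
import random
from typing import List, Tuple


def argsort_by_padding(
    lengths: List[List[int]], padding_noise: float = 0.1
) -> Tuple[List[int], List[List[int]]]:
    # `lengths` holds one [len(field) for field in sorting_keys] list per instance.
    noisy = [
        [length * (1 + random.uniform(-padding_noise, padding_noise)) for length in instance_lengths]
        for instance_lengths in lengths
    ]
    # Sort by the (noisy) length of the first sorting key, but return the
    # un-noised lengths alongside the sorted indices.
    order = sorted(range(len(lengths)), key=lambda i: noisy[i][0])
    return order, [lengths[i] for i in order]


indices, sorted_lengths = argsort_by_padding([[12, 7], [3, 4], [25, 9]])
# e.g. indices == [1, 0, 2], sorted_lengths == [[3, 4], [12, 7], [25, 9]]
```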
112 changes: 112 additions & 0 deletions allennlp/data/samplers/max_tokens_batch_sampler.py
@@ -0,0 +1,112 @@
import logging
import random
from typing import List, Iterable, Optional, Iterator, TypeVar

from allennlp.data.samplers import BatchSampler, BucketBatchSampler
from torch.utils import data

logger = logging.getLogger(__name__)


A = TypeVar("A")


@BatchSampler.register("max_tokens_sampler")
class MaxTokensBatchSampler(BucketBatchSampler):
"""
A sampler which, by default, argsorts instances with respect to the maximum input lengths `per
batch`. Batches are then created such that the number of tokens in a batch does not exceed the given
maximum number of tokens. You can provide a list of field names and padding keys (or pass none, in which case
they will be inferred) which the dataset will be sorted by before doing this batching, causing inputs
with similar length to be batched together, making computation more efficient (as less time is
wasted on padded elements of the batch).
# Parameters
data_source: `data.Dataset`
The pytorch `Dataset` of allennlp Instances to bucket.
max_tokens : `int`
The maximum number of tokens to include in a batch.
sorting_keys : `List[str]`, optional
To bucket inputs into batches, we want to group the instances by padding length, so that we
minimize the amount of padding necessary per batch. In order to do this, we need to know
which fields need what type of padding, and in what order.
Specifying the right keys for this is a bit cryptic, so if this is not given we try to
auto-detect the right keys by iterating through a few instances upfront, reading all of the
padding keys and seeing which one has the longest length. We use that one for padding.
This should give reasonable results in most cases. Some cases where it might not be the
right thing to do are when you have a `ListField[TextField]`, or when you have a really
long, constant length `ArrayField`.
When you need to specify this yourself, you can create an instance from your dataset and
call `Instance.get_padding_lengths()` to see a list of all keys used in your data. You
should give one or more of those as the sorting keys here.
padding_noise : `float`, optional (default = `0.1`)
When sorting by padding length, we add a bit of noise to the lengths, so that the sorting
isn't deterministic. This parameter determines how much noise we add, as a percentage of
the actual padding value for each instance.
"""

def __init__(
self,
data_source: data.Dataset,
max_tokens: Optional[int] = None,
sorting_keys: List[str] = None,
padding_noise: float = 0.1,
):
super().__init__(data_source, -1, sorting_keys, padding_noise, False)

self.max_tokens = max_tokens

def _lazy_groups_of_max_size(
self, iterable: Iterable[A], sizes: Iterable[int],
) -> Iterator[List[A]]:
"""
Takes an `iterable` of data and an iterable `sizes` of the same length which represents the sizes of each
corresponding item in `iterable`. The instances from `iterable` are batched such that the padded size
of the batch (the largest size in the batch times the number of items) does not exceed `self.max_tokens`.
"""
cur_max_size = 0
group: List[A] = []

iterator = iter(iterable)
size_iter = iter(sizes)

for item, size in zip(iterator, size_iter):
if size > self.max_tokens:
logger.warning(
"Found instance of size %d, which is bigger than the expected size for a batch (%d)",
size,
self.max_tokens,
)
group_size = max(size, cur_max_size) * (len(group) + 1)

if group_size > self.max_tokens:
yield group
cur_max_size = 0
group = []

group.append(item)
cur_max_size = max(cur_max_size, size)

if len(group) != 0:
yield group

def __iter__(self) -> Iterable[List[int]]:
indices, lengths = self._argsort_by_padding(self.data_source)

max_lengths = [max(length) for length in lengths]
group_iterator = self._lazy_groups_of_max_size(indices, max_lengths)

batches = [list(group) for group in group_iterator]
random.shuffle(batches)
for batch in batches:
yield batch

def __len__(self):
# There is no easy way to count the number of batches, so we need to iterate and count.
return sum(1 for _ in self)
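
A hedged usage sketch of the sampler shown above; the `dataset` variable and field name are hypothetical, and the comments describe the grouping behaviour as implemented in `_lazy_groups_of_max_size`:

```python
from allennlp.data.samplers import MaxTokensBatchSampler

# `dataset` is assumed to be an AllenNLP dataset of Instances with a
# "source_tokens" TextField. Instances are bucketed by (noisy) length and then
# grouped so that max_length_in_batch * batch_size stays at or below max_tokens;
# an instance longer than max_tokens is logged with a warning and ends up in a
# batch by itself.
sampler = MaxTokensBatchSampler(dataset, max_tokens=1024, sorting_keys=["source_tokens"])

for batch_indices in sampler:
    # Each element is a list of instance indices making up one batch.
    ...
```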
21 changes: 17 additions & 4 deletions allennlp/data/token_indexers/pretrained_transformer_indexer.py
@@ -102,6 +102,22 @@ def tokens_to_indices(self, tokens: List[Token], vocabulary: Vocabulary) -> Inde

return self._postprocess_output(output)

@overrides
def indices_to_tokens(
self, indexed_tokens: IndexedTokenList, vocabulary: Vocabulary
) -> List[Token]:
token_ids = indexed_tokens["token_ids"]
type_ids = indexed_tokens.get("type_ids")

return [
Token(
text=vocabulary.get_token_from_index(token_ids[i], self._namespace),
text_id=token_ids[i],
type_id=type_ids[i] if type_ids is not None else None,
)
for i in range(len(token_ids))
]

def _extract_token_and_type_ids(
self, tokens: List[Token]
) -> Tuple[List[int], Optional[List[int]]]:
@@ -162,10 +178,7 @@ def _postprocess_output(self, output: IndexedTokenList) -> IndexedTokenList:
indices = [i for segment in folded_indices for i in segment]

output["token_ids"] = indices
# `create_token_type_ids_from_sequences()` inserts special tokens
output["type_ids"] = self._tokenizer.create_token_type_ids_from_sequences(
indices[self._num_added_start_tokens : -self._num_added_end_tokens]
)
output["type_ids"] = [0] * len(indices)
output["segment_concat_mask"] = [True] * len(indices)

return output
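
A hedged round-trip sketch of the new `indices_to_tokens` method; the model name is an example, and passing a fresh `Vocabulary` relies on the indexer registering the transformer's vocabulary during `tokens_to_indices` (an assumption, not shown in this diff):

```python
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

model_name = "bert-base-uncased"  # example model, not taken from this diff
tokenizer = PretrainedTransformerTokenizer(model_name)
indexer = PretrainedTransformerIndexer(model_name)
vocab = Vocabulary()

tokens = tokenizer.tokenize("hello world")
indexed = indexer.tokens_to_indices(tokens, vocab)
# Turn ids (e.g. the decoded output of a seq2seq model) back into Token objects.
recovered = indexer.indices_to_tokens(indexed, vocab)
print([t.text for t in recovered])
```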
9 changes: 9 additions & 0 deletions allennlp/data/token_indexers/token_indexer.py
@@ -64,6 +64,15 @@ def tokens_to_indices(self, tokens: List[Token], vocabulary: Vocabulary) -> Inde
"""
raise NotImplementedError

def indices_to_tokens(
self, indexed_tokens: IndexedTokenList, vocabulary: Vocabulary
) -> List[Token]:
"""
The inverse operation of `tokens_to_indices`. Takes an `IndexedTokenList` and converts it back
into a list of tokens.
"""
raise NotImplementedError

def get_empty_token_list(self) -> IndexedTokenList:
"""
Returns an `already indexed` version of an empty token list. This is typically just an
5 changes: 3 additions & 2 deletions allennlp/data/tokenizers/pretrained_transformer_tokenizer.py
@@ -124,9 +124,9 @@ def _reverse_engineer_special_tokens(
return_token_type_ids=True,
return_attention_mask=False,
)
dummy_a = self.tokenizer.encode(token_a, add_special_tokens=False)[0]
dummy_a = self.tokenizer.encode(token_a, add_special_tokens=False, add_prefix_space=True)[0]
assert dummy_a in dummy_output["input_ids"]
dummy_b = self.tokenizer.encode(token_b, add_special_tokens=False)[0]
dummy_b = self.tokenizer.encode(token_b, add_special_tokens=False, add_prefix_space=True)[0]
assert dummy_b in dummy_output["input_ids"]
assert dummy_a != dummy_b

@@ -181,6 +181,7 @@ def _reverse_engineer_special_tokens(
add_special_tokens=True,
return_token_type_ids=True,
return_attention_mask=False,
add_prefix_space=True,
)

seen_dummy_a = False
allennlp/modules/token_embedders/pretrained_transformer_embedder.py
@@ -30,15 +30,23 @@ class PretrainedTransformerEmbedder(TokenEmbedder):
through the transformer model independently, and concatenate the final representations.
Should be set to the same value as the `max_length` option on the
`PretrainedTransformerIndexer`.
sub_module: `str`, optional (default = `None`)
The name of a submodule of the transformer to be used as the embedder. Some transformers naturally act
as embedders, such as BERT. However, other models consist of an encoder and a decoder, in which case we
just want to use the encoder.
"""

def __init__(self, model_name: str, max_length: int = None) -> None:
def __init__(self, model_name: str, max_length: int = None, sub_module: str = None) -> None:
super().__init__()
self.transformer_model = AutoModel.from_pretrained(model_name)
self.config = self.transformer_model.config
if sub_module:
assert hasattr(self.transformer_model, sub_module)
self.transformer_model = getattr(self.transformer_model, sub_module)
self._max_length = max_length
# I'm not sure if this works for all models; open an issue on github if you find a case
# where it doesn't work.
self.output_dim = self.transformer_model.config.hidden_size
self.output_dim = self.config.hidden_size

tokenizer = PretrainedTransformerTokenizer(model_name)
self._num_added_start_tokens = len(tokenizer.single_sequence_start_tokens)
@@ -50,11 +58,10 @@ def get_output_dim(self):
return self.output_dim

def _number_of_token_type_embeddings(self):
config = self.transformer_model.config
if isinstance(config, XLNetConfig):
if isinstance(self.config, XLNetConfig):
return 3 # XLNet has 3 type ids
elif hasattr(config, "type_vocab_size"):
return config.type_vocab_size
elif hasattr(self.config, "type_vocab_size"):
return self.config.type_vocab_size
else:
return 0

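
A hedged sketch of the new `sub_module` option, using only the encoder stack of an encoder-decoder model as the embedder; the model name is an example, and downloading the weights is assumed to be acceptable:

```python
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder

# Use only the encoder of a seq2seq model (e.g. BART) as a token embedder.
embedder = PretrainedTransformerEmbedder("facebook/bart-large", sub_module="encoder")
print(embedder.get_output_dim())  # hidden size read from the model config
```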