
Use the new tokenizers #3868

Merged: 40 commits merged into allenai:master on May 12, 2020

Conversation

@dirkgr (Member) commented Feb 28, 2020

No description provided.

@dirkgr (Member Author) commented Feb 28, 2020

This is currently blocked on huggingface/transformers#3057.

@dirkgr (Member Author) commented Feb 28, 2020

Also huggingface/transformers#2917.

@dirkgr (Member Author) commented Feb 28, 2020

And huggingface/transformers#3058.

@bryant1410 (Contributor):

Do you think support for using both the fast and slow versions should be provided? Maybe the fast version is buggy in some cases, so the user could then choose the slow one.

Comment on lines 48 to 49
"NL",
"P",
@bryant1410 (Contributor) Feb 28, 2020

Why were these wrong?

Member Author:

The new code sets token.text to a slice of the original text, using the offsets provided by the new tokenizers. The original text will never contain the ## characters that we had there before.

Contributor:

Where do you get the actual wordpieces now? Is that stored anywhere in the Token object? It'd be nice to have that show up in a test somewhere.

Now that I think about this a bit more, I'm not sure how this interacts with the interpret code. I think we do vocab lookups on ids, instead of looking at token.text, but I'd have to check to be sure.

Contributor:

Specifically, I think this might break: /~https://github.com/allenai/allennlp/blob/master/allennlp/interpret/attackers/hotflip.py#L312-L313. Dang, maybe we do need to have token.text mean something consistent... Or at least have a consistent way to set the wordpiece text that doesn't depend on the tokenizer / indexer being used. I need an API call that will let me take a token ID from my vocabulary and replace a token in a TextField with the text corresponding to that ID. I'm not sure what the best way to do that in this new world is, but that's a hard requirement.
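Something like the sketch below is the kind of call the interpret code needs: map a token id back to a string via the vocabulary and swap it into a TextField. This is added for illustration only; the helper name is made up, and only Vocabulary.get_token_from_index is an existing allennlp call.

from allennlp.data import Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.tokenizers import Token

def replace_token_text(field: TextField, position: int, token_id: int,
                       vocab: Vocabulary, namespace: str = "tokens") -> None:
    # Hypothetical helper: look the id up in the vocabulary and overwrite the
    # token at `position`, keeping text and text_id consistent with each other.
    new_text = vocab.get_token_from_index(token_id, namespace=namespace)
    field.tokens[position] = Token(text=new_text, text_id=token_id)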

@@ -169,99 +160,59 @@ def test_token_idx_roberta(self):
"<s>",
"A",
",",
"Ġnaïve", # RoBERTa has a funny way of encoding combining characters.
"naïve",
Contributor:

I wonder if this breaks RoBERTa usage, because the tokenizer is what changed but the model is the same.

Member Author:

RoBERTa usage is broken because of the leading spaces. The special characters are now handled the right way (except where they break the offsets).

@mfuntowicz:

@bryant1410, @dirkgr: We're (@huggingface) actively looking at all these blocking issues, thanks for reporting 👍

@dirkgr (Member Author) commented Feb 28, 2020

Do you think support for using both the fast and slow versions should be provided? Maybe the fast version is buggy in some cases, so the user could then choose the slow one.

We could certainly back up to that possibility, but I was really excited about getting rid of the hacky token offset code we had there. If we also allow the slow tokenizers, we have to bring that code back.

Edit: I also think that @mfuntowicz and co are on the case, so these won't be buggy for long. I'd rather not keep support for the old ones around to work around a problem that will disappear in a few days.
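For reference, if both implementations were supported, the switch would presumably just be the use_fast flag on from_pretrained. This is a sketch of the standard transformers API, not something this PR adds:

from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)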

@dirkgr (Member Author) commented Mar 2, 2020

Found some more: huggingface/transformers#3091, huggingface/transformers#3088.

I'll go do something else now until the next huggingface release. No sense in filing 100 nearly identical issues :-)

@dirkgr (Member Author) commented Mar 2, 2020

@ZhaofengWu, you might be interested in this, because it re-writes your version of intra_word_tokenize().

@ZhaofengWu (Contributor):

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

@ZhaofengWu (Contributor):

I'm a bit confused by the current approach -- do you only use the dummies to get the type IDs? What makes the previous approach no longer work?

@dirkgr (Member Author) commented Mar 2, 2020

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

The new fast tokenizers need to know at __init__ time whether they produce special tokens or not. Because we need both behaviors, I create one of each and then use the right one when we need them.
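A sketch of that idea, assuming the transformers 2.x fast-tokenizer behavior described above (where add_special_tokens is a constructor-time option); this is not the PR's exact code:

from transformers import AutoTokenizer

model_name = "bert-base-cased"
# One instance that adds special tokens, one that does not; callers pick
# whichever behavior they need.  (Constructor kwarg per the description above.)
tokenizer_with_special_tokens = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer_without_special_tokens = AutoTokenizer.from_pretrained(
    model_name, use_fast=True, add_special_tokens=False
)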

@dirkgr (Member Author) commented Mar 2, 2020

do you only use the dummies to get the type IDs?

No, it uses the dummies to find out where the special tokens go. I stole that code from your code below.

The only reason I changed the dummy numbers is that we now have to pass strings to the tokenizer, not integers. Token 1000 is a Chinese character (at least in bert-base) that'll be difficult to read for our mostly English-speaking developers and users.
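Roughly, the reverse-engineering looks like the sketch below; the sentinel strings and variable names are illustrative, not the PR's actual dummies:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dummy_a, dummy_b = "a", "b"  # readable sentinels instead of raw token ids
with_special = tokenizer.encode(dummy_a, dummy_b, add_special_tokens=True)
id_a = tokenizer.encode(dummy_a, add_special_tokens=False)[0]
id_b = tokenizer.encode(dummy_b, add_special_tokens=False)[0]
# Whatever surrounds the sentinels must be the special-token template.
start_tokens = with_special[: with_special.index(id_a)]
middle_tokens = with_special[with_special.index(id_a) + 1 : with_special.index(id_b)]
end_tokens = with_special[with_special.index(id_b) + 1 :]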

@dirkgr (Member Author) commented Mar 2, 2020

What makes the previous approach no longer work?

There were a few things, but the main one is this: In this version, token.text is always a slice from the original text, as provided by the offsets that the new tokenizers give us.
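Concretely, the idea is the sketch below, assuming a recent transformers fast tokenizer that can return offset mappings; the PR's real code goes through allennlp's Token objects:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
text = "A, naive example"
encoded = tokenizer.encode_plus(
    text, add_special_tokens=False, return_offsets_mapping=True
)
for token_id, (start, end) in zip(encoded["input_ids"], encoded["offset_mapping"]):
    # Each token's text is literally a slice of the original string.
    print(token_id, repr(text[start:end]))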

@ZhaofengWu (Contributor):

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

The new fast tokenizers need to know at __init__ time whether they produce special tokens or not. Because we need both behaviors, I create one of each and then use the right one when we need them.

And we also need to supply that option in encode_plus? Or is it just for compatibility with the old tokenizers?

@dirkgr (Member Author) commented Mar 2, 2020

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

The new fast tokenizers need to know at __init__ time whether they produce special tokens or not. Because we need both behaviors, I create one of each and then use the right one when we need them.

And we also need to supply that option in encode_plus? Or is it just for compatibility with the old tokenizers?

I put it there just to be safe.

The new tokenizers actually print a warning when you turn off special tokens. I would have expected them to warn only when you pass the wrong thing into encode(), but the warning shows up every time. I'm not sure what huggingface is intending here, but I felt this was the safest thing to do.

@dirkgr (Member Author) commented Mar 3, 2020

I added huggingface/transformers#3112, porting some of our tests to huggingface, so we have a better chance of getting compatible updates.

@dirkgr changed the title from "Use the new tokenizers" to "[WIP] Use the new tokenizers" on Mar 4, 2020
dirkgr added 2 commits April 14, 2020 10:48
# Conflicts:
#	allennlp/common/util.py
#	allennlp/data/token_indexers/pretrained_transformer_indexer.py
#	allennlp/data/tokenizers/pretrained_transformer_tokenizer.py
#	allennlp/tests/data/tokenizers/pretrained_transformer_tokenizer_test.py
#	setup.py

# Huggingface tokenizers have different ways of remembering whether they lowercase or not. Detecting it
# this way seems like the least brittle way to do it.
# this way seems like the least brittle way to do it. )
Contributor:

?

Member Author:

fixed
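For reference, one way to do the probe that the comment in the diff describes is sketched below; this is an illustration of the idea, not necessarily the exact detection code in the PR:

from transformers import AutoTokenizer

def tokenizer_lowercases(tokenizer) -> bool:
    # Feed the tokenizer a lone uppercase letter and see whether it comes
    # back lowercased.
    return tokenizer.tokenize("A") == ["a"]

print(tokenizer_lowercases(AutoTokenizer.from_pretrained("bert-base-uncased")))  # True
print(tokenizer_lowercases(AutoTokenizer.from_pretrained("bert-base-cased")))    # False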

setup.py Outdated
@@ -67,7 +67,7 @@
"flaky",
"responses>=0.7",
"conllu==2.3.2",
"transformers>=2.8.0,<2.9.0",
"transformers==2.9",
Contributor:

>=2.9,<2.10?

Member Author:

fixed

@dirkgr dirkgr marked this pull request as ready for review May 8, 2020 00:54
@dirkgr dirkgr requested review from matt-gardner and epwalsh May 8, 2020 00:54
@dirkgr (Member Author) commented May 8, 2020

This has gotten way out of control.

In huggingface, the tokenizers are the only source for how to properly format input for a given transformer model. They are called "tokenizers", but they do all manner of preprocessing. This version basically reverse-engineers a given huggingface tokenizer (with regard to special tokens and token type ids, at least), so that we can perform those steps separately. I want to move to a system where tokenizers tokenize, and other parts of the system add special tokens, concatenate text fields, set type ids, split long sequences, and so on. This is not that system, but it is a step in that direction.

Practical benefits of this PR are faster tokenizers and reliable offsets into the original text, as long as the HF tokenizer supports them. Not all do, so sometimes we have to fall back to my crappy version of calculating offsets.
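The fallback amounts to something like the sketch below: search the original text for each cleaned-up wordpiece, starting where the previous match ended. This is an illustration of the general idea, not the PR's exact code:

from typing import List, Optional, Tuple

def fallback_offsets(text: str, wordpieces: List[str]) -> List[Optional[Tuple[int, int]]]:
    offsets: List[Optional[Tuple[int, int]]] = []
    cursor = 0
    for piece in wordpieces:
        # Strip the markers that various tokenizers attach to wordpieces.
        clean = piece[2:] if piece.startswith("##") else piece.lstrip("Ġ")
        found = text.find(clean, cursor) if clean else -1
        if found < 0:
            offsets.append(None)  # couldn't locate this piece in the text
        else:
            offsets.append((found, found + len(clean)))
            cursor = found + len(clean)
    return offsets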

@@ -519,9 +519,9 @@ def sanitize_wordpiece(wordpiece: str) -> str:
if wordpiece.startswith("##"):
return wordpiece[2:]
elif wordpiece.startswith("Ġ"):
return wordpiece[1:]
return wordpiece.replace("Ġ", " ")
Contributor:

Changing this method might actually interact poorly with the demo. You'll be adding spaces in places that the demo isn't expecting it. Not sure how much a difference this makes, though.

Member Author:

Huggingface has also changed tack on this. At some point the argument was that 'Ġ' is just the encoding for a space, and the space is part of the token, so I made it this way. But in the latest huggingface, the space is no longer part of the token. I'm hoping for clarification from @mfuntowicz on the matter, but I guess if in doubt, I'll make this match the current behavior.

I don't know if the demo is using sanitize_wordpiece() though. It's not a commonly used call as far as I know.

Contributor:

Oh, sorry, you're right. I thought sanitize_wordpiece was called by sanitize (which is used in predictors), but it's not.

Member Author:

I've changed this back.
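For clarity, the two behaviors discussed in the diff above differ only in what happens to the 'Ġ' marker (illustrative values, mirroring the diff rather than quoting the final code):

assert "##ing"[2:] == "ing"                     # BERT-style continuation marker stripped
assert "Ġnaive"[1:] == "naive"                  # old behavior: drop the leading Ġ
assert "Ġnaive".replace("Ġ", " ") == " naive"   # proposed behavior: Ġ becomes a space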

"""
return tokens1 + (tokens2 or [])

def special_tokens_for_sequence(self) -> int:
Contributor:

This name makes me expect a return type of List[str], instead of int. Same below.

Member Author:

I changed the name.

@@ -35,20 +35,6 @@ def test_as_array_produces_token_sequence_bert_cased(self):
indexed = indexer.tokens_to_indices(allennlp_tokens, vocab)
assert indexed["token_ids"] == expected_ids

def test_as_array_produces_token_sequence_bert_cased_sentence_pair(self):
Contributor:

Why don't we need these tests anymore?

Member Author:

They are testing tokenize_sentence_pair(), which I removed. I resurrected them to test add_special_tokens().


self._stride = stride
self._truncation_strategy = truncation_strategy
self._calculate_character_offsets = calculate_character_offsets
# Reverse-engineer the tokenizer for two sequences
Contributor:

Maybe split this into a separate method?

Member Author:

I looked into doing this, and I don't like what it becomes. We'd have to return 5 values from the method, and pass in two tokenizers. There is no way to get a tokenizer with special tokens from one without special tokens, and the function needs both. It's all solvable, but not worth it.


if self._calculate_character_offsets:
# If we don't have token offsets, try to calculate them ourselves.
if token_offsets is None:
Contributor:

Maybe make this a separate method?

Member Author:

Done.

This method properly handles a pair of sentences.
"""
return self._tokenize(sentence_1, sentence_2)
tokens.append(Token(text=token_str, text_id=token_id, type_id=token_type_id, idx=start))
Contributor:

Sorry if I've missed or forgotten an explanation you already gave, but I don't really follow why this needs to be token_str, instead of just doing a reverse lookup on token_id, as (I believe) we had before. What do we gain from this? What do we lose? (I gave one of the latter in other comments on this PR.)

I'm pretty sure that nothing else in the indexing pipeline will look at token.text, because the pretrained transformer indexer can just look at text_id and type_id, and ignore token.text. So what's the point of this change? Is it to make offsets easier? So you can do idx + len(text) in order to get character offsets? If that's the problem that's being solved here, seems easier to just change idx to start_index and end_index, and we can keep the semantics of token.text the same.

And, sorry that I didn't realize this issue earlier.

Member Author:

The original purpose was so that the length is correct. But I also like the intuition that you can concatenate tokens and get real text back.

On the other hand, I like the intuition that there is a 1:1 mapping between token string and token id. We can't have both.

@matt-gardner (Contributor) May 8, 2020

Well, we can have both by just adding another (optional) field to Token, like display_text or original_text, or something. And beyond intuition is the fact that we have parts of our code that rely on the 1:1 mapping, and it's not clear to me how I could modify that logic to handle the case where we lose the 1:1 mapping. It's pretty easy to think of how to modify whatever code depends on the length being right, if there is any.

Member Author:

The whole Token class doesn't make sense for transformers. What's pos_ or lemma_ on a word piece?

After 1.0 ships we need to do some real clean-up of this API. I'm beginning to formulate ideas for 2.0.

Meanwhile, for 1.0, I'll add an idx_end field to track this.

Member Author:

I changed this back so that token.text is the word piece, and I added token.idx_end, which is one past the last character of the token.
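So the invariant after the change is roughly the one sketched below; the field values here are made up for illustration, assuming a Token with the idx/idx_end fields described above:

from allennlp.data.tokenizers import Token

text = "A, naive example"
# A wordpiece covering characters 3..8 of the original text ("naive");
# text_id=123 is an arbitrary illustrative vocabulary id.
token = Token(text="naive", text_id=123, type_id=0, idx=3, idx_end=8)
assert text[token.idx : token.idx_end] == "naive"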


def tokenize_sentence_pair(self, sentence_1: str, sentence_2: str) -> List[Token]:
Contributor:

I don't see this added back anywhere. Is the functionality just removed, or did I miss it?

Member Author:

Yes, that function is removed. You should now call tokenize() and add_special_tokens() separately.

Contributor:

Ah, I get it now, thanks.

Member Author:

At some point, I was tempted to remove the _add_special_tokens flag completely, assume it's always true, and then deal with special tokens in a concatenate() call that does the right thing. But I decided that is too big a change right now.

Contributor:

I agree with that change, and I don't think it's very large (not saying that it should be done in this PR, though). Well, it removes an argument, so it's a breaking change, OK...

Member Author:

Yes, it's not big in terms of lines-of-code, but in downstream impact. I'd have to check and fix all of the readers to be compatible with that change.
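A usage sketch of the tokenize()/add_special_tokens() split described in this thread; the names follow the discussion above, but treat the exact constructor arguments as assumptions rather than the final API:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-cased", add_special_tokens=False)
tokens_a = tokenizer.tokenize("The first sentence.")
tokens_b = tokenizer.tokenize("The second sentence.")
# Special tokens are added in a separate step instead of inside tokenization.
pair = tokenizer.add_special_tokens(tokens_a, tokens_b)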


def ids_to_tokens(
self, token_ids_a: List[int], token_ids_b: Optional[List[int]] = None
def tokens_from_string_tokens(
Contributor:

Do you really want to be defining functions inside the call to intra_word_tokenize?

Member Author:

Why not? If I don't need the function anywhere else?

Contributor:

It just seems like unfortunate overhead that happens in the middle of a pre-processing loop.

Member Author:

I refactored this, and I like the new version better. However, I always thought Python would JIT those functions once and never touch them again, so there is no overhead. So I tested it:

Before: 758 µs ± 943 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
After: 734 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So I guess it is 3% faster afterwards. Not a lot, but could matter in some cases. Interesting.

Contributor:

Huh, thanks for measuring. I did not assume anything would get JITed, so I thought there would be overhead, but I had no intuition for how large of a difference it would make. Pretty small in the end, but if it's in the inner-loop of data processing, even small gains that increase GPU utilization are probably worth it.
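The effect being measured is just the cost of re-creating a closure on every call. A self-contained way to see it (illustrative only, not the PR's benchmark) is:

import timeit

def _combine(a: int, b: int) -> int:
    return a + b

def with_nested_function(data):
    def combine(a: int, b: int) -> int:  # re-created on every call
        return a + b
    return [combine(x, x) for x in data]

def with_hoisted_function(data):
    return [_combine(x, x) for x in data]

data = list(range(1_000))
print("nested :", timeit.timeit(lambda: with_nested_function(data), number=1_000))
print("hoisted:", timeit.timeit(lambda: with_hoisted_function(data), number=1_000))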

@matt-gardner (Contributor) left a comment:

The important stuff I cared about is done, the rest is minor. LGTM when you're happy with it.

@dirkgr (Member Author) commented May 12, 2020

Given that the model tests at allenai/allennlp-models#44 are passing, I'm merging this.

@dirkgr dirkgr merged commit e3d72fc into allenai:master May 12, 2020
@dirkgr dirkgr deleted the Tokenizers branch May 12, 2020 22:30