
Use the new tokenizers #3868

Merged: 40 commits merged into allenai:master on May 12, 2020

Conversation

@dirkgr (Member) commented Feb 28, 2020

No description provided.

@dirkgr (Member Author) commented Feb 28, 2020

This is currently blocked on huggingface/transformers#3057.

@dirkgr (Member Author) commented Feb 28, 2020

Also huggingface/transformers#2917.

@dirkgr (Member Author) commented Feb 28, 2020

And huggingface/transformers#3058.

@bryant1410 (Contributor):

Do you think support for using both the fast and slow versions should be provided? Maybe the fast version is buggy in some cases, so the user could then choose the slow one.

Comment on lines 48 to 49
"NL",
"P",
@bryant1410 (Contributor) Feb 28, 2020

Why were these wrong?

Member Author:

The new code sets token.text to a slice of the original text, using the offsets provided by the new tokenizers. The original text will never contain the ## characters that we had there before.

Contributor:

Where do you get the actual wordpieces now? Is that stored anywhere in the Token object? It'd be nice to have that show up in a test somewhere.

Now that I think about this a bit more, I'm not sure how this interacts with the interpret code. I think we do vocab lookups on ids, instead of looking at token.text, but I'd have to check to be sure.

Contributor:

Specifically, I think this might break: /~https://github.com/allenai/allennlp/blob/master/allennlp/interpret/attackers/hotflip.py#L312-L313. Dang, maybe we do need to have token.text mean something consistent... Or at least have a consistent way to set the wordpiece text that doesn't depend on the tokenizer / indexer being used. I need an API call that will let me take a token ID from my vocabulary and replace a token in a TextField with the text corresponding to that ID. I'm not sure what the best way to do that in this new world is, but that's a hard requirement.
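Something like the sketch below is the kind of call the interpret code needs: map a token id back to a string via the vocabulary and swap it into a TextField. This is added for illustration only; the helper name is made up, and only Vocabulary.get_token_from_index is an existing allennlp call.

from allennlp.data import Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.tokenizers import Token

def replace_token_text(field: TextField, position: int, token_id: int,
                       vocab: Vocabulary, namespace: str = "tokens") -> None:
    # Hypothetical helper: look the id up in the vocabulary and overwrite the
    # token at `position`, keeping text and text_id consistent with each other.
    new_text = vocab.get_token_from_index(token_id, namespace=namespace)
    field.tokens[position] = Token(text=new_text, text_id=token_id)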

@@ -169,99 +160,59 @@ def test_token_idx_roberta(self):
"<s>",
"A",
",",
"Ġnaïve", # RoBERTa has a funny way of encoding combining characters.
"naïve",
Contributor:

I wonder if this breaks RoBERTa usage, because the tokenizer is what changed but the model is the same.

Member Author:

RoBERTa usage is broken because of the leading spaces. The special characters are now handled the right way (except where they break the offsets).

@mfuntowicz:

@bryant1410, @dirkgr: We're (@huggingface) actively looking at all these blocking issues, thanks for reporting 👍

@dirkgr (Member Author) commented Feb 28, 2020

Do you think support for using both the fast and slow versions should be provided? Maybe the fast version is buggy in some cases, so the user could then choose the slow one.

We could certainly back up to that possibility, but I was really excited about getting rid of the hacky token offset code we had there. If we also allow the slow tokenizers, we have to bring that code back.

Edit: I also think that @mfuntowicz and co are on the case, so these won't be buggy for long. I'd rather not keep support for the old ones around to work around a problem that will disappear in a few days.
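For reference, if both implementations were supported, the switch would presumably just be the use_fast flag on from_pretrained. This is a sketch of the standard transformers API, not something this PR adds:

from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)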

@dirkgr (Member Author) commented Mar 2, 2020

Found some more: huggingface/transformers#3091, huggingface/transformers#3088.

I'll go do something else now until the next huggingface release. No sense in filing 100 nearly identical issues :-)

@dirkgr (Member Author) commented Mar 2, 2020

@ZhaofengWu, you might be interested in this, because it re-writes your version of intra_word_tokenize().

@ZhaofengWu (Contributor):

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

@ZhaofengWu (Contributor):

I'm a bit confused by the current approach -- do you only use the dummies to get the type IDs? What makes the previous approach no longer work?

@dirkgr (Member Author) commented Mar 2, 2020

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

The new fast tokenizers need to know at __init__ time whether they produce special tokens or not. Because we need both behaviors, I create one of each and then use the right one when we need them.
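A sketch of that idea, assuming the transformers 2.x fast-tokenizer behavior described above (where add_special_tokens is a constructor-time option); this is not the PR's exact code:

from transformers import AutoTokenizer

model_name = "bert-base-cased"
# One instance that adds special tokens, one that does not; callers pick
# whichever behavior they need.  (Constructor kwarg per the description above.)
tokenizer_with_special_tokens = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer_without_special_tokens = AutoTokenizer.from_pretrained(
    model_name, use_fast=True, add_special_tokens=False
)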

@dirkgr (Member Author) commented Mar 2, 2020

do you only use the dummies to get the type IDs?

No, it uses the dummies to find out where the special tokens go. I stole that code from your code below.

The only reason I changed the dummy numbers is that we now have to pass strings to the tokenizer, not integers. Token 1000 is a Chinese character (at least in bert-base) that'll be difficult to read for our mostly English-speaking developers and users.
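Roughly, the reverse-engineering looks like the sketch below; the sentinel strings and variable names are illustrative, not the PR's actual dummies:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dummy_a, dummy_b = "a", "b"  # readable sentinels instead of raw token ids
with_special = tokenizer.encode(dummy_a, dummy_b, add_special_tokens=True)
id_a = tokenizer.encode(dummy_a, add_special_tokens=False)[0]
id_b = tokenizer.encode(dummy_b, add_special_tokens=False)[0]
# Whatever surrounds the sentinels must be the special-token template.
start_tokens = with_special[: with_special.index(id_a)]
middle_tokens = with_special[with_special.index(id_a) + 1 : with_special.index(id_b)]
end_tokens = with_special[with_special.index(id_b) + 1 :]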

@dirkgr (Member Author) commented Mar 2, 2020

What makes the previous approach no longer work?

There were a few things, but the main one is this: In this version, token.text is always a slice from the original text, as provided by the offsets that the new tokenizers give us.
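Concretely, the idea is the sketch below, assuming a recent transformers fast tokenizer that can return offset mappings; the PR's real code goes through allennlp's Token objects:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
text = "A, naive example"
encoded = tokenizer.encode_plus(
    text, add_special_tokens=False, return_offsets_mapping=True
)
for token_id, (start, end) in zip(encoded["input_ids"], encoded["offset_mapping"]):
    # Each token's text is literally a slice of the original string.
    print(token_id, repr(text[start:end]))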

@ZhaofengWu (Contributor):

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

The new fast tokenizers need to know at __init__ time whether they produce special tokens or not. Because we need both behaviors, I create one of each and then use the right one when we need them.

And we also need to supply that option in encode_plus? Or is it just for compatibility with the old tokenizers?

@dirkgr (Member Author) commented Mar 2, 2020

Wait, there is add_special_tokens in encode_plus and the constructor? Why is that necessary?

The new fast tokenizers need to know at __init__ time whether they produce special tokens or not. Because we need both behaviors, I create one of each and then use the right one when we need them.

And we also need to supply that option in encode_plus? Or is it just for compatibility with the old tokenizers?

I put it there just to be safe.

The new tokenizers actually print a warning when you turn off special tokens. I would have expected them to warn only when you pass the wrong thing into encode(), but the warning shows up every time. I'm not sure what huggingface is intending here, but I felt this was the safest thing to do.

@dirkgr (Member Author) commented Mar 3, 2020

I added huggingface/transformers#3112, porting some of our tests to huggingface, so we have a better chance of getting compatible updates.

@dirkgr changed the title from "Use the new tokenizers" to "[WIP] Use the new tokenizers" on Mar 4, 2020
dirkgr added 2 commits April 14, 2020 10:48
# Conflicts:
#	allennlp/common/util.py
#	allennlp/data/token_indexers/pretrained_transformer_indexer.py
#	allennlp/data/tokenizers/pretrained_transformer_tokenizer.py
#	allennlp/tests/data/tokenizers/pretrained_transformer_tokenizer_test.py
#	setup.py

# Huggingface tokenizers have different ways of remembering whether they lowercase or not. Detecting it
# this way seems like the least brittle way to do it.
# this way seems like the least brittle way to do it. )
Contributor:

?

Member Author:

fixed
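For reference, one way to do the probe that the comment in the diff describes is sketched below; this is an illustration of the idea, not necessarily the exact detection code in the PR:

from transformers import AutoTokenizer

def tokenizer_lowercases(tokenizer) -> bool:
    # Feed the tokenizer a lone uppercase letter and see whether it comes
    # back lowercased.
    return tokenizer.tokenize("A") == ["a"]

print(tokenizer_lowercases(AutoTokenizer.from_pretrained("bert-base-uncased")))  # True
print(tokenizer_lowercases(AutoTokenizer.from_pretrained("bert-base-cased")))    # False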

setup.py Outdated
@@ -67,7 +67,7 @@
"flaky",
"responses>=0.7",
"conllu==2.3.2",
"transformers>=2.8.0,<2.9.0",
"transformers==2.9",
Contributor:

>=2.9,<2.10?

Member Author:

fixed

@dirkgr dirkgr marked this pull request as ready for review May 8, 2020 00:54
@dirkgr dirkgr requested review from matt-gardner and epwalsh May 8, 2020 00:54
@dirkgr (Member Author) commented May 8, 2020

This has gotten way out of control.

In huggingface, the tokenizers are the only source for how to properly format input for a given transformer model. They are called "tokenizers", but they do all manner of preprocessing. This version basically reverse-engineers a given huggingface tokenizer (with regard to special tokens and token type ids, at least), so that we can perform those steps separately. I want to move to a system where tokenizers tokenize, and other parts of the system add special tokens, concatenate text fields, set type ids, split long sequences, and so on. This is not that system, but it is a step in that direction.

Practical benefits of this PR are faster tokenizers and reliable offsets into the original text, as long as the HF tokenizer supports them. Not all do, so sometimes we have to fall back to my crappy version of calculating offsets.
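The fallback amounts to something like the sketch below: search the original text for each cleaned-up wordpiece, starting where the previous match ended. This is an illustration of the general idea, not the PR's exact code:

from typing import List, Optional, Tuple

def fallback_offsets(text: str, wordpieces: List[str]) -> List[Optional[Tuple[int, int]]]:
    offsets: List[Optional[Tuple[int, int]]] = []
    cursor = 0
    for piece in wordpieces:
        # Strip the markers that various tokenizers attach to wordpieces.
        clean = piece[2:] if piece.startswith("##") else piece.lstrip("Ġ")
        found = text.find(clean, cursor) if clean else -1
        if found < 0:
            offsets.append(None)  # couldn't locate this piece in the text
        else:
            offsets.append((found, found + len(clean)))
            cursor = found + len(clean)
    return offsets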

@@ -519,9 +519,9 @@ def sanitize_wordpiece(wordpiece: str) -> str:
if wordpiece.startswith("##"):
return wordpiece[2:]
elif wordpiece.startswith("Ġ"):
return wordpiece[1:]
return wordpiece.replace("Ġ", " ")
Contributor:

Changing this method might actually interact poorly with the demo. You'll be adding spaces in places that the demo isn't expecting it. Not sure how much a difference this makes, though.

Member Author:

Huggingface has also changed tack on this. At some point the argument was that 'Ġ' is just the encoding for a space, and the space is part of the token, so I made it this way. But in the latest huggingface, the space is no longer part of the token. I'm hoping for clarification from @mfuntowicz on the matter, but I guess if in doubt, I'll make this match the current behavior.

I don't know if the demo is using sanitize_wordpiece() though. It's not a commonly used call as far as I know.

Contributor:

Oh, sorry, you're right. I thought sanitize_wordpiece was called by sanitize (which is used in predictors), but it's not.

Member Author:

I've changed this back.
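For clarity, the two behaviors discussed in the diff above differ only in what happens to the 'Ġ' marker (illustrative values, mirroring the diff rather than quoting the final code):

assert "##ing"[2:] == "ing"                     # BERT-style continuation marker stripped
assert "Ġnaive"[1:] == "naive"                  # old behavior: drop the leading Ġ
assert "Ġnaive".replace("Ġ", " ") == " naive"   # proposed behavior: Ġ becomes a space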

"""
return tokens1 + (tokens2 or [])

def special_tokens_for_sequence(self) -> int:
Contributor:

This name makes me expect a return type of List[str], instead of int. Same below.

Member Author:

I changed the name.

@@ -35,20 +35,6 @@ def test_as_array_produces_token_sequence_bert_cased(self):
indexed = indexer.tokens_to_indices(allennlp_tokens, vocab)
assert indexed["token_ids"] == expected_ids

def test_as_array_produces_token_sequence_bert_cased_sentence_pair(self):
Contributor:

Why don't we need these tests anymore?

Member Author:

They are testing tokenize_sentence_pair(), which I removed. I resurrected them to test add_special_tokens().


self._stride = stride
self._truncation_strategy = truncation_strategy
self._calculate_character_offsets = calculate_character_offsets
# Reverse-engineer the tokenizer for two sequences
Contributor:

Maybe split this into a separate method?

Member Author:

I looked into doing this, and I don't like what it becomes. We'd have to return 5 values from the method, and pass in two tokenizers. There is no way to get a tokenizer with special tokens from one without special tokens, and the function needs both. It's all solvable, but not worth it.


if self._calculate_character_offsets:
# If we don't have token offsets, try to calculate them ourselves.
if token_offsets is None:
Contributor:

Maybe make this a separate method?

Member Author:

Done.

This method properly handles a pair of sentences.
"""
return self._tokenize(sentence_1, sentence_2)
tokens.append(Token(text=token_str, text_id=token_id, type_id=token_type_id, idx=start))
Contributor:

Sorry if I've missed or forgotten an explanation you already gave, but I don't really follow why this needs to be token_str, instead of just doing a reverse lookup on token_id, as (I believe) we had before. What do we gain from this? What do we lose? (I gave one of the latter in other comments on this PR.)

I'm pretty sure that nothing else in the indexing pipeline will look at token.text, because the pretrained transformer indexer can just look at text_id and type_id, and ignore token.text. So what's the point of this change? Is it to make offsets easier? So you can do idx + len(text) in order to get character offsets? If that's the problem that's being solved here, seems easier to just change idx to start_index and end_index, and we can keep the semantics of token.text the same.

And, sorry that I didn't realize this issue earlier.

Member Author:

The original purpose was so that the length is correct. But I also like the intuition that you can concatenate tokens and get real text back.

On the other hand, I like the intuition that there is a 1:1 mapping between token string and token id. We can't have both.

@matt-gardner (Contributor) May 8, 2020

Well, we can have both by just adding another (optional) field to Token, like display_text or original_text, or something. And beyond intuition is the fact that we have parts of our code that rely on the 1:1 mapping, and it's not clear to me how I could modify that logic to handle the case where we lose the 1:1 mapping. It's pretty easy to think of how to modify whatever code depends on the length being right, if there is any.

Member Author:

The whole Token class doesn't make sense for transformers. What's pos_ or lemma_ on a word piece?

After 1.0 ships we need to do some real clean-up of this API. I'm beginning to formulate ideas for 2.0.

Meanwhile, for 1.0, I'll add an idx_end field to track this.

Member Author:

I changed this back so that token.text is the word piece, and I added token.idx_end, which is one past the last character of the token.
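So the invariant after the change is roughly the one sketched below; the field values here are made up for illustration, assuming a Token with the idx/idx_end fields described above:

from allennlp.data.tokenizers import Token

text = "A, naive example"
# A wordpiece covering characters 3..8 of the original text ("naive");
# text_id=123 is an arbitrary illustrative vocabulary id.
token = Token(text="naive", text_id=123, type_id=0, idx=3, idx_end=8)
assert text[token.idx : token.idx_end] == "naive"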


def tokenize_sentence_pair(self, sentence_1: str, sentence_2: str) -> List[Token]:
Contributor:

I don't see this added back anywhere. Is the functionality just removed, or did I miss it?

Member Author:

Yes, that function is removed. You should now call tokenize() and add_special_tokens() separately.

Contributor:

Ah, I get it now, thanks.

Member Author:

At some point, I was tempted to remove the _add_special_tokens flag completely, assume it's always true, and then deal with special tokens in a concatenate() call that does the right thing. But I decided that is too big a change right now.

Contributor:

I agree with that change, and I don't think it's very large (not saying that it should be done in this PR, though). Well, it removes an argument, so it's a breaking change, OK...

Member Author:

Yes, it's not big in terms of lines-of-code, but in downstream impact. I'd have to check and fix all of the readers to be compatible with that change.
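A usage sketch of the tokenize()/add_special_tokens() split described in this thread; the names follow the discussion above, but treat the exact constructor arguments as assumptions rather than the final API:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-cased", add_special_tokens=False)
tokens_a = tokenizer.tokenize("The first sentence.")
tokens_b = tokenizer.tokenize("The second sentence.")
# Special tokens are added in a separate step instead of inside tokenization.
pair = tokenizer.add_special_tokens(tokens_a, tokens_b)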


def ids_to_tokens(
self, token_ids_a: List[int], token_ids_b: Optional[List[int]] = None
def tokens_from_string_tokens(
Contributor:

Do you really want to be defining functions inside the call to intra_word_tokenize?

Member Author:

Why not? If I don't need the function anywhere else?

Contributor:

It just seems like unfortunate overhead that happens in the middle of a pre-processing loop.

Member Author:

I refactored this, and I like the new version better. However, I always thought Python would JIT those functions once and never touch them again, so there is no overhead. So I tested it:

Before: 758 µs ± 943 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
After: 734 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So I guess it is 3% faster afterwards. Not a lot, but could matter in some cases. Interesting.

Contributor:

Huh, thanks for measuring. I did not assume anything would get JITed, so I thought there would be overhead, but I had no intuition for how large of a difference it would make. Pretty small in the end, but if it's in the inner-loop of data processing, even small gains that increase GPU utilization are probably worth it.
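The effect being measured is just the cost of re-creating a closure on every call. A self-contained way to see it (illustrative only, not the PR's benchmark) is:

import timeit

def _combine(a: int, b: int) -> int:
    return a + b

def with_nested_function(data):
    def combine(a: int, b: int) -> int:  # re-created on every call
        return a + b
    return [combine(x, x) for x in data]

def with_hoisted_function(data):
    return [_combine(x, x) for x in data]

data = list(range(1_000))
print("nested :", timeit.timeit(lambda: with_nested_function(data), number=1_000))
print("hoisted:", timeit.timeit(lambda: with_hoisted_function(data), number=1_000))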

@matt-gardner (Contributor) left a comment:

The important stuff I cared about is done, the rest is minor. LGTM when you're happy with it.

@dirkgr (Member Author) commented May 12, 2020

Given that the model tests at allenai/allennlp-models#44 are passing, I'm merging this.

@dirkgr dirkgr merged commit e3d72fc into allenai:master May 12, 2020
@dirkgr dirkgr deleted the Tokenizers branch May 12, 2020 22:30