This is currently blocked on huggingface/transformers#3057.
Do you think support for using both the fast and slow versions should be provided? Maybe the fast version is buggy for some cases, so the user could then choose the slow one.
"NL", | ||
"P", |
Why were these wrong?
The new code sets `token.text` as a slice from the original text, using the offsets provided by the new tokenizers. The original text will never contain the `##` characters that we had there before.
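For context, a minimal sketch of the idea (not the PR's actual code), assuming a recent transformers release with a fast tokenizer: the offsets index into the original string, so each token's text can be a plain slice and never contains `##` markers.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
text = "AllenNLP uses wordpieces"
encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

for start, end in encoding["offset_mapping"]:
    print(text[start:end])  # surface form sliced from `text`; no "##" markers
```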
Where do you get the actual wordpieces now? Is that stored anywhere in the `Token` object? It'd be nice to have that show up in a test somewhere.
Now that I think about this a bit more, I'm not sure how this interacts with the interpret code. I think we do vocab lookups on ids, instead of looking at `token.text`, but I'd have to check to be sure.
Specifically, I think this might break: /~https://github.com/allenai/allennlp/blob/master/allennlp/interpret/attackers/hotflip.py#L312-L313. Dang, maybe we do need to have `token.text` mean something consistent... Or at least have a consistent way to set the wordpiece text that doesn't depend on the tokenizer / indexer being used. I need an API call that will let me take a token ID from my vocabulary and replace a token in a `TextField` with the text corresponding to that ID. I'm not sure what the best way to do that in this new world is, but that's a hard requirement.
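To illustrate the requirement (a hypothetical helper, not something in allennlp or in this PR): hotflip-style code wants to look up the surface string for a vocabulary id and swap it into a `TextField`, which only works if `token.text` round-trips through the vocabulary.

```python
from allennlp.data import Token, Vocabulary
from allennlp.data.fields import TextField

def replace_token(field: TextField, position: int, new_id: int,
                  vocab: Vocabulary, namespace: str = "tokens") -> None:
    # Look the id up in the vocabulary and rebuild the Token at `position`.
    new_text = vocab.get_token_from_index(new_id, namespace=namespace)
    field.tokens[position] = Token(text=new_text)
```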
@@ -169,99 +160,59 @@ def test_token_idx_roberta(self):
"<s>",
"A",
",",
"Ġnaïve",  # RoBERTa has a funny way of encoding combining characters.
"naïve",
I wonder if this breaks RoBERTa usage, because the tokenizer is what changed but the model is the same.
RoBERTa usage is broken because of the leading spaces. The special characters are now handled the right way (except where they break the offsets).
@bryant1410, @dirkgr We're (@HUGGINFACE) actively looking at all these blocking issues, thanks for reporting 👍
We could certainly back up to that possibility, but I was really excited about getting rid of the hacky token offset code we had there. If we also allow the slow tokenizers, we'd have to bring that back. Edit: I also think that @mfuntowicz and co are on the case, so these won't be buggy for long. I'd rather not keep support for the old ones around to work around a problem that will disappear in a few days.
Found some more: huggingface/transformers#3091, huggingface/transformers#3088. I'll go do something else now until the next huggingface release. No sense in filing 100 nearly identical issues :-)
@ZhaofengWu, you might be interested in this, because it re-writes your version of …
Wait, there is …
I'm a bit confused by the current approach -- do you only use the dummies to get the type IDs? What makes the previous approach no longer work?
The new fast tokenizers need to know at …
No, it uses the dummies to find out where the special tokens go. I stole that code from your code below. The only reason I changed the dummy numbers is that we have to pass strings to the tokenizer now, not integers anymore. Token …
There were a few things, but the main one is this: In this version, …
And we also need to supply that option in …
I put it there just to be safe. The new tokenizers actually print a warning when you turn off special tokens. I would have expected them to print a warning only when you pass the wrong thing into …
I added huggingface/transformers#3112, porting some of our tests to huggingface, so we have a better chance at getting compatible updates.
# Conflicts:
#   allennlp/common/util.py
#   allennlp/data/token_indexers/pretrained_transformer_indexer.py
#   allennlp/data/tokenizers/pretrained_transformer_tokenizer.py
#   allennlp/tests/data/tokenizers/pretrained_transformer_tokenizer_test.py
#   setup.py
# Huggingface tokenizers have different ways of remembering whether they lowercase or not. Detecting it
# this way seems like the least brittle way to do it.
# this way seems like the least brittle way to do it. )
?
fixed
setup.py
@@ -67,7 +67,7 @@
"flaky",
"responses>=0.7",
"conllu==2.3.2",
"transformers>=2.8.0,<2.9.0",
"transformers==2.9",
>=2.9,<2.10?
fixed
Co-authored-by: Santiago Castro <bryant@montevideo.com.uy>
This has gotten way out of control. In huggingface, the tokenizers are the only source for how to properly format input for some transformer model. They are called "tokenizers", but they do all manner of preprocessing. This version basically reverse-engineers a given huggingface tokenizer (with regard to special tokens and token type IDs, at least), so that we can perform those steps separately. I want to move to a system where tokenizers tokenize, and other parts of the system add special tokens, concatenate text fields, set type IDs, split long sequences, and so on. This is not that system, but it is a step in that direction.
Practical benefits of this PR are faster tokenizers, and reliable offsets into the original text as long as the HF tokenizer supports it. Not all do, so sometimes we have to fall back to my crappy version of calculating offsets.
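Roughly, the reverse-engineering trick looks like the sketch below (a simplified illustration, not the PR's implementation; the dummy values and variable names are arbitrary): tokenize a pair of dummy pieces with special tokens turned on, then see where the dummies land to learn what the model wants before, between, and after the two sequences.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

# Two dummy pieces that each map to a single wordpiece and a single id.
dummy_a, dummy_b = "a", "b"
id_a = tokenizer.convert_tokens_to_ids(dummy_a)
id_b = tokenizer.convert_tokens_to_ids(dummy_b)

ids = tokenizer.encode(dummy_a, dummy_b, add_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(ids)

# Whatever surrounds the dummies is what the model wants around real input.
start = tokens[: ids.index(id_a)]
middle = tokens[ids.index(id_a) + 1 : ids.index(id_b)]
end = tokens[ids.index(id_b) + 1 :]
print(start, middle, end)  # for BERT: ['[CLS]'] ['[SEP]'] ['[SEP]']
```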
allennlp/common/util.py
@@ -519,9 +519,9 @@ def sanitize_wordpiece(wordpiece: str) -> str:
    if wordpiece.startswith("##"):
        return wordpiece[2:]
    elif wordpiece.startswith("Ġ"):
        return wordpiece[1:]
        return wordpiece.replace("Ġ", " ")
Changing this method might actually interact poorly with the demo. You'll be adding spaces in places that the demo isn't expecting. Not sure how much of a difference this makes, though.
Huggingface has also changed its tack on this. At some point the argument was that `'Ġ'` is just the encoding for a space, and the space is part of the token, so I made it this way. But in the latest huggingface, the space is no longer part of the token. I'm hoping for clarification from @mfuntowicz on the matter, but I guess if in doubt, I'll make this match the current behavior.
I don't know if the demo is using `sanitize_wordpiece()` though. It's not a commonly used call as far as I know.
Oh, sorry, you're right. I thought `sanitize_wordpiece` was called by `sanitize` (which is used in predictors), but it's not.
I've changed this back.
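For reference, the reverted behavior boils down to something like this (condensed from the diff above; the real function may handle more cases):

```python
def sanitize_wordpiece(wordpiece: str) -> str:
    """Strip tokenizer-specific markers so a wordpiece reads like normal text."""
    if wordpiece.startswith("##"):   # BERT-style continuation marker
        return wordpiece[2:]
    elif wordpiece.startswith("Ġ"):  # GPT-2/RoBERTa marker for a leading space
        return wordpiece[1:]
    else:
        return wordpiece

assert sanitize_wordpiece("##ing") == "ing"
assert sanitize_wordpiece("Ġnaïve") == "naïve"
```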
""" | ||
return tokens1 + (tokens2 or []) | ||
|
||
def special_tokens_for_sequence(self) -> int: |
This name makes me expect a return type of `List[str]`, instead of `int`. Same with below.
I changed the name.
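Whatever the final name, the reason the helper returns an `int` is length budgeting: a sketch of the pattern, using huggingface's own helper for the same purpose (model name and text are placeholders, and the allennlp method name differs):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
max_length = 512

# Leave room for the special tokens the model will add around one sequence.
budget = max_length - tokenizer.num_special_tokens_to_add(pair=False)
wordpieces = tokenizer.tokenize("some very long document ...")[:budget]
ids = tokenizer.build_inputs_with_special_tokens(
    tokenizer.convert_tokens_to_ids(wordpieces)
)
assert len(ids) <= max_length
```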
@@ -35,20 +35,6 @@ def test_as_array_produces_token_sequence_bert_cased(self):
    indexed = indexer.tokens_to_indices(allennlp_tokens, vocab)
    assert indexed["token_ids"] == expected_ids

def test_as_array_produces_token_sequence_bert_cased_sentence_pair(self):
Why don't we need these tests anymore?
They are testing `tokenize_sentence_pair()`, which I removed. I resurrected them to test `add_special_tokens()`.
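Roughly what the resurrected tests now exercise (a sketch; the real tests' inputs and assertions differ):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-cased", add_special_tokens=False)

tokens_a = tokenizer.tokenize("AllenNLP is great")
tokens_b = tokenizer.tokenize("Really, it is")

# add_special_tokens() inserts the model's special tokens ([CLS]/[SEP] for BERT)
# and sets type ids, replacing the old tokenize_sentence_pair() helper.
pair = tokenizer.add_special_tokens(tokens_a, tokens_b)
print([t.text for t in pair])
```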
"NL", | ||
"P", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you get the actual wordpieces now? Is that stored anywhere in the Token
object? It'd be nice to have that show up in a test somewhere.
Now that I think about this a bit more, I'm not sure how this interacts with the interpret code. I think we do vocab lookups on ids, instead of looking at token.text
, but I'd have to check to be sure.
"NL", | ||
"P", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, I think this might break: /~https://github.com/allenai/allennlp/blob/master/allennlp/interpret/attackers/hotflip.py#L312-L313. Dang, maybe we do need to have token.text
mean something consistent... Or at least have a consistent way to set the wordpiece text that doesn't depend on the tokenizer / indexer being used. I need an API call that will let me take a token ID from my vocabulary and replace a token in a TextField
with the text corresponding to that ID. I'm not sure what the best way to do that in this new world is, but that's a hard requirement.
self._stride = stride
self._truncation_strategy = truncation_strategy
self._calculate_character_offsets = calculate_character_offsets
# Reverse-engineer the tokenizer for two sequences
Maybe split this into a separate method?
I looked into doing this, and I don't like what it becomes. We'd have to return 5 values from the method, and pass in two tokenizers. There is no way to get a tokenizer with special tokens from one without special tokens, and the function needs both. It's all solvable, but not worth it.
if self._calculate_character_offsets:
    # If we don't have token offsets, try to calculate them ourselves.
    if token_offsets is None:
Maybe make this a separate method?
This methods properly handles a pair of sentences.
"""
return self._tokenize(sentence_1, sentence_2)
tokens.append(Token(text=token_str, text_id=token_id, type_id=token_type_id, idx=start))
Sorry if I've missed or forgotten an explanation you already gave, but I don't really follow why this needs to be `token_str`, instead of just doing a reverse lookup on `token_id`, as (I believe) we had before. What do we gain from this? What do we lose? (I gave one of the latter in other comments on this PR.)
I'm pretty sure that nothing else in the indexing pipeline will look at `token.text`, because the pretrained transformer indexer can just look at `text_id` and `type_id`, and ignore `token.text`. So what's the point of this change? Is it to make offsets easier? So you can do `idx + len(text)` in order to get character offsets? If that's the problem that's being solved here, it seems easier to just change `idx` to `start_index` and `end_index`, and we can keep the semantics of `token.text` the same.
And, sorry that I didn't realize this issue earlier.
The original purpose was so that the length is correct. But I also like the intuition that you can concatenate tokens and get real text back.
On the other hand, I like the intuition that there is a 1:1 mapping between token string and token id. We can't have both.
Well, we can have both by just adding another (optional) field to `Token`, like `display_text` or `original_text`, or something. And beyond intuition is the fact that we have parts of our code that rely on the 1:1 mapping, and it's not clear to me how I could modify that logic to handle the case where we lose the 1:1 mapping. It's pretty easy to think of how to modify whatever code depends on the length being right, if there is any.
The whole `Token` class doesn't make sense for transformers. What's `pos_` or `lemma_` on a word piece? After 1.0 ships we need to do some real clean-up of this API. I'm beginning to formulate ideas for 2.0.
Meanwhile, for 1.0, I'll add an `idx_end` field to track this.
I changed this back so that `token.text` is the word piece, and I added `token.idx_end`, which is one past the last character of the token.
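So the resulting contract, as a small sketch (illustrative only; special tokens carry no offsets into the original string):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

text = "A naïve example"
tokenizer = PretrainedTransformerTokenizer("bert-base-cased")

for token in tokenizer.tokenize(text):
    # token.text is the word piece; idx/idx_end slice out the surface form.
    if token.idx is not None and token.idx_end is not None:
        print(token.text, "->", text[token.idx : token.idx_end])
```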
def tokenize_sentence_pair(self, sentence_1: str, sentence_2: str) -> List[Token]:
I don't see this added back anywhere. Is the functionality just removed, or did I miss it?
Yes, that function is removed. You should now call `tokenize()` and `add_special_tokens()` separately.
Ah, I get it now, thanks.
At some point, I was tempted to remove the `_add_special_tokens` flag completely, assume it's always true, and then deal with special tokens in a `concatenate()` call that does the right thing. But I decided that is too big a change right now.
I agree with that change, and I don't think it's very large (not saying that it should be done in this PR, though). Well, it removes an argument, so it's a breaking change, ok...
Yes, it's not big in terms of lines-of-code, but in downstream impact. I'd have to check and fix all of the readers to be compatible with that change.
def ids_to_tokens(
    self, token_ids_a: List[int], token_ids_b: Optional[List[int]] = None
def tokens_from_string_tokens(
Do you really want to be defining functions inside the call to `intra_word_tokenize`?
Why not? If I don't need the function anywhere else?
It just seems like unfortunate overhead that happens in middle of a pre-processing loop.
I refactored this, and I like the new version better. However, I always thought Python would JIT those functions once and never touch them again, so there is no overhead. So I tested it:
Before: 758 µs ± 943 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
After: 734 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So I guess it is 3% faster afterwards. Not a lot, but could matter in some cases. Interesting.
Huh, thanks for measuring. I did not assume anything would get JITed, so I thought there would be overhead, but I had no intuition for how large of a difference it would make. Pretty small in the end, but if it's in the inner-loop of data processing, even small gains that increase GPU utilization are probably worth it.
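For anyone curious, a self-contained toy version of that comparison (hypothetical helpers; not the PR's code) looks something like this:

```python
import timeit

def _helper(x):
    return x + 1

def with_nested_def(data):
    def helper(x):  # re-created on every call to with_nested_def
        return x + 1
    return [helper(x) for x in data]

def with_module_level(data):
    return [_helper(x) for x in data]

data = list(range(1000))
print(timeit.timeit(lambda: with_nested_def(data), number=1000))
print(timeit.timeit(lambda: with_module_level(data), number=1000))
```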
The important stuff I cared about is done, the rest is minor. LGTM when you're happy with it.
Given that the model tests at allenai/allennlp-models#44 are passing, I'm merging this.