CLIPTokenizer does not work as expected #2018
You can work around the issue by not specifying `sequence_length` on the tokenizer:

```python
import keras_hub

preset = "clip_vit_h_14_laion2b_s32b_b79k"
text = ["a cat sitting on the table"]

tokenizer = keras_hub.models.Tokenizer.from_preset(
    preset, pad_with_end_token=True
)
preprocessor = keras_hub.models.CLIPPreprocessor(tokenizer, sequence_length=77)
print(preprocessor(text))
```

which prints:

```
{'token_ids': Array([[49406,   320,  2368,  4919,   525,   518,  2175, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407]], dtype=int32),
 'padding_mask': Array([[ True,  True,  True,  True,  True,  True,  True,  True, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False]], dtype=bool)}
```
Thanks for the bug! I think @james77777778's suggestion is the correct one: don't set the sequence length on both the tokenizer and the preprocessor.
In general, we want our tokenizers to just handle the string to ragged int mapping, and then be composed with other layers (e.g. a packing layer) that handle padding; tokenizers should not pad. However @divyashreepathihalli and @james77777778, what do you think?
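For illustration, here is a minimal sketch of that kind of split, assuming `keras_hub.layers.StartEndPacker` and the token ids (49406/49407) visible in the output above. This is a hypothetical layout of the proposed composition, not how `CLIPPreprocessor` is implemented today:

```python
import keras_hub

# Under the proposed split, the tokenizer would return only the raw prompt ids
# (values taken from the output above), with no special tokens and no padding:
token_ids = [[320, 2368, 4919, 525, 518, 2175]]  # "a cat sitting on the table"

# ...and a separate packing layer would add the start/end tokens, pad to the
# target length, and build the padding mask.
packer = keras_hub.layers.StartEndPacker(
    sequence_length=77,
    start_value=49406,   # <|startoftext|>, per the output above
    end_value=49407,     # <|endoftext|>
    pad_value=49407,     # "pad with end token" behavior
    return_padding_mask=True,
)
padded_ids, padding_mask = packer(token_ids)
```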
@mattdangerw I wasn't aware of the tagging until today...
That option is required by some downstream tasks, such as SD3.
It is currently specific to SD3; these implementations might be tailored toward SD3 since they were initially developed for it. I can propose a PR to refactor them.
To Reproduce
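A hypothetical reconstruction of the snippet, based on the workaround and maintainer comments above (the key difference being that `sequence_length=77` is also passed to the tokenizer):

```python
# Hypothetical reconstruction (not the verbatim original): sequence_length is
# set on both the tokenizer and the preprocessor.
import keras_hub

preset = "clip_vit_h_14_laion2b_s32b_b79k"
text = ["a cat sitting on the table"]

tokenizer = keras_hub.models.Tokenizer.from_preset(
    preset, sequence_length=77, pad_with_end_token=True
)
preprocessor = keras_hub.models.CLIPPreprocessor(tokenizer, sequence_length=77)
print(preprocessor(text))
```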
which returns output that is surprising for a few reasons. First, even if `pad_with_end_token=True`, the padding uses 0 (which corresponds to `!` in this vocabulary). Also, the end token is added at the end of the padding instead of at the end of the original sequence. Further, `padding_mask` is all `True`, while I would expect it to be `False` at the padding positions.
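To make the expected behavior concrete, here is a small assertion-style sketch (illustrative only, reusing `preprocessor` and `text` from the snippet above and the ids from the correct output earlier in the thread; these checks fail with the reported output):

```python
import numpy as np

out = preprocessor(text)
token_ids = np.asarray(out["token_ids"])[0]
padding_mask = np.asarray(out["padding_mask"])[0]

# Expected with pad_with_end_token=True: the end token (49407) directly after
# the 7 prompt ids, padding filled with 49407 rather than 0 ("!"), and the
# padding positions masked out.
assert token_ids[7] == 49407
assert (token_ids[8:] == 49407).all()
assert not padding_mask[8:].any()
```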
Additional context

Using `keras_hub==0.18.1`, `keras==3.7.0`.