`CLIPTokenizer` does not work as expected #2018

fdtomasi · 2024-12-11T16:00:01Z

To Reproduce

from keras_hub import models
tokenizer = models.Tokenizer.from_preset(
    "clip_vit_h_14_laion2b_s32b_b79k", 
    sequence_length=77,
    pad_with_end_token=True,
)
# Wrap the tokenizer in the preprocessor; both are given sequence_length=77.
preprocessor = models.CLIPPreprocessor(tokenizer, sequence_length=77)
preprocessor(["a cat sitting on the table"])

which returns

{'token_ids': <tf.Tensor: shape=(1, 77), dtype=int32, numpy=
 array([[49406,   320,  2368,  4919,   525,   518,  2175,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0, 49407]], dtype=int32)>,
 'padding_mask': <tf.Tensor: shape=(1, 77), dtype=bool, numpy=
 array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True,  True,  True,  True,  True,
          True,  True,  True,  True,  True]])>}

This is surprising for a few reasons. First, even with pad_with_end_token=True, the padding uses 0 (which corresponds to ! in this vocabulary) rather than the end token. Second, the end token is appended after the padding instead of directly after the original sequence.
Finally, padding_mask is all True, whereas I would expect it to be False at the padding positions.
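For reference, the layout expected here can be sketched in plain Python. This is only an illustration of the expectation described above, not the keras_hub implementation: `pack_with_end_token` is a hypothetical helper, and the token ids are copied from the output above.

```python
def pack_with_end_token(token_ids, start_id, end_id, sequence_length):
    """Wrap ids in start/end tokens, then right-pad with the end token.

    Hypothetical helper for illustration; not part of keras_hub.
    """
    # Leave room for the start and end tokens when truncating.
    body = list(token_ids)[: sequence_length - 2]
    packed = [start_id] + body + [end_id]
    num_real = len(packed)
    num_pad = sequence_length - num_real
    # Padding reuses the end token id instead of 0.
    ids = packed + [end_id] * num_pad
    # True for real tokens (including start/end), False for padding.
    mask = [True] * num_real + [False] * num_pad
    return ids, mask


# Ids taken from the output above for "a cat sitting on the table".
ids, mask = pack_with_end_token(
    [320, 2368, 4919, 525, 518, 2175],
    start_id=49406,
    end_id=49407,
    sequence_length=77,
)
print(ids[:9])
print(mask[:9])
```

Under this sketch, the end token directly follows the last real token, the remaining positions are filled with the end token id, and the mask is False only on those filled positions.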

Additional context
Using keras_hub==0.18.1, keras==3.7.0.

james77777778 · 2024-12-24T10:11:17Z

You can work around the issue by not specifying sequence_length in Tokenizer.
I have proposed a fix for this in #2031.

import keras_hub

preset = "clip_vit_h_14_laion2b_s32b_b79k"
text = ["a cat sitting on the table"]

tokenizer = keras_hub.models.Tokenizer.from_preset(
    preset, pad_with_end_token=True
)
preprocessor = keras_hub.models.CLIPPreprocessor(tokenizer, sequence_length=77)
print(preprocessor(text))

{'token_ids': Array([[49406,   320,  2368,  4919,   525,   518,  2175, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
        49407, 49407, 49407, 49407, 49407]], dtype=int32), 'padding_mask': Array([[ True,  True,  True,  True,  True,  True,  True,  True, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False]], dtype=bool)}

mattdangerw · 2025-01-08T19:35:49Z

Thanks for the bug report! I think @james77777778's suggestion is the correct one: don't set the sequence length on both the tokenizer and the preprocessor.

mattdangerw · 2025-01-08T19:40:12Z

In general, we want our tokenizers to handle only the string to ragged int mapping. Tokenizers should not pad; they should instead be composed with other layers (e.g. StartEndPacker) for special token packing and padding. The goal is to keep our tokenizers narrow rather than turning them into layers that do everything: flexibility through composition rather than sprawling init args.
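A rough sketch of that split, in plain Python rather than the real layers. Everything here is an assumption made for illustration: the toy vocabulary, the special token ids, and the `tokenize` and `pack` functions are hypothetical stand-ins, and `pack` only imitates the role a packing layer such as StartEndPacker would play.

```python
# Hypothetical toy vocabulary; real CLIP uses a byte-pair vocabulary.
TOY_VOCAB = {"a": 1, "cat": 2, "on": 3, "table": 4, "the": 5}
# Hypothetical special token ids chosen for this sketch only.
START_ID, END_ID, PAD_ID = 100, 101, 0


def tokenize(texts):
    """String -> ragged ints only: no special tokens, no padding."""
    return [[TOY_VOCAB[word] for word in text.split()] for text in texts]


def pack(ragged, sequence_length, pad_id=PAD_ID):
    """Separate step: add start/end tokens, truncate, pad to a fixed length."""
    batch = []
    for row in ragged:
        # Truncate so the start and end tokens always fit.
        row = [START_ID] + row[: sequence_length - 2] + [END_ID]
        batch.append(row + [pad_id] * (sequence_length - len(row)))
    return batch


ragged = tokenize(["a cat", "the cat on the table"])
dense = pack(ragged, sequence_length=6)
# Padding policy is chosen at the packing step, not in the tokenizer.
dense_end_padded = pack(ragged, sequence_length=6, pad_id=END_ID)
print(ragged)
print(dense)
print(dense_end_padded)
```

In this arrangement the choice of pad value lives entirely in the packing step, so the tokenizer needs no padding-related arguments.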

However, CLIPTokenizer seems to buck this trend. Is there a reason we need to have the pad_with_end_token argument on the tokenizer at all? Also, what is CLIPPreprocessor for? In general we have a tokenizer (unspecialized for any task) and a preprocessor for a specific task. We might want to do some cleanup of the CLIP API.

@divyashreepathihalli and @james77777778 what do you think?

james77777778 · 2025-01-14T15:17:25Z

@mattdangerw I wasn't aware of the tagging until today...

> However, CLIPTokenizer seems to buck this trend. Is there a reason we need to have the pad_with_end_token argument on the tokenizer at all?

That option is required by some downstream tasks, such as SD3.

> Also what is CLIPPreprocessor for?

It is currently specific to SD3.

These implementations may be tailored toward SD3, since they were initially developed for use there. I can propose a PR to refactor them.

mehtamansi29 self-assigned this Dec 12, 2024

mehtamansi29 added the type:Bug Something isn't working label Dec 23, 2024

james77777778 mentioned this issue Dec 24, 2024

Fix sequence_length option in CLIPTokenizer #2031

Closed

james77777778 mentioned this issue Jan 18, 2025

Remove pad_with_end_token argument in CLIPTokenizer. #2051

Draft
