Remove `pad_with_end_token` argument in `CLIPTokenizer`. #2051

james77777778 · 2025-01-18T16:31:01Z

Related to #2018

I have verified that `CLIPTokenizer` should always use `end_token_id` as `pad_token_id` in both CLIP and SD3.
I’m not sure why I initially implemented the `pad_with_end_token` arg. It might be related to how the internal SD3 code was implemented.
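
For illustration, here is a minimal plain-Python sketch of the padding behavior described above. It is not the keras_hub implementation: the helper `pack` and the placeholder ids 11, 22, 33 are hypothetical, and the start/end ids (49406 and 49407) are assumed to match the CLIP vocabulary.

```python
# Hypothetical sketch, not part of keras_hub.
START_TOKEN_ID = 49406  # assumed id of <|startoftext|>
END_TOKEN_ID = 49407  # assumed id of <|endoftext|>, reused as the pad id


def pack(token_ids, sequence_length):
    """Add start/end tokens, then truncate or pad with END_TOKEN_ID."""
    # Reserve two slots for the start and end tokens.
    body = list(token_ids)[: sequence_length - 2]
    packed = [START_TOKEN_ID] + body + [END_TOKEN_ID]
    num_real = len(packed)
    padding = [END_TOKEN_ID] * (sequence_length - num_real)
    # The mask marks real tokens (including the first end token) with 1.
    mask = [1] * num_real + [0] * len(padding)
    return packed + padding, mask


ids, mask = pack([11, 22, 33], sequence_length=8)
print(ids)
print(mask)
```

Padding positions hold the same id as the end token, so only the mask tells real tokens apart from padding.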

`CLIPPreprocessor` is still needed for both CLIP and SD3. When I implemented it, I didn’t have a clear idea about adding a task class for CLIP.
@mattdangerw @divyashreepathihalli WDYT?

The script to verify the outputs:

import keras
import numpy as np
import transformers

import keras_hub

text = "a cat sitting on the table"

# KerasHub side: tokenize, add start/end tokens, and pad to 16 tokens.
tokenizer = keras_hub.models.Tokenizer.from_preset("clip_vit_base_patch32")
preprocessor = keras_hub.models.CLIPPreprocessor(tokenizer, sequence_length=16)
keras_results = preprocessor(text)
print(keras_results)

# Hugging Face reference: same text, padded to the same length.
tokenizer = transformers.CLIPTokenizerFast.from_pretrained(
    "openai/clip-vit-base-patch32"
)
transformers_results = tokenizer(text, padding="max_length", max_length=16)
print(transformers_results)

# Token ids and padding masks should match between the two libraries.
np.testing.assert_allclose(
    keras.ops.convert_to_numpy(keras_results["token_ids"]),
    transformers_results["input_ids"],
)
np.testing.assert_allclose(
    keras.ops.convert_to_numpy(keras_results["padding_mask"]),
    transformers_results["attention_mask"],
)

james77777778 added the kokoro:force-run Runs Tests on GPU label Jan 18, 2025

kokoro-team removed the kokoro:force-run Runs Tests on GPU label Jan 18, 2025

Remove pad_with_end_token argument.

8be7564

james77777778 force-pushed the clean-up-clip-tokenizer branch from 9871bb1 to 8be7564 Compare January 18, 2025 16:58

james77777778 added the kokoro:force-run Runs Tests on GPU label Jan 18, 2025

kokoro-team removed the kokoro:force-run Runs Tests on GPU label Jan 18, 2025

james77777778 marked this pull request as draft January 19, 2025 05:45
