
Add option to use E5 text encoder for SDXL #108

Closed
wants to merge 26 commits

Conversation

@A-Jacobson (Contributor) commented Jan 3, 2024

Adds a simple switch to use the e5 text encoder with SDXL. This is accomplished by splicing e5 into our joint text encoder class. Currently, this approach has some limitations:

  • I wanted to avoid changing the API and building out a large registry of text encoders, so only e5-large-v2 is currently supported. (t5 and the other text encoders I tested were not supported by AutoModel/AutoTokenizer, so I opted to keep it simple for now and leave them out.)
  • Decided to stick with OpenCLIP vs. OpenAI CLIP, as the HF model supports the projection layer needed for SDXL out of the box.
  • Truncate the sequence max length to 77 (CLIP's max length) vs. 512 (e5's max length); see the sketch after this list.
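
For illustration, a minimal sketch of the splicing idea. The class and method names here are assumptions, not the PR's actual code; only the AutoModel/AutoTokenizer loading and the 77-token truncation follow the description above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

class E5TextEncoder(torch.nn.Module):
    """Hypothetical stand-in for one of SDXL's CLIP text encoders."""

    def __init__(self, max_length: int = 77):
        super().__init__()
        # e5-large-v2 loads cleanly through AutoModel/AutoTokenizer, which is
        # why it is the only non-CLIP encoder supported for now.
        self.model = AutoModel.from_pretrained('intfloat/e5-large-v2')
        self.tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-large-v2')
        # Truncate to CLIP's max length (77) instead of e5's native 512 so the
        # tokenized outputs can be stacked with CLIP's in the dataloader.
        self.max_length = max_length

    def forward(self, captions: list[str]) -> torch.Tensor:
        tokens = self.tokenizer(
            captions,
            padding='max_length',
            max_length=self.max_length,
            truncation=True,
            return_tensors='pt',
        )
        out = self.model(input_ids=tokens.input_ids,
                         attention_mask=tokens.attention_mask)
        return out.last_hidden_state  # (batch, 77, 1024) for e5-large-v2
```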

To enable e5, add `use_e5: true` to both your dataset and model configs.

Edit: after feedback, `model_name` and `tokenizer_name_or_path` now need to be set to `sdxl-e5` to enable e5 training.
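
For reference, a hedged sketch of what the updated YAML might look like; the exact key names and nesting are assumed, not taken from this repo's configs:

```yaml
model:
  model_name: sdxl-e5            # assumed key layout; replaces the use_e5 flag
dataset:
  tokenizer_name_or_path: sdxl-e5
```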

@A-Jacobson changed the title from "Add option to use E5 text encoder to SDXL" to "Add option to use E5 text encoder for SDXL" on Jan 3, 2024
@A-Jacobson (Contributor, Author)

@jazcollins if you could confirm I didn't horrendously break the tokenizers or SDXL code you wrote, that would be great =)

@jazcollins (Contributor) left a comment

Overall looks good to me aside from some small suggestions to remove the `use_e5` flag!

Also - do we want to truncate the e5 tokenizer to the CLIP tokenizer's max_length? As the code is currently written - yes, we have to, because we stack the tokenized outputs in the dataloader. However, we don't have to do that, and could potentially have different-length tokenized outputs for the two text encoders, if that makes sense to do.

@A-Jacobson (Contributor, Author)

> Also - do we want to truncate the e5 tokenizer to the CLIP tokenizer's max_length? As the code is currently written - yes, we have to, because we stack the tokenized outputs in the dataloader. However, we don't have to do that, and could potentially have different-length tokenized outputs for the two text encoders, if that makes sense to do.

I believe they have to be the same length because they're concatenated on the embedding dim, not the sequence dim, later on. We can't do that concatenation unless we either pad CLIP to 512 or truncate e5 to 77. e5 uses WordPiece and CLIP uses BPE, so I THINK the number of tokens per prompt should be similar.
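
A tiny sketch of that constraint (batch size and embedding dims below are made up for illustration):

```python
import torch

clip_emb = torch.randn(2, 77, 1280)  # (batch, seq, dim): CLIP at 77 tokens
e5_emb = torch.randn(2, 77, 1024)    # e5 truncated to 77 tokens

# Concatenating on the embedding dim only works if sequence lengths match.
joint = torch.cat([clip_emb, e5_emb], dim=-1)  # ok: shape (2, 77, 2304)

e5_full = torch.randn(2, 512, 1024)  # e5 at its native 512 tokens
# torch.cat([clip_emb, e5_full], dim=-1) would raise a RuntimeError, since
# all dims except the concat dim must match; hence either pad CLIP to 512
# or truncate e5 to 77.
```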

@A-Jacobson (Contributor, Author)

Handled by #124.

@A-Jacobson closed this Mar 15, 2024