Official implementation of "AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation" by Lianyu Pang, Jian Yin, Baoquan Zhao, Qing Li, and Xudong Mao.
Recent advances in text-to-image models have enabled high-quality personalized image synthesis of user-provided concepts with flexible textual control. In this work, we analyze the limitations of two primary techniques in text-to-image personalization: Textual Inversion and DreamBooth. When integrating the learned concept into new prompts, Textual Inversion tends to overfit the concept, while DreamBooth often overlooks it. We attribute these issues to the incorrect learning of the embedding alignment for the concept. We introduce AttnDreamBooth, a novel approach that addresses these issues by separately learning the embedding alignment, the attention map, and the subject identity in different training stages. We also introduce a cross-attention map regularization term to enhance the learning of the attention map. Our method demonstrates significant improvements in identity preservation and text alignment compared to the baseline methods. Code will be made publicly available.
Our code is primarily based on Diffusers-DreamBooth and relies on the diffusers library.
To set up the environment, run the following commands:

```bash
conda env create -f environment.yaml
conda activate AttnDreamBooth
```
Initialize an Accelerate environment with:

```bash
accelerate config
```
To use the `stabilityai/stable-diffusion-2-1-base` model, you may need to log in to Hugging Face as follows:

- Use `huggingface-cli login` in the terminal.
- Enter your access token from your Hugging Face account's token settings.
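Alternatively, you can log in from Python. Below is a minimal sketch using the `huggingface_hub` package (installed alongside `diffusers`); the token string is a placeholder:

```python
# Minimal sketch: programmatic Hugging Face login (alternative to `huggingface-cli login`).
from huggingface_hub import login

login(token="hf_xxx")  # placeholder token; never commit real tokens to source control
```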
Our datasets were originally collected and are provided by Textual Inversion and DreamBooth.
We provide pretrained checkpoints for two objects. You can download the sample images and their corresponding pretrained checkpoints.
| Concepts   | Samples | Models |
| ---------- | ------- | ------ |
| child doll | images  | model  |
| grey sloth | images  | model  |
You can run the `bash_script/train_attndreambooth.sh` script to train your own model. Before executing the training command, ensure that you have configured the following parameters in `train_attndreambooth.sh`:

- Line 6: `output_dir`. The directory where the fine-tuned model will be saved.
- Line 8: `instance_dir`. The directory containing the images of the target concept.
- Line 10: `category`. The category of the target concept.
For example, to train the concept `child doll` from the Pretrained Checkpoints section, set the parameters as follows:

```bash
output_dir="./models/"
instance_dir="./dataset/child_doll"
category="doll"
```
To run the training script, use the following command:

```bash
bash bash_script/train_attndreambooth.sh
```
Notes:
- All training arguments can be found in `train_attndreambooth.sh` and are set to their defaults according to the official paper.
- Please refer to `train_attndreambooth.sh` and `train_attndreambooth.py` for more details on all parameters.
We have explored a simple yet effective strategy to reduce the training time of our method: increasing the learning rate while decreasing both the training steps and the batch size in the third training stage. This reduces the average training time from 20 minutes to 6 minutes. We observed that the fast version performs very close to the original model for short prompts, but slightly under-performs for complex prompts.
To use the fast version of AttnDreamBooth, set the stage 3 configuration in `bash_script/train_attndreambooth.sh` as follows:
```bash
unet_learning_rate="1e-5"
unet_save_step=200
unet_train_steps=200
unet_attn_mean=2
unet_attn_var=5
unet_bs=4
unet_ga=1
unet_validation_steps=100
```
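For intuition, `unet_attn_mean` and `unet_attn_var` weight the cross-attention map regularization term mentioned above. The sketch below is only an illustrative reading of such a term (an assumption on our part, not the exact implementation; see `train_attndreambooth.py` for the authoritative version): it penalizes differences in the mean and variance between the new token's cross-attention map and its category token's map.

```python
# Illustrative sketch (assumption): a mean/variance-matching penalty between two
# cross-attention maps. The real regularization lives in train_attndreambooth.py.
import torch

def attn_map_regularization(attn_new: torch.Tensor,
                            attn_category: torch.Tensor,
                            w_mean: float = 2.0,
                            w_var: float = 5.0) -> torch.Tensor:
    """Penalize mean/variance differences between the new token's cross-attention
    map and the category token's map. Both inputs have shape (heads, H, W)."""
    mean_term = (attn_new.mean() - attn_category.mean()) ** 2
    var_term = (attn_new.var() - attn_category.var()) ** 2
    return w_mean * mean_term + w_var * var_term

# Example with random maps; real maps come from the U-Net's cross-attention layers.
loss = attn_map_regularization(torch.rand(8, 16, 16), torch.rand(8, 16, 16))
```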
You can run the `bash_script/inference.sh` script to generate images. Before executing the inference command, ensure that you have configured the following parameters in `inference.sh`:

- Line 2: `learned_embedding_path`. The path to the embeddings learned in the first stage.
- Line 4: `checkpoint_path`. The path to the fine-tuned models trained in the third stage.
- Line 6: `category`. The category of the target concept.
- Line 8: `output_dir`. The directory where the generated images will be saved.
To run the inference, use the following command:

```bash
bash bash_script/inference.sh
```
Notes:
- If you did not set `--only_save_checkpoints` during the training phase, you can specify `--pretrained_model_name_or_path` as the path to the full model, and then omit `--checkpoint_path`.
- We offer learned embeddings and models for two objects here for direct experimentation.
- For convenience, you can specify the path to a text file with `--prompt_file`, where each line contains a prompt. For example:

  ```
  A photo of a {}
  A {} floats on the water
  A {} latte art
  ```

- Specify the concept using `{}`, and we will replace it with the concept's placeholder token and the specified category (see the sketch after this list).
- The resulting images will be saved in the directory `{save_dir}/{prompt}`.
- For detailed information on all parameters, please consult `inference.py` and `inference.sh`.
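As a concrete illustration of the `{}` replacement described above, here is a minimal sketch; the placeholder token string and helper function are hypothetical, and `inference.py` is the reference implementation:

```python
# Minimal sketch (assumption): how prompts from --prompt_file are expanded.
# The placeholder token "<new>" and this helper are hypothetical; see inference.py
# for how the actual placeholder token and category are inserted.
def expand_prompts(prompt_file: str, placeholder_token: str, category: str) -> list[str]:
    with open(prompt_file) as f:
        prompts = [line.strip() for line in f if line.strip()]
    return [p.format(f"{placeholder_token} {category}") for p in prompts]

print(expand_prompts("prompts.txt", "<new>", "doll"))
# e.g. "A photo of a {}" -> "A photo of a <new> doll"
```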
We use the same evaluation protocol as used in Textual Inversion.
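For reference, below is a minimal sketch of the CLIP-based image- and text-alignment scores commonly used by this protocol. It uses the `transformers` CLIP model as an assumed stand-in; it is not the exact evaluation code.

```python
# Minimal sketch (assumption): CLIP-based image and text alignment scores,
# as commonly used by the Textual Inversion evaluation protocol.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated: list[Image.Image], references: list[Image.Image], prompt: str):
    gen = processor(images=generated, return_tensors="pt")
    ref = processor(images=references, return_tensors="pt")
    txt = processor(text=[prompt], return_tensors="pt", padding=True)

    gen_emb = torch.nn.functional.normalize(model.get_image_features(**gen), dim=-1)
    ref_emb = torch.nn.functional.normalize(model.get_image_features(**ref), dim=-1)
    txt_emb = torch.nn.functional.normalize(model.get_text_features(**txt), dim=-1)

    image_alignment = (gen_emb @ ref_emb.T).mean().item()  # identity preservation
    text_alignment = (gen_emb @ txt_emb.T).mean().item()   # prompt fidelity
    return image_alignment, text_alignment
```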
Our code is mainly based on Diffusers-DreamBooth. A huge thank you to the authors for their valuable contributions.
```bibtex
@article{pang2024attndreambooth,
  title={AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation},
  author={Pang, Lianyu and Yin, Jian and Zhao, Baoquan and Wu, Feize and Wang, Fu Lee and Li, Qing and Mao, Xudong},
  journal={arXiv preprint arXiv:2406.05000},
  year={2024}
}
```