Objects Count Optimization for Text-to-image Diffusion Models

This repository contains the code related to our paper Objects Count Optimization for Text-to-image Diffusion Models.

Oz Zafar*¹, Idan Schwartz*¹, Lior Wolf¹

¹Tel Aviv University

* Denotes equal contribution

We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with this task due to the impossibility of finding an image for every number. We propose a novel technique that iteratively modifies the text conditioning and generates images, adjusting the number of objects via a counting loss, which is derived from the aggregation of attention map peaks. Our method offers three key advantages: (i) it is a zero-shot method requiring no additional training; (ii) it is a plug-and-play solution facilitating rapid changes to the counting and SD method; and (iii) it provides fine-grained user control. Through assessments of the generation of various objects, we demonstrate that our approach significantly enhances accuracy.

We propose a plug-and-play optimization, based on detection models, that improves the object counting accuracy of a text-to-image model.

Installations:

Hugging Face

Run this command to log in with your HF Hub token if you haven't before:

huggingface-cli login

Create conda environment

conda env create -f requirements.yml

conda activate objects_count_optimization

Dependencies

Object Counting model

Our optimization is based on CLIP-COUNT, a vision-language model for class-agnostic object counting.

The code can be easily adapted to other counting models. If you wish to use CLIP-COUNT, download the pre-trained weights from their repository and place them under the local clip_count folder.

Evaluation

Our evaluation is based on both CLIP-COUNT and YOLO.

For CLIP-COUNT setup, refer to the previous section.

For YOLO setup, please refer to the YOLOv9 docs.
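As a rough illustration of what a count-based evaluation could compute, the sketch below scores a list of per-image object counts against the requested amount. The metric choice (exact-match accuracy and mean absolute error) and the function name `count_metrics` are assumptions made for this example; they are not taken from the repository, and the counts are made-up values standing in for whatever CLIP-COUNT or YOLO would report.

```python
def count_metrics(detected_counts, target):
    """Compare per-image detected counts against the requested amount.

    Hypothetical helper: the metrics here are common choices, not
    necessarily the ones used by this repository's evaluation code.
    """
    if not detected_counts:
        raise ValueError("detected_counts must not be empty")
    n = len(detected_counts)
    # Fraction of images whose detected count matches the target exactly.
    accuracy = sum(1 for c in detected_counts if c == target) / n
    # Average absolute deviation from the target count.
    mae = sum(abs(c - target) for c in detected_counts) / n
    return {"accuracy": accuracy, "mae": mae}


# Made-up counts for four generated images with a requested amount of 6.
metrics = count_metrics([6, 7, 6, 4], target=6)
print(metrics)
```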

Run and Evaluate:

An overview of our method for optimizing a new discriminative token representation ($v_c$) using a pre-trained object detection model. For the prompt "A photo of a $S_c$ 6 beads", we expect the generated image to contain $c = 6$ beads. The object detection model, however, indicates that the number of beads in the generated image is much larger. We therefore generate images iteratively and optimize the token representation using an MSE loss. Once $v_c$ has been trained, more images with the target count can be generated by including it in the context of the input text.
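The optimization loop described above can be sketched on a toy problem. This is not the repository's implementation: the real pipeline generates an image with a diffusion model and counts objects with a detection or counting model, whereas here a hypothetical linear function `predicted_count` stands in for that whole generate-then-count step so the example runs without any model. The names, weights, learning rate, and step count are all assumptions chosen for illustration.

```python
def predicted_count(v_c, weights, bias):
    """Toy stand-in for 'generate an image, then count objects in it'."""
    return sum(w * v for w, v in zip(weights, v_c)) + bias


def optimize_count_token(v_c, target, weights, bias, lr=0.01, steps=200):
    """Gradient descent on v_c to minimize (predicted_count - target)^2."""
    v_c = list(v_c)
    for _ in range(steps):
        error = predicted_count(v_c, weights, bias) - target
        # For the linear stand-in, d(error^2)/dv_i = 2 * error * w_i.
        v_c = [v - lr * 2.0 * error * w for v, w in zip(v_c, weights)]
    return v_c


weights = [1.5, -0.5, 2.0]  # arbitrary stand-in parameters
bias = 20.0                 # the initial "count" is far above the target
v_c = optimize_count_token([0.0, 0.0, 0.0], target=6.0,
                           weights=weights, bias=bias)
final_count = predicted_count(v_c, weights, bias)
print(round(final_count, 3))
```

In the actual method the gradient has to flow through the image generator and the counting model, which is far more expensive than this closed-form update; the sketch only shows the shape of the loop (predict a count, compare it with the target using a squared error, update the token representation).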

To train and evaluate, use:

python run.py --clazz beads --amount 6 --train True --evaluate True
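To run the same command for several amounts, a small helper can build the argument lists. The sketch below uses only the flags shown in the command above; the helper name `build_command` and the amounts 3 and 9 are arbitrary choices for illustration, and the script's handling of other values is not verified here.

```python
import shlex


def build_command(clazz, amount, train=True, evaluate=True):
    """Build the run.py invocation using the flags documented above."""
    return [
        "python", "run.py",
        "--clazz", str(clazz),
        "--amount", str(amount),
        "--train", str(train),
        "--evaluate", str(evaluate),
    ]


# Hypothetical sweep over three amounts for the same class.
commands = [build_command("beads", n) for n in (3, 6, 9)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the runs one after another.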

Hyperparameters:

The hyperparameters can be changed in the config.py script. Note that the paper results are based on SDXL-turbo.

Outputs

The script creates output folders, storing the token representations in token and the generated images in img.

Citation

If you make use of our work, please cite our paper:

@misc{zafar2024iterativeobjectcountoptimization,
      title={Iterative Object Count Optimization for Text-to-image Diffusion Models}, 
      author={Oz Zafar and Lior Wolf and Idan Schwartz},
      year={2024},
      eprint={2408.11721},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.11721}, 
}
