This repository is for the paper:
GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
Can Qin 1, Ning Yu 2, Chen Xing 2, Shu Zhang 2, Zeyuan Chen 2, Stefano Ermon 3, Yun Fu 1, Caiming Xiong 2, Ran Xu 2
1 Northeastern University, 2 Salesforce AI Research, 3 Stanford University
Work done when Can Qin was an intern at Salesforce AI Research.
Set up the stable-diffusion environment first (this may take a few minutes):
cd ./stable-diffusion
PIP_EXISTS_ACTION=w conda env create -f environment.yaml
conda activate gluegen
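As a quick sanity check (our suggestion, not part of the original instructions), confirm that PyTorch is importable in the new environment and sees your GPU:
# prints the PyTorch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"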
Then install the packages for AudioCLIP:
cd ./stable-diffusion/audioclip
pip install -r requirements.txt
pip install -U llvmlite==0.32.1
pip install -e .
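If the install went through, the pinned llvmlite version should be importable (a quick check we suggest; llvmlite==0.32.1 is old and may require an older Python):
python -c "import llvmlite; print(llvmlite.__version__)"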
Download the official Stable Diffusion v1 checkpoint from https://huggingface.co/runwayml/stable-diffusion-v1-5 and save it as ./checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt.
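One way to fetch it from the command line (a sketch on our part; the URL assumes the checkpoint is still hosted under the original runwayml/stable-diffusion-v1-5 repository):
mkdir -p ./checkpoints_all/checkpoint_sd_v1
cd ./checkpoints_all/checkpoint_sd_v1
# adjust the URL if runwayml/stable-diffusion-v1-5 is no longer hosted on Hugging Face
wget https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.ckpt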
Then follow the AudioCLIP README (./stable-diffusion/audioclip/README.md) to download its checkpoint and save it as ./checkpoints_all/audioclip_checkpoint/AudioCLIP-Full-Training.pt:
mkdir ./checkpoints_all/audioclip_checkpoint
cd ./checkpoints_all/audioclip_checkpoint
wget /~https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/AudioCLIP-Full-Training.pt
Then download the pretrained GlueNet checkpoints and save them to ./checkpoints_all/gluenet_checkpoint:
bash download_gluenet_checkpoints.sh
Download the audio dataset (UrbanSound8K) to ./data, so that it ends up at ./data/urbansound8k:
bash download_us8k_data.sh
Download the multilingual text dataset to ./data:
bash download_multilingual_data.sh
Multilingual Stable Diffusion Inference (every prompt below means "Impressionist painting of an afternoon garden", in Chinese, French, Spanish, Japanese, and Italian respectively):
cd stable-diffusion
python scripts/txt2img_demo_ml.py --prompt "下午的花园的印象派绘画" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt
python scripts/txt2img_demo_ml.py --prompt "Peinture impressionniste d'un jardin d'après-midi" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt
python scripts/txt2img_demo_ml.py --prompt "Pintura impresionista de un jardín de tarde" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt
python scripts/txt2img_demo_ml.py --prompt "午後の庭の印象派絵画" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt
python scripts/txt2img_demo_ml.py --prompt "Pittura impressionista di un giardino pomeridiano" --plms --outdir outputs/text2img-multilingual --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt
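Each command writes its results under outputs/text2img-multilingual; assuming the demo follows the stock Stable Diffusion txt2img layout, individual images land in a samples/ subfolder alongside a summary grid.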
Sound-to-image Stable Diffusion Inference:
cd stable-diffusion
python scripts/sound2img_gluegen.py --plms --ckpt ../checkpoints_all/checkpoint_sd_v1/v1-5-pruned-emaonly.ckpt --outdir outputs/sound2img --config configs/stable-diffusion/v1-inference-trans-audioclip.yaml --scale 7.5 --n_iter 1 --audioclip_ckpt ../checkpoints_all/audioclip_checkpoint/AudioCLIP-Full-Training.pt
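Here --scale is the classifier-free guidance scale and --n_iter the number of sampling rounds; assuming the demo follows the standard Stable Diffusion txt2img conventions, a higher scale makes samples follow the audio conditioning more closely at some cost in diversity.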
Sound-to-image GlueNet Training:
cd ./sound-gluenet
CUDA_VISIBLE_DEVICES=0 python train_gluenet_sound_text.py
Multilingual Text-to-image GlueNet Training:
cd ./multilingual-gluenet
CUDA_VISIBLE_DEVICES=0 python train_gluenet_multi.py --DATA_PATH_SRC ../data/WikiMatrix.en-zh.txt.en --DATA_PATH_TAR ../data/WikiMatrix.en-zh.txt.zh --DATA_PATH_SRC_1 ../data/laion-1M-trans-en-zh-cn-en.txt --DATA_PATH_TAR_1 ../data/laion-1M-trans-en-zh-cn-zh-cn.txt --tarLanguage Chinese
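To train GlueNet for another language pair, point the same script at the corresponding parallel corpora and change --tarLanguage. For example, a hypothetical French run (the file names below are illustrative; substitute whatever download_multilingual_data.sh actually placed under ./data):
# hypothetical file names for an English-French pair; adjust to the files you actually have
CUDA_VISIBLE_DEVICES=0 python train_gluenet_multi.py --DATA_PATH_SRC ../data/WikiMatrix.en-fr.txt.en --DATA_PATH_TAR ../data/WikiMatrix.en-fr.txt.fr --DATA_PATH_SRC_1 ../data/laion-1M-trans-en-fr-en.txt --DATA_PATH_TAR_1 ../data/laion-1M-trans-en-fr-fr.txt --tarLanguage French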
If you find this project useful for your research, please kindly cite our paper:
@article{qin2023gluegen,
title={GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation},
author={Qin, Can and Yu, Ning and Xing, Chen and Zhang, Shu and Chen, Zeyuan and Ermon, Stefano and Fu, Yun and Xiong, Caiming and Xu, Ran},
journal={arXiv preprint arXiv:2303.10056},
year={2023}
}
If you have any questions, please contact Can Qin.
Stable Diffusion /~https://github.com/CompVis/stable-diffusion
AudioCLIP /~https://github.com/AndreyGuzhov/AudioCLIP
WikiMatrix /~https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix