Mingkun Lei1
Xue Song2
Beier Zhu1, 3
Hao Wang4
Chi Zhang1✉
1AGI Lab, Westlake University,
2Fudan University,
3Nanyang Technological University
4The Hong Kong University of Science and Technology (Guangzhou)
- [2024.12.12] 🔥🔥We release the code.
- [2024.12.19] 📝📝We have summarized the recent developments in style transfer and will continue to update the list.
Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly overfitting to reference styles, limited stylistic control, and misalignment with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during the early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.
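For intuition, the snippet below is a minimal sketch of the AdaIN operation the first strategy builds on: one feature stream is normalized per channel and then re-modulated with the statistics of another. The paper's cross-modal variant applies this idea between text and style features; the function and shapes here are illustrative, not the released implementation.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: normalize `content_feat` per channel,
    then re-scale and shift it with the channel statistics of `style_feat`.
    Expected shapes are illustrative: (batch, channels, ...)."""
    dims = tuple(range(2, content_feat.dim()))  # spatial/token dimensions
    c_mean = content_feat.mean(dim=dims, keepdim=True)
    c_std = content_feat.std(dim=dims, keepdim=True) + eps
    s_mean = style_feat.mean(dim=dims, keepdim=True)
    s_std = style_feat.std(dim=dims, keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```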
git clone /~https://github.com/Westlake-AGI-Lab/StyleStudio
cd StyleStudio
# create env using conda
conda create -n StyleStudio python=3.10
conda activate StyleStudio
# install dependencies with pip
# for Linux and Windows users
pip install -r requirements.txt
Please note: Our solution is designed to be fine-tuning free and can be combined with different methods.
Key arguments:
- `adainIP`: use the cross-modal AdaIN
- `fuSAttn`: hijack the Self-Attention Map in the Teacher Model (see the sketch after this list)
- `fuAttn`: hijack the Cross-Attention Map in the Teacher Model
- `end_fusion`: define when the Teacher Model stops participating
- `prompt`: the text prompt for generating the image
- `style_path`: path to the style image or folder
- `neg_style_path`: path to the negative style image
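To make the attention-map options concrete, here is a minimal sketch of what "hijacking" means: during the first `end_fusion` denoising steps, the attention probabilities of the stylized branch are replaced with those computed by the Teacher Model. All names and shapes below are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def hijacked_attention(q, k, v, teacher_attn=None):
    """Scaled dot-product attention whose probability map can be overridden.
    Before `end_fusion`, `teacher_attn` (the Teacher Model's map) is passed in
    and used in place of the map computed from q and k. Illustrative only."""
    scale = q.shape[-1] ** -0.5
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    if teacher_attn is not None:
        attn = teacher_attn  # hijack: reuse the Teacher Model's attention map
    return attn @ v
```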
Follow CSGO to download pre-trained checkpoints.
This is an example of usage: as the value of `end_fusion` increases, the style gradually diminishes. If `num_inference_steps` is set to 50, we recommend setting `end_fusion` between 10 and 20. Typically, `end_fusion` should be set within the first 1/5 to 1/3 of the total `num_inference_steps`.
If you find that layout stability is not satisfactory, consider increasing the duration of the Teacher Model's involvement.
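As a quick rule of thumb (a sketch, not part of the released code), `end_fusion` can be derived from the total step count:

```python
num_inference_steps = 50
# The Teacher Model guides roughly the first 1/5 to 1/3 of denoising.
end_fusion_min = num_inference_steps // 5  # 10
end_fusion_max = num_inference_steps // 3  # 16; values up to 20 also work well at 50 steps
```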
# Generate a single stylized image
# Use a specific text prompt and style image path
# --adainIP     enable Cross-Modal AdaIN
# --fuSAttn     enable the Teacher Model with the Self-Attention Map
# --end_fusion  define when the Teacher Model stops participating
python infer_StyleStudio.py \
    --prompt "A red apple" \
    --style_path "assets/style1.jpg" \
    --adainIP \
    --fuSAttn \
    --end_fusion 20 \
    --num_inference_steps 50
# Check layout stability across different style images
# With the same text prompt and a set of style images
python infer_StyleStudio_layout_stability.py \
    --prompt "A red apple" \
    --style_path "path/to/style_images_folder" \
    --adainIP \
    --fuSAttn \
    --end_fusion 20 \
    --num_inference_steps 50
- As shown in Figure 15 of the paper, employing a Cross-Attention Map in the Teacher Model does not ensure layout stability. We have also provided the `fuAttn` interface and encourage everyone to experiment with it.
- To ensure layout stability and consistency for the same prompt under different style images, it is important to keep the initial noise $z_0$ identical across experiments. For more details on this aspect, refer to `infer_StyleStudio_layout_stability.py`.
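One way to keep $z_0$ fixed across style images is to pre-sample the latents with a seeded generator and reuse them for every run. Below is a sketch assuming a diffusers-style pipeline; the actual handling lives in `infer_StyleStudio_layout_stability.py`.

```python
import torch

# Sample the initial latent noise once with a fixed seed.
generator = torch.Generator(device="cuda").manual_seed(42)
latents = torch.randn(
    (1, 4, 128, 128),  # SDXL latent shape for 1024x1024 output
    generator=generator,
    device="cuda",
    dtype=torch.float16,
)

# Reuse the same `latents` for every style image so z_0 stays identical, e.g.:
# image = pipe(prompt="A red apple", latents=latents).images[0]
```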
This is an example of using Style-based Classifier-Free Guidance.
python infer_StyleStudio.py \
    --prompt "A red apple" \
    --style_path "assets/style2.jpg" \
    --neg_style_path "assets/neg_style2.jpg"
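For intuition, the guidance arithmetic presumably mirrors standard classifier-free guidance, steering the prediction away from the unwanted style. A hedged sketch follows; the function name, scale, and exact combination are assumptions, not the released implementation.

```python
import torch

def style_cfg(eps_style: torch.Tensor, eps_neg_style: torch.Tensor, scale: float) -> torch.Tensor:
    """Combine the noise predictions of the style-conditioned branch and the
    negative-style branch, CFG-style: push away from the negative style."""
    return eps_neg_style + scale * (eps_style - eps_neg_style)
```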
Some recommendations for generating Negative Style Images:
- You can use ControlNet Canny for generation.
- To make the generated images more realistic, you can use checkpoints from Civitai or Hugging Face that are better suited to realistic imagery. We use RealVisXL_V4.0.
To generate negative style images, we provide a reference implementation in `example_create_neg_style.py`.
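As a starting point, here is a hedged sketch of generating a negative style image with ControlNet Canny and RealVisXL_V4.0 via diffusers. The model IDs, Canny thresholds, and conditioning scale are assumptions; the reference implementation remains `example_create_neg_style.py`.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# Load an SDXL Canny ControlNet and a realistic SDXL checkpoint (assumed IDs).
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Extract Canny edges from the style image to use as the structural condition.
style = np.array(Image.open("assets/style2.jpg").convert("RGB"))
edges = cv2.Canny(style, 100, 200)[:, :, None]
canny = Image.fromarray(np.concatenate([edges, edges, edges], axis=2))

# Generate a realistic image that keeps the style image's layout but drops its style.
neg_style = pipe(
    prompt="a realistic photo",
    image=canny,
    controlnet_conditioning_scale=0.5,
).images[0]
neg_style.save("assets/neg_style2.jpg")
```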
Follow InstantStyle to download pre-trained checkpoints.
python infer_InstantStyle.py \
    --prompt "A red apple" \
    --style_path "assets/style1.jpg" \
    --adainIP \
    --fuSAttn \
    --end_fusion 20 \
    --num_inference_steps 50
Follow StyleCrafter to download pre-trained checkpoints.
We encourage you to integrate the Teacher Model with StyleCrafter. This combination, as shown in our experiments, not only helps maintain layout stability but also effectively reduces content leakage.
cd stylecrafter_sdxl
python stylecrafter_teacherModel.py \
--config config/infer/style_crafter_sdxl.yaml \
--style_path "../assets/style1.jpg" \
--prompt "A red apple" \
--scale 0.5 \
--num_samples 2 \
    --end_fusion 10  # Define when the Teacher Model stops participating
To run a local demo of the project, run the following:
python gradio/app.py
- Style Transfer with Diffusion Models: A paper collection of recent style transfer methods with diffusion models.
- CSGO: Content-Style Composition in Text-to-Image Generation
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
- StyleCrafter-SDXL
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
If you find our repo helpful, please consider leaving a star or citing our paper :)
@misc{lei2024stylestudiotextdrivenstyletransfer,
title={StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements},
author={Mingkun Lei and Xue Song and Beier Zhu and Hao Wang and Chi Zhang},
year={2024},
eprint={2412.08503},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.08503},
}
If you have any comments or questions, feel free to contact Mingkun Lei.