SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

[Website] [Paper] [Nunchaku Inference System]

Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive to quantization, and conventional post-training quantization methods for large language models, such as smoothing, become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first shift the outliers from the activations into the weights, then employ a high-precision low-rank branch to take in the outliers in the weights via SVD. This process eases the quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to the extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine, Nunchaku, that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without requantization. Extensive experiments on SDXL, PixArt-Sigma, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage of the 12B FLUX.1 models by 3.6×, achieving a 3.5× speedup over the 4-bit weight-only quantized baseline on a 16GB RTX 4090 GPU, paving the way for more interactive applications on PCs.
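
To make the decomposition concrete, below is a minimal PyTorch sketch of the idea: split a weight matrix into a 16-bit low-rank branch that captures the outlier-heavy top singular components, plus a 4-bit residual. The rank, the naive per-tensor quantizer, and all function names here are illustrative assumptions, not deepcompressor's actual implementation.

import torch

def naive_int4_dequant(w: torch.Tensor) -> torch.Tensor:
    # Naive symmetric per-tensor 4-bit quantize-dequantize (illustrative;
    # the real pipeline uses finer-grained quantization and smoothing).
    scale = w.abs().max() / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

def svdquant_decompose(w: torch.Tensor, rank: int = 32):
    # Top singular components, which carry the outliers, go to a
    # high-precision low-rank branch; the residual is quantized to 4 bits.
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]   # (out_features, rank)
    l2 = vh[:rank, :]             # (rank, in_features)
    residual = w - l1 @ l2
    return l1, l2, naive_int4_dequant(residual)

w = torch.randn(512, 512)
l1, l2, r_q = svdquant_decompose(w)
x = torch.randn(4, 512)
# Forward pass: 16-bit low-rank branch plus 4-bit residual branch.
y = x @ l2.T @ l1.T + x @ r_q.T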

[Teaser figure: SVDQuant overview]

Usage

We use FLUX.1-schnell as an example.

Step 1: Evaluation Baselines Preparation

To evaluate the similarity metrics, we first need reference images generated by the unquantized model. Prepare them by running the following command:

python -m deepcompressor.app.diffusion.ptq configs/model/flux.1-schnell.yaml --output-dirname reference

In this command,

  • configs/model/flux.1-schnell.yaml specifies the model configurations including evaluation setups.
  • Setting the --output-dirname flag to reference automatically redirects the output directory to the ref_root specified in the evaluation configuration.

Step 2: Calibration Dataset Preparation

Before quantizing the diffusion model, we randomly sample 128 prompts from COCO Captions 2024 to generate the calibration dataset by running the following command:

python -m deepcompressor.app.diffusion.dataset.collect.calib \
    configs/model/flux.1-schnell.yaml configs/collect/qdiff.yaml

In this command,

  • configs/collect/qdiff.yaml specifies the calibration dataset configurations, including the path to the prompt YAML (i.e., --collect-prompt-path prompts/qdiff.yaml), the number of prompts to sample (i.e., --collect-num-samples 128), and the root directory of the calibration datasets (which should match the quantization configuration).
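
Conceptually, this step boils down to drawing a random subset of prompts and caching the model's activations on them. A minimal sketch of the prompt-sampling part is below; the assumption that prompts/qdiff.yaml is a flat list of prompt strings is ours, for illustration only.

import random
import yaml

# Illustrative sketch of sampling 128 calibration prompts, mirroring
# --collect-num-samples 128. The YAML layout is a hypothetical assumption.
with open("prompts/qdiff.yaml") as f:
    prompts = yaml.safe_load(f)

random.seed(0)  # fixed seed so the calibration set is reproducible
calib_prompts = random.sample(prompts, k=128)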

Step 3: Model Quantization

The following command will perform INT4 SVDQuant and evaluate the quantized model on 1024 samples from MJHQ-30K:

python -m deepcompressor.app.diffusion.ptq \
    configs/model/flux.1-schnell.yaml configs/svdquant/int4.yaml \
    --eval-benchmarks MJHQ --eval-num-samples 1024

In this command,

  • The positional arguments are configuration files, loaded in order. configs/svdquant/int4.yaml contains the quantization configurations specialized for INT4 SVDQuant. Please make sure all configuration files are under a subfolder of the working directory where you run the command. Additional configuration files can be stacked on top, e.g., configs/svdquant/fast.yaml or configs/svdquant/gptq.yaml (the latter enables GPTQ on top of SVDQuant, reported as SVDQ+GPTQ below):
    python -m deepcompressor.app.diffusion.ptq \
        configs/model/flux.1-schnell.yaml configs/svdquant/int4.yaml configs/svdquant/fast.yaml \
        --eval-benchmarks MJHQ --eval-num-samples 1024
    python -m deepcompressor.app.diffusion.ptq \
        configs/model/flux.1-schnell.yaml configs/svdquant/int4.yaml configs/svdquant/gptq.yaml \
        --eval-benchmarks MJHQ --eval-num-samples 1024
  • All configurations can be set either in the YAML files or on the command line; please refer to configs/__default__.yaml and python -m deepcompressor.app.diffusion.ptq -h for the full list of options (the sketch after this list illustrates how layered configurations compose).
  • The default evaluation datasets are 1024 samples from MJHQ and DCI.
  • If you would like to save the quantized model checkpoint, add --save-model true or --save-model /PATH/TO/CHECKPOINT/DIR to the command.
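
Because the positional configuration files are loaded in order, later files override overlapping keys in earlier ones, and command-line flags override both. The snippet below illustrates this layering with OmegaConf; deepcompressor's actual configuration machinery may differ, and the eval.num_samples key is a hypothetical stand-in.

from omegaconf import OmegaConf

# Conceptual illustration of layered configuration: later sources
# override earlier ones, and CLI overrides take final precedence.
base = OmegaConf.load("configs/model/flux.1-schnell.yaml")
quant = OmegaConf.load("configs/svdquant/int4.yaml")
cli = OmegaConf.from_dotlist(["eval.num_samples=1024"])  # hypothetical key
cfg = OmegaConf.merge(base, quant, cli)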

Deployment

If you have saved the SVDQuant W4A4 quantized model checkpoint, you can easily deploy the quantized model with the Nunchaku engine.

Please run the following command to convert the saved checkpoint to a Nunchaku-compatible checkpoint:

python -m deepcompressor.backend.nunchaku.convert \
  --quant-path /PATH/TO/CHECKPOINT/DIR \
  --output-root /PATH/TO/OUTPUT/ROOT \
  --model-name MODEL_NAME

Once you have the Nunchaku-compatible checkpoint, switch to the Nunchaku conda environment and refer to Nunchaku for further deployment on your GPU system.

If you want to integrate a LoRA, please run the following command to convert it to a Nunchaku-compatible checkpoint:

python -m deepcompressor.backend.nunchaku.convert_lora \
  --quant-path /PATH/TO/NUNCHAKU/TRANSFORMER_BLOCKS/SAFETENSORS_FILE \
  --lora-path /PATH/TO/DIFFUSERS/LORA/SAFETENSORS_FILE \
  --output-root /PATH/TO/OUTPUT/ROOT \
  --lora-name LORA_NAME
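
Why does LoRA integration not require requantization? Both the SVDQuant low-rank branch and a LoRA are low-rank adapters on the same linear layer, so a LoRA can be folded into the 16-bit branch while the 4-bit weights stay untouched. The PyTorch sketch below illustrates the underlying block-matrix identity; the shapes and the omission of LoRA scaling are simplifying assumptions, and this is not Nunchaku's actual conversion code.

import torch

# Fusing a LoRA (b @ a) into the SVDQuant low-rank branch (l1 @ l2)
# by concatenation: the 4-bit residual weights are never touched.
out_f, in_f, r_svd, r_lora = 512, 512, 32, 16
l1, l2 = torch.randn(out_f, r_svd), torch.randn(r_svd, in_f)  # SVDQuant branch
b, a = torch.randn(out_f, r_lora), torch.randn(r_lora, in_f)  # LoRA factors

l1_fused = torch.cat([l1, b], dim=1)  # (out_f, r_svd + r_lora)
l2_fused = torch.cat([l2, a], dim=0)  # (r_svd + r_lora, in_f)
# l1_fused @ l2_fused == l1 @ l2 + b @ a, so one fused low-rank GEMM
# computes the SVD correction and the LoRA update together.
assert torch.allclose(l1_fused @ l2_fused, l1 @ l2 + b @ a, atol=1e-3)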

Evaluation Results

Quality Evaluation

Below are the quality and similarity metrics evaluated on 5,000 samples from the MJHQ-30K dataset. IR denotes ImageReward. Our 4-bit results outperform other 4-bit baselines and effectively preserve the visual quality of the 16-bit models.

| Model | Precision | Method | FID ($\downarrow$) | IR ($\uparrow$) | LPIPS ($\downarrow$) | PSNR ($\uparrow$) |
| --- | --- | --- | --- | --- | --- | --- |
| FLUX.1-dev (50 Steps) | BF16 | -- | 20.3 | 0.953 | -- | -- |
| | INT W8A8 | SVDQ | 20.4 | 0.948 | 0.089 | 27.0 |
| | W4A16 | NF4 | 20.6 | 0.910 | 0.272 | 19.5 |
| | INT W4A4 | -- | 20.2 | 0.908 | 0.322 | 18.5 |
| | INT W4A4 | SVDQ | 20.1 | 0.926 | 0.256 | 20.1 |
| | INT W4A4 | SVDQ+GPTQ | 19.9 | 0.935 | 0.223 | 21.0 |
| | NVFP4 | -- | 20.3 | 0.961 | 0.345 | 16.3 |
| | NVFP4 | SVDQ | 20.7 | 0.934 | 0.222 | 21.0 |
| | NVFP4 | SVDQ+GPTQ | 20.3 | 0.942 | 0.205 | 21.5 |
| FLUX.1-schnell (4 Steps) | BF16 | -- | 19.2 | 0.938 | -- | -- |
| | INT W8A8 | SVDQ | 19.2 | 0.966 | 0.120 | 22.9 |
| | W4A16 | NF4 | 18.9 | 0.943 | 0.257 | 18.2 |
| | INT W4A4 | -- | 18.1 | 0.962 | 0.345 | 16.3 |
| | INT W4A4 | SVDQ | 18.3 | 0.957 | 0.289 | 17.6 |
| | INT W4A4 | SVDQ+GPTQ | 18.3 | 0.951 | 0.257 | 18.3 |
| | NVFP4 | -- | 19.0 | 0.952 | 0.276 | 17.6 |
| | NVFP4 | SVDQ | 19.0 | 0.976 | 0.247 | 18.4 |
| | NVFP4 | SVDQ+GPTQ | 18.9 | 0.964 | 0.229 | 19.0 |
| SANA-1.6b (20 Steps) | BF16 | -- | 20.6 | 0.952 | -- | -- |
| | INT W4A4 | -- | 20.5 | 0.894 | 0.339 | 15.3 |
| | INT W4A4 | GPTQ | 19.9 | 0.881 | 0.288 | 16.4 |
| | INT W4A4 | SVDQ | 19.9 | 0.922 | 0.234 | 17.4 |
| | INT W4A4 | SVDQ+GPTQ | 19.3 | 0.935 | 0.220 | 17.8 |
| | NVFP4 | -- | 19.7 | 0.929 | 0.236 | 17.4 |
| | NVFP4 | GPTQ | 19.7 | 0.925 | 0.202 | 18.3 |
| | NVFP4 | SVDQ | 20.2 | 0.951 | 0.190 | 18.6 |
| | NVFP4 | SVDQ+GPTQ | 20.2 | 0.941 | 0.176 | 19.0 |
| PixArt-Sigma (20 Steps) | FP16 | -- | 16.6 | 0.944 | -- | -- |
| | INT W8A8 | ViDiT-Q | 15.7 | 0.944 | 0.137 | 22.5 |
| | INT W8A8 | SVDQ | 16.3 | 0.955 | 0.109 | 23.7 |
| | INT W4A8 | ViDiT-Q | 37.3 | 0.573 | 0.611 | 12.0 |
| | INT W4A4 | SVDQ | 19.9 | 0.858 | 0.356 | 17.0 |
| | INT W4A4 | SVDQ+GPTQ | 19.2 | 0.878 | 0.323 | 17.6 |
| | NVFP4 | -- | 31.8 | 0.660 | 0.517 | 14.8 |
| | NVFP4 | GPTQ | 27.2 | 0.691 | 0.482 | 15.6 |
| | NVFP4 | SVDQ | 17.3 | 0.945 | 0.290 | 18.0 |
| | NVFP4 | SVDQ+GPTQ | 16.6 | 0.940 | 0.271 | 18.5 |
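
For reference, the PSNR column measures pixel-level similarity against the 16-bit reference images generated in Step 1. A minimal sketch of the standard computation is below, assuming both images are float tensors in [0, 1]; the exact evaluation code may differ.

import torch

def psnr(ref: torch.Tensor, img: torch.Tensor, max_val: float = 1.0) -> float:
    # Peak signal-to-noise ratio between the reference image and the
    # quantized model's image; higher means closer to the reference.
    mse = torch.mean((ref - img) ** 2)
    return float(20 * torch.log10(torch.tensor(max_val)) - 10 * torch.log10(mse))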

Reference

If you find deepcompressor useful or relevant to your research, please kindly cite our paper:

@inproceedings{li2024svdquant,
  title={SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models},
  author={Li*, Muyang and Lin*, Yujun and Zhang*, Zhekai and Cai, Tianle and Li, Xiuyu and Guo, Junxian and Xie, Enze and Meng, Chenlin and Zhu, Jun-Yan and Han, Song},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}