HCP-Diffusion is configured through `.yaml` configuration files that specify the components used for training, including the model structure, the training parameters and methods, and the dataset configuration. The basic configuration files and training examples are in the `cfgs/train` directory.
All training configuration files should inherit both `train_base.yaml` and `tuning_base.yaml`. `train_base.yaml` defines the hyperparameters used in the training stage and the configuration of the dataset, while `tuning_base.yaml` defines the model structure and its training parameters, including which model parameters and embeddings are trained and how Lora is added to layers.
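For example, a training configuration typically declares this inheritance at the top of the file. The snippet below is only a sketch: the `_base_` key and paths follow the examples shipped in `cfgs/train`, so check those examples if your version uses a different mechanism:

```yaml
_base_:
  - cfgs/train/train_base.yaml  # training hyperparameters and dataset configuration
  - cfgs/train/tuning_base.yaml # model structure and trainable parameters

seed: 1919810 # values defined here override the inherited ones
```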
The parameters in the configuration file can also be overridden on the command line:
```bash
accelerate launch -m hcpdiff.train_ac --cfg cfgs/train/cfg_file.yaml data.batch_size=2 seed=1919810
```
The configuration of the dataset contains the following items:
```yaml
data:
  batch_size: 4 # batch_size for this part of data
  prompt_template: 'prompt_tuning_template/object.txt' # prompt template, works with caption and custom words
  caption_file: null # path to image captions file
  cache_latents: True # cache the images with VAE encoding before training
  att_mask: null # path to attention mask directory
  image_transforms: null # image transforms and augmentation
  tag_transforms: null # prompt transforms and augmentation, fill templates
  bucket: null # image provider, describing the way images are arranged and grouped
```
If there is already a `.txt` caption file for each image, you can convert them to a `.json` caption file with the following command:
```bash
python -m hcpdiff.tools.convert_caption_txt2json --data_root path_to_dataset
```
A bucket groups images so that images with the same properties end up in the same batch (an example configuration is sketched after the list). The supported buckets are:

- FixedBucket: all images are scaled and cropped to the same specified size.
- RatioBucket (ARB): images are divided into groups based on their aspect ratio, and different batches can have different aspect ratios, which reduces the degradation caused by image cropping.
  - From ratios: a set of n aspect ratios is automatically selected from the given aspect-ratio range based on the target area, and the images are divided into buckets according to these ratios.
  - From images: images are automatically clustered based on their aspect ratio, and the n buckets closest to the target area are selected.
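For example, a fixed-size bucket might be configured as follows. This is only a sketch: the class path `hcpdiff.data.bucket.FixedBucket` and its `target_size` argument are assumptions inferred from the names above, so check the bucket classes shipped with your version:

```yaml
data:
  bucket:
    _target_: hcpdiff.data.bucket.FixedBucket # assumed class path
    target_size: [512, 512] # assumed argument: scale and crop every image to 512x512
```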
The pre-training dataset is specified in the configuration file as `data_class`, with all settings consistent with `data`. It can be used for the DreamBooth method, or to let the model learn from images it generated itself while preserving its original generation ability.
This part of the data can be generated from prompts randomly selected from a prompt database:
```bash
python -m hcpdiff.tools.gen_from_ptlist --model pretrained_model --prompt_file prompt_dataset.parquet --out_dir images_output_dir
```
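Since `data_class` accepts the same settings as `data`, a sketch of such a section (with placeholder values) could look like this:

```yaml
data_class:
  batch_size: 1 # placeholder value for the pre-training images
  prompt_template: 'prompt_tuning_template/object.txt'
  caption_file: null
  cache_latents: True
  att_mask: null
  image_transforms: null
  tag_transforms: null
  bucket: null
```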
Placeholders in the prompt template are replaced with specified text during training. For example, given the prompt template:

```
a photo of a {pt1} on the {pt2}, {caption}
```
The placeholders `{pt1}` and `{pt2}` in the prompt template will be replaced with the words specified by the `TemplateFill` defined in the `tag_transforms` section. Such a word can be a custom embedding that occupies multiple words, or a native word of the model.
For example, define the following `tag_transforms`:
```yaml
tag_transforms:
  _target_: torchvision.transforms.Compose
  transforms:
    - _target_: hcpdiff.utils.caption_tools.TemplateFill
      word_names:
        pt1: my-cat # a custom embedding
        pt2: sofa
```
During training, the placeholder `{pt1}` will be replaced with the custom embedding `my-cat`, and the placeholder `{pt2}` will be replaced with the native word `sofa`. The placeholder `{caption}` will be replaced with the caption of the corresponding image; if no caption is defined for an image, `{caption}` is left empty.
Currently, fine-tuning supports training the U-Net and the text encoder. It can be used to train only part of the model and to assign different learning rates to different layers. The format is as follows:
```yaml
unet:
  - # group1
    lr: 1e-6 # learning rate for all layers in this group
    layers: # layers to train
      # train all layers in submodule 0 of down_blocks
      - 'down_blocks.0'
      # regular expressions are supported, starting with "re:"
      # train the GroupNorm layer in all resnet modules
      - 're:.*\.resnets.*\.norm?'
  - # group2
    lr: 3e-6
    layers:
      # train all CrossAttention modules
      - 're:.*\.attn.?$'

text_encoder: ...
```
The layers to train are described with names consistent with those returned by `model.named_modules()`.
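The `text_encoder` groups use the same format. The sketch below assumes the standard CLIP text-encoder module names from `transformers` (e.g. `text_model.encoder.layers.N.self_attn`); verify them with `named_modules()` before training:

```yaml
text_encoder:
  - lr: 1e-6
    layers:
      # train all self-attention modules of the CLIP text encoder (assumed module names)
      - 're:.*self_attn$'
```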
Prompt-tuning trains word embeddings, and a single word embedding can occupy multiple words. First, create the custom words:
```bash
python -m hcpdiff.tools.create_embedding pretrained_model word_name length_of_word [--init_text initial_text]
# random initialization: --init_text *std
```
Configure which words need to be trained in `tokenizer_pt`:
```yaml
tokenizer_pt:
  emb_dir: 'embs/' # directory of the custom words
  replace: False # whether to replace the original word
  train:
    - {name: pt1, lr: 0.003}
    - {name: pt2, lr: 0.005}
```
In this framework, Lora can be added to any Linear or Conv2d layer of the U-Net and the text encoder, respectively. Its configuration is similar to fine-tuning:
```yaml
lora_unet:
  -
    lr: 1e-4
    rank: 2 # rank of the Lora blocks
    dropout: 0.0
    scale: 1.0 # output = base_model + scale*lora; if it is 1.0, the final scale is 1/rank
    svd_init: False # initialize the Lora parameters using the SVD decomposition of the host layer
    layers:
      - 're:.*\.attn.?$'
  -
    lr: 5e-5
    rank: 0.1 # if the rank is a float, the final rank = round(host.out_channel * rank)
    # dropout, scale, svd_init can be omitted to use the default values
    layers:
      - 're:.*\.ff\.net\.0$'

lora_text_encoder: ...
```
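`lora_text_encoder` follows the same format. A minimal sketch, again assuming CLIP-style attention projection names (verify with `named_modules()`); the comment works through the float-rank rule mentioned above:

```yaml
lora_text_encoder:
  - lr: 1e-4
    rank: 0.125 # for a host layer with out_channel = 768, the final rank = round(768 * 0.125) = 96
    layers:
      - 're:.*self_attn.*_proj$' # assumed names of the q/k/v/out projection layers
```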
With only a few training images, it is difficult for the model to figure out which features are important. It is therefore possible to add an attention mask that makes the model focus more or less on certain features during training, as shown in the figure above. The attention mask and the image should be placed in different folders but share the same file name. The attention mask is a grayscale image; the mapping from grayscale values to attention multipliers is shown in the following table:
| grayscale  | 0% | 25% | 50%  | 75%  | 100% |
|------------|----|-----|------|------|------|
| multiplier | 0% | 50% | 100% | 300% | 500% |
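To enable the mask, point the `att_mask` field of the dataset section (shown earlier) to the mask directory; the path below is only an illustrative placeholder:

```yaml
data:
  att_mask: 'imgs/att_masks' # placeholder path; each mask shares its file name with the corresponding image
```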
Some models skip several CLIP blocks during training. You can specify the number of CLIP blocks to skip by setting `model.clip_skip`. The default value is 0 (equivalent to clip skip = 1 in the webui), which does not skip any blocks.
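For example, to skip one CLIP block (equivalent to clip skip = 2 in the webui):

```yaml
model:
  clip_skip: 1 # skip the last CLIP block
```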