
Tutorials

HCP-Diffusion is configured through a .yaml configuration file that specifies the components used for training: the model structure, the training parameters and methods, and the dataset configuration.

Basic training configuration

The basic configuration files and examples for training are in the cfgs/train directory. Every training configuration file should inherit both train_base.yaml and tuning_base.yaml. train_base.yaml defines the hyperparameters used during training and the dataset configuration, while tuning_base.yaml defines the model structure and its training parameters, including which model parameters and embeddings are trained and how Lora is added to layers.
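
For orientation, a new configuration file might declare this inheritance explicitly. The sketch below assumes the _base_ key used by the bundled examples; the override values are placeholders.

_base_:
  - cfgs/train/train_base.yaml
  - cfgs/train/tuning_base.yaml # assumed inheritance key; see the files in cfgs/train for the exact convention

seed: 42          # placeholder: override any inherited hyperparameter
data:
  batch_size: 4   # placeholder: override part of the dataset configuration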

Any parameter in the configuration file can also be overridden on the command line:

accelerate launch -m hcpdiff.train_ac --cfg cfgs/train/cfg_file.yaml data.batch_size=2 seed=1919810

Dataset configuration

The configuration of the dataset contains the following items:

data:
  batch_size: 4 # batch_size for this part of data
  prompt_template: 'prompt_tuning_template/object.txt' # prompt template, works with caption and custom words
  caption_file: null # path to image captions file
  cache_latents: True # Cache the images with VAE encoding before training
  att_mask: null # Path to attention mask directory
  image_transforms: null # Image transforms and augmentation
  tag_transforms: null # prompt transforms and augmentation, fill templates
  bucket: null # Image provider, describing the way images are arranged and grouped

If you already have a .txt caption file for each image, you can convert them to a .json caption file with the following command:

python -m hcpdiff.tools.convert_caption_txt2json --data_root path_to_dataset
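
The generated caption file can then be referenced from the dataset configuration; the path below is only a placeholder.

data:
  caption_file: 'path_to_dataset/captions.json' # placeholder: path to the converted caption file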

Bucket

A bucket groups images so that images with the same properties are placed in the same batch (an example configuration follows the list below).

The supported buckets:

  • FixedBucket: All images are scaled and cropped to the same specified size.
  • RatioBucket (ARB): Images are grouped by aspect ratio, and different batches can use different aspect ratios. This reduces the degradation caused by image cropping.
    • From ratios: n different aspect ratios are automatically selected from the given aspect-ratio range according to the target area, and images are assigned to buckets by these ratios.
    • From images: Images are automatically clustered by aspect ratio, and n buckets are chosen whose sizes best match the target area.
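
As an illustration, a bucket could be configured in the data section using the same _target_ style shown later for tag_transforms. The class path and parameter names below are assumptions, not verified names; check the bundled cfgs/train examples for the exact ones.

data:
  bucket:
    _target_: hcpdiff.data.bucket.FixedBucket # assumed import path
    target_size: [512, 512]                   # illustrative parameter: scale and crop all images to 512x512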

Prior dataset (inspired by DreamBooth)

The prior dataset is specified in the configuration file as data_class, with all settings consistent with data. It can be used for the DreamBooth method, or to let the model learn from images it generated itself while preserving its original generation ability.
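
A sketch of such a configuration, reusing the keys from the data section above (paths and values are placeholders):

data_class:
  batch_size: 1
  prompt_template: 'prompt_tuning_template/object.txt'
  caption_file: null
  cache_latents: True
  # the remaining keys (att_mask, image_transforms, tag_transforms, bucket) follow the same schema as data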

This part of the data can be generated from prompts randomly selected from a prompt database:

python -m hcpdiff.tools.gen_from_ptlist --model pretrained_model --prompt_file prompt_dataset.parquet --out_dir images_output_dir

Prompt template usage (with tag_transforms)

The prompt template can be replaced with specified text during training. For example, a prompt template:

a photo of a {pt1} on the {pt2}, {caption}

The placeholders {pt1} and {pt2} in the prompt template will be replaced with the specified words by the TemplateFill defined in the tag_transforms section. Such a word can be a custom embedding that occupies multiple words, or a native word from the model's vocabulary.

For example, define the following tag_transforms:

tag_transforms:
    _target_: torchvision.transforms.Compose
    transforms:
      - _target_: hcpdiff.utils.caption_tools.TemplateFill
        word_names:
          pt1: my-cat # A custom embedding
          pt2: sofa

During training, the placeholder {pt1} will be replaced with the custom embedding my-cat, and the placeholder {pt2} will be replaced with the native word sofa. The placeholder {caption} will be replaced with the caption of the corresponding image; if no caption is defined for an image, {caption} is left empty.
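
With this configuration, the template above resolves to a prompt like the following, where the trailing caption text is only an illustrative per-image caption:

a photo of a my-cat on the sofa, a cat sleeping on a blanket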

Fine-tuning configuration

Currently, fine-tuning supports training the U-Net and the text-encoder. It can be used to train only part of the model and to assign different learning rates to different layers. The format is as follows:

unet:
  - # group1
    lr: 1e-6 # Learning rates for all layers in this group
    layers: # Layers to train
      # Train all layers in submodule 0 of down_blocks
      - 'down_blocks.0'
      # Support regular expressions, starting with "re:".
      # Train the GroupNorm layer in all resnet modules
      - 're:.*\.resnets.*\.norm?'
  - # group2
    lr: 3e-6
    layers:
      # Train all CrossAttention modules
      - 're:.*\.attn.?$'

text_encoder: ...

The layers to train are described using names consistent with those returned by model.named_modules().

Prompt-tuning configuration

Prompt-tuning trains word embeddings; a single custom word embedding can occupy multiple words.

First create custom words:

python -m hcpdiff.tools.create_embedding pretrained_model word_name length_of_word [--init_text initial_text]
# Random initialization: --init_text *std

Configure which words need to be trained in tokenizer_pt:

tokenizer_pt:
  emb_dir: 'embs/' # Custom words directory
  replace: False # Whether to replace the original word
  train: 
    - {name: pt1, lr: 0.003}
    - {name: pt2, lr: 0.005}

Lora configuration

In this framework, Lora can be added to any Linear or Conv2d layer of the UNet and the text-encoder, respectively. Its configuration is similar to fine-tuning:

lora_unet:
  -
    lr: 1e-4
    rank: 2 # rank of Lora blocks
    dropout: 0.0
    scale: 1.0 # output=base_model + scale*lora. If it is 1.0 then the final scale is 1/rank
    svd_init: False # Initialize the Lora parameters using the SVD decomposition results of the host layer.
    layers:
      - 're:.*\.attn.?$'
  -
    lr: 5e-5
    rank: 0.1 # If the rank is a float, the final rank = round(host.out_channel * rank)
    # dropout, scale, svd_init can be omitted, use default value
    layers:
      - 're:.*\.ff\.net\.0$'

lora_text_encoder: ...

Attention mask (optional)

With few training images, it is difficult for the model to work out which features are important. An attention mask can therefore be added to make the model pay more or less attention to certain regions during training.

The attention mask and the image should be placed in different folders and have the same file name.
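
To enable it, point the att_mask field of the dataset configuration to the mask directory (the path below is a placeholder):

data:
  att_mask: 'path_to_dataset/att_masks/' # placeholder path; each mask shares its file name with the corresponding image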

The attention mask is a grayscale image; the mapping from grayscale value to attention multiplier is shown in the following table:

Grayscale:  0%    25%    50%    75%    100%
Multiplier: 0%    50%    100%   300%   500%

CLIP skip

Some models skip several CLIP blocks during training. You can specify the number of CLIP blocks to skip by setting model.clip_skip. The default value is 0 (equivalent to clip skip = 1 in the webui), which does not skip any blocks.
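
For example, to skip one CLIP block (a sketch; following the offset described above, this corresponds to clip skip = 2 in the webui):

model:
  clip_skip: 1 # skip one CLIP block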