Trained LoRAs not working #26

Closed
caniyabanci76 opened this issue Jan 11, 2025 · 4 comments

caniyabanci76 commented Jan 11, 2025

When I use LoRAs trained with musubi-tuner in ComfyUI's native workflow, I get the following in the terminal and the trained likeness is not there. This doesn't happen with LoRAs from the diffusion-pipe trainer. Am I doing something wrong?

lora key not loaded: lora_unet_double_blocks_0_img_mlp_fc1.alpha
...etc

Sarania commented Jan 11, 2025

Did you use convert_lora.py to convert it?

python convert_lora.py --input lora.safetensors --output converted_lora.safetensors --target other

As I understand it, you have to do this if you want to use the LoRA with the native ComfyUI nodes.
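
If you want to double-check which naming scheme a file uses, a minimal sketch along these lines will list the tensor keys (the file name is just an example). Unconverted files use keys like lora_unet_double_blocks_0_img_mlp_fc1.alpha, which is exactly what ComfyUI reports as not loaded:

```python
# Minimal sketch: print the first few tensor names in a LoRA .safetensors file
# to see which key naming scheme it uses. Requires the safetensors package;
# the file name below is just an example.
from safetensors import safe_open

with safe_open("lora.safetensors", framework="pt", device="cpu") as f:
    for name in sorted(f.keys())[:10]:
        print(name)
```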

@caniyabanci76 (Author)

Ahh, thank you! I really should read the whole README before asking! 🙏🏻


@PiDrawings

I would like to reopen this thread.
I'm having a similar issue. I trained the LoRA (11 images, 512x512, with text files), 15 epochs with 20 repeats per image, and converted the LoRA at the end.

On the ComfyUI side there is nothing bad in the command line, but the output video is not affected by the LoRA at all (I have tested a few epochs and compared against no LoRA): the video is the same and the character does not resemble the one I trained. I'm using Musubi with the GUI; below is the information from training. I'm not that technical, but I cannot see anything failing other than sageattention, which I assume should not impact the training.
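
For reference, a quick back-of-the-envelope check of how long this setup runs (assuming batch size 1, which matches the log below):

```python
# Quick sanity check of the run length implied by the settings above:
# 11 images x 20 repeats = 220 batches per epoch, 15 epochs = 3300 steps,
# which matches the "total optimization steps: 3300" in the training log.
images, repeats, epochs, batch_size = 11, 20, 15, 1
steps_per_epoch = images * repeats // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 220 3300
```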

10:22:14-594978 INFO Save configuration...
10:22:14-595971 INFO Save...
10:23:13-219875 INFO Train model...
10:23:13-220875 INFO Executing command: ['uv', 'run', './musubi-tuner/cache_latents.py', '--dataset_config',
'D:/Msubi/musubi-tuner/dataset/P4ulink4.toml', '--vae', 'D:/Msubi/musubi-tuner/NeuralNet/Vae.pt',
'--vae_dtype', 'float16', '--vae_chunk_size', '32', '--vae_tiling', '--vae_spatial_tile_sample_min_size',
'256', '--device', 'cuda', '--batch_size', '1', '--skip_existing', '--keep_cache', '--console_width', '80']
10:23:13-222874 INFO Caching latents...
INFO:main:Load dataset config from D:/Msubi/musubi-tuner/dataset/P4ulink4.toml
INFO:dataset.image_video_dataset:glob images in D:/Msubi/musubi-tuner/dataset/P4ulink4
INFO:dataset.image_video_dataset:found 11 images
INFO:dataset.config_utils:[Dataset 0]
is_image_dataset: True
resolution: (512, 512)
batch_size: 1
num_repeats: 20
caption_extension: ".txt"
enable_bucket: True
bucket_no_upscale: False
cache_directory: "D:/Msubi/musubi-tuner/dataset/P4ulink4_cache_2"
debug_dataset: False
image_directory: "D:/Msubi/musubi-tuner/dataset/P4ulink4"
image_jsonl_file: "None"

INFO:hunyuan_model.vae:Loading 3D VAE model (884-16c-hy) from: D:/Msubi/musubi-tuner/NeuralNet/Vae.pt
INFO:hunyuan_model.vae:VAE to dtype: torch.float16
INFO:main:Loaded VAE: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types', ['DownEncoderBlockCausal3D', 'DownEncoderBlockCausal3D', 'DownEncoderBlockCausal3D', 'DownEncoderBlockCausal3D']), ('up_block_types', ['UpDecoderBlockCausal3D', 'UpDecoderBlockCausal3D', 'UpDecoderBlockCausal3D', 'UpDecoderBlockCausal3D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2), ('act_fn', 'silu'), ('latent_channels', 16), ('norm_num_groups', 32), ('sample_size', 256), ('sample_tsize', 64), ('scaling_factor', 0.476986), ('force_upcast', True), ('spatial_compression_ratio', 8), ('time_compression_ratio', 4), ('mid_block_add_attention', True), ('_use_default_values', ['force_upcast', 'spatial_compression_ratio']), ('_class_name', 'AutoencoderKLCausal3D'), ('_diffusers_version', '0.4.2')]), dtype: torch.float16
INFO:main:Set chunk_size to 32 for CausalConv3d in VAE
INFO:main:Encoding dataset [0]
11it [00:03, 3.04it/s]
10:23:35-965889 INFO Executing command: ['uv', 'run', './musubi-tuner/cache_text_encoder_outputs.py', '--dataset_config',
'D:/Msubi/musubi-tuner/dataset/P4ulink4.toml', '--text_encoder1',
'D:\musubi-tuner-gui\musubi-tuner\NeuralNet\llava_llama3_fp16.safetensors', '--text_encoder2',
'D:/Msubi/musubi-tuner/NeuralNet/clip_l.safetensors', '--fp8_llm', '--device', 'cuda',
'--text_encoder_dtype', 'bfloat16', '--batch_size', '1', '--skip_existing']
10:23:35-966887 INFO Caching text encoder outputs...
INFO:main:Load dataset config from D:/Msubi/musubi-tuner/dataset/P4ulink4.toml
INFO:dataset.image_video_dataset:glob images in D:/Msubi/musubi-tuner/dataset/P4ulink4
INFO:dataset.image_video_dataset:found 11 images
INFO:dataset.config_utils:[Dataset 0]
is_image_dataset: True
resolution: (512, 512)
batch_size: 1
num_repeats: 20
caption_extension: ".txt"
enable_bucket: True
bucket_no_upscale: False
cache_directory: "D:/Msubi/musubi-tuner/dataset/P4ulink4_cache_2"
debug_dataset: False
image_directory: "D:/Msubi/musubi-tuner/dataset/P4ulink4"
image_jsonl_file: "None"

INFO:main:loading text encoder 1: D:\musubi-tuner-gui\musubi-tuner\NeuralNet\llava_llama3_fp16.safetensors
INFO:hunyuan_model.text_encoder:Loading text encoder model (llm) from: D:\musubi-tuner-gui\musubi-tuner\NeuralNet\llava_llama3_fp16.safetensors
INFO:hunyuan_model.text_encoder:Text encoder to dtype: torch.bfloat16
INFO:hunyuan_model.text_encoder:Loading tokenizer (llm) from: D:\musubi-tuner-gui\musubi-tuner\NeuralNet\llava_llama3_fp16.safetensors
INFO:hunyuan_model.text_encoder:Loading tokenizer from Hugging Face: xtuner/llava-llama-3-8b-v1_1-transformers
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO:hunyuan_model.text_encoder:Moving and casting text encoder to cuda and torch.float8_e4m3fn
INFO:main:Encoding with Text Encoder 1
INFO:main:Encoding dataset [0]
11it [00:03, 3.35it/s]
INFO:main:loading text encoder 2: D:/Msubi/musubi-tuner/NeuralNet/clip_l.safetensors
INFO:hunyuan_model.text_encoder:Loading text encoder model (clipL) from: D:/Msubi/musubi-tuner/NeuralNet/clip_l.safetensors
INFO:hunyuan_model.text_encoder:Text encoder to dtype: torch.bfloat16
INFO:hunyuan_model.text_encoder:Loading tokenizer (clipL) from: D:/Msubi/musubi-tuner/NeuralNet/clip_l.safetensors
INFO:hunyuan_model.text_encoder:Loading tokenizer from Hugging Face: openai/clip-vit-large-patch14
INFO:main:Encoding with Text Encoder 2
INFO:main:Encoding dataset [0]
11it [00:00, 30.82it/s]
10:23:57-217195 INFO Saving training config to
D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2_20250122-102357.toml...
10:23:57-218193 INFO Creating folder D:/Msubi/musubi-tuner/Output/P4ulink4_test2 for the configuration file...
10:23:57-220194 INFO Executing command: uv run accelerate launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16
--num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2
D:/musubi-tuner-gui/musubi-tuner/hv_train_network.py --config_file
D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2_20250122-102357.toml
Trying to import sageattention
Failed to import sageattention
INFO:main:Loading settings from D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2_20250122-102357.toml...
INFO:main:D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2_20250122-102357
INFO:main:Load dataset config from D:/Msubi/musubi-tuner/dataset/P4ulink4.toml
INFO:dataset.image_video_dataset:glob images in D:/Msubi/musubi-tuner/dataset/P4ulink4
INFO:dataset.image_video_dataset:found 11 images
INFO:dataset.config_utils:[Dataset 0]
is_image_dataset: True
resolution: (512, 512)
batch_size: 1
num_repeats: 20
caption_extension: ".txt"
enable_bucket: True
bucket_no_upscale: False
cache_directory: "D:/Msubi/musubi-tuner/dataset/P4ulink4_cache_2"
debug_dataset: False
image_directory: "D:/Msubi/musubi-tuner/dataset/P4ulink4"
image_jsonl_file: "None"

INFO:dataset.image_video_dataset:bucket: (512, 512), count: 220
INFO:dataset.image_video_dataset:total batches: 220
INFO:main:preparing accelerator
accelerator device: cuda
INFO:main:DiT precision: torch.float16, weight precision: torch.float8_e4m3fn
INFO:main:Loading DiT model from D:/Msubi/musubi-tuner/NeuralNet/mp_rank_00_model_states_fp8.safetensors
Using torch attention mode, split_attn: False
import network module: networks.lora
INFO:networks.lora:create LoRA network. base dim (rank): 32, alpha: 1
INFO:networks.lora:neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
INFO:networks.lora:create LoRA for U-Net/DiT: 240 modules.
INFO:networks.lora:enable LoRA for U-Net: 240 modules
HYVideoDiffusionTransformer: Gradient checkpointing enabled.
prepare optimizer, data loader etc.
INFO:main:use 8-bit AdamW optimizer | {}
override steps. steps for 15 epochs is / 指定エポックまでのステップ数: 3300
INFO:main:casting model to torch.float8_e4m3fn
running training / 学習開始
num train items / 学習画像、動画数: 220
num batches per epoch / 1epochのバッチ数: 220
num epochs / epoch数: 15
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 3300
INFO:main:calculate hash for DiT model: D:/Msubi/musubi-tuner/NeuralNet/mp_rank_00_model_states_fp8.safetensors
INFO:main:calculate hash for VAE model: D:/Msubi/musubi-tuner/NeuralNet/Vae.pt
steps: 0%| | 0/3300 [00:00<?, ?it/s]INFO:main:DiT dtype: torch.float8_e4m3fn, device: cuda:0

epoch 1/15
Trying to import sageattention
Trying to import sageattention
Failed to import sageattention
Failed to import sageattention
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 0, epoch: 1
steps: 7%|████▌ | 220/3300 [10:45<2:30:43, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000001.safetensors

epoch 2/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 1, epoch: 2
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 1, epoch: 2
steps: 13%|█████████▏ | 440/3300 [21:33<2:20:06, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000002.safetensors

epoch 3/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 2, epoch: 3
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 2, epoch: 3
steps: 20%|█████████████▊ | 660/3300 [32:17<2:09:08, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000003.safetensors

epoch 4/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 3, epoch: 4
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 3, epoch: 4
steps: 27%|██████████████████▍ | 880/3300 [42:57<1:58:08, 2.93s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000004.safetensors

epoch 5/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 4, epoch: 5
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 4, epoch: 5
steps: 33%|██████████████████████▋ | 1100/3300 [53:43<1:47:26, 2.93s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000005.safetensors

epoch 6/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 5, epoch: 6
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 5, epoch: 6
steps: 40%|██████████████████████████▍ | 1320/3300 [1:04:32<1:36:48, 2.93s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000006.safetensors

epoch 7/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 6, epoch: 7
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 6, epoch: 7
steps: 47%|██████████████████████████████▊ | 1540/3300 [1:15:30<1:26:17, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000007.safetensors

epoch 8/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 7, epoch: 8
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 7, epoch: 8
steps: 53%|███████████████████████████████████▏ | 1760/3300 [1:26:15<1:15:28, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000008.safetensors

epoch 9/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 8, epoch: 9
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 8, epoch: 9
steps: 60%|███████████████████████████████████████▌ | 1980/3300 [1:36:55<1:04:36, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000009.safetensors

epoch 10/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 9, epoch: 10
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 9, epoch: 10
steps: 67%|█████████████████████████████████████████████▎ | 2200/3300 [1:47:41<53:50, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000010.safetensors

epoch 11/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 10, epoch: 11
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 10, epoch: 11
steps: 73%|█████████████████████████████████████████████████▊ | 2420/3300 [1:58:36<43:07, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000011.safetensors

epoch 12/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 11, epoch: 12
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 11, epoch: 12
steps: 80%|██████████████████████████████████████████████████████▍ | 2640/3300 [2:09:31<32:22, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000012.safetensors

epoch 13/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 12, epoch: 13
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 12, epoch: 13
steps: 87%|██████████████████████████████████████████████████████████▉ | 2860/3300 [2:20:21<21:35, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000013.safetensors

epoch 14/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 13, epoch: 14
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 13, epoch: 14
steps: 93%|███████████████████████████████████████████████████████████████▍ | 3080/3300 [2:31:06<10:47, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2-000014.safetensors

epoch 15/15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 14, epoch: 15
INFO:dataset.image_video_dataset:epoch is incremented. current_epoch: 14, epoch: 15
steps: 100%|████████████████████████████████████████████████████████████████████| 3300/3300 [2:41:46<00:00, 2.94s/it, avr_loss=nan]
saving checkpoint: D:/Msubi/musubi-tuner/Output/P4ulink4_test2\P4ulink4_512_test_2.safetensors
INFO:main:model saved.
steps: 100%|████████████████████████████████████████████████████████████████████| 3300/3300 [2:41:46<00:00, 2.94s/it, avr_loss=nan]
13:05:57-021080 INFO Training has ended.

@PiDrawings

Found where the issue was: using float16 instead of bf16, which caused avr_loss: nan. It is working fine with the bf16 settings.
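
For anyone else hitting this: in the log above the run was launched with --mixed_precision fp16, and switching that setting to bf16 in the GUI is what fixed it for me. Below is a minimal sketch (the file name is just one of the checkpoints from the log above; requires torch and safetensors) for checking whether an already-saved LoRA checkpoint contains NaN/Inf weights before bothering to test it in ComfyUI:

```python
# Minimal sketch: scan a saved LoRA checkpoint for NaN/Inf tensors.
# A run that logged avr_loss=nan will typically produce weights like this,
# so such a file can be discarded without testing it in ComfyUI first.
# Requires torch and safetensors; the path below is just an example.
import torch
from safetensors.torch import load_file

state_dict = load_file("P4ulink4_512_test_2-000001.safetensors")
bad = [name for name, tensor in state_dict.items()
       if torch.isnan(tensor).any() or torch.isinf(tensor).any()]
print(f"{len(bad)} of {len(state_dict)} tensors contain NaN/Inf")
```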
