Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I am trying to fine-tune the InternVL model on Google Colab using a Tesla T4 GPU. However, I am encountering the following error: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
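For context, the FlashAttention CUDA kernels require compute capability 8.0 (Ampere) or newer, and the Tesla T4 is a Turing GPU with compute capability 7.5, which is why the kernel refuses to run. A minimal, InternVL-agnostic way to check what the runtime reports:

import torch

# FlashAttention needs compute capability >= (8, 0); a Tesla T4 reports (7, 5).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("This GPU cannot run FlashAttention kernels; an eager/naive attention path is required.")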
Reproduction
To resolve this, I have already set the attention parameter in the config file to "eager", but the issue persists. I have attached the relevant file, named "flash_attention_error", for reference.
Could you please cross-check and provide guidance on resolving this issue?
Environment
Environment Details:
Platform: Google Colab
GPU: Tesla T4
Configuration: flash_attention = "eager" in the config file
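The library versions visible later in this log can be collected directly for bug reports (a generic snippet, not an InternVL utility):

import torch, transformers, deepspeed

# Versions relevant to this report; the log below shows DeepSpeed 0.16.4
# and a transformers 4.37-era TrainingArguments dump.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("deepspeed:", deepspeed.__version__)
print("GPU:", torch.cuda.get_device_name(0))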
Error traceback
+ GPUS=1
+ BATCH_SIZE=16
+ PER_DEVICE_BATCH_SIZE=1
+ GRADIENT_ACC=16
+ pwd
+ export PYTHONPATH=/env/python:/content/InternVL/internvl_chat
+ export MASTER_PORT=34229
+ export TF_CPP_MIN_LOG_LEVEL=3
+ export LAUNCHER=pytorch
+ OUTPUT_DIR=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora
+ [ ! -d work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora ]
+ tee -a work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/training_log.txt
+ torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=1 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path OpenGVLab/InternVL2_5-1B --conv_style internvl2_5 --use_fast_tokenizer False --output_dir work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora --meta_path /content/InternVL/internvl_chat/shell/data/train_metadata.jsonl --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 1 --gradient_accumulation_steps 16 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 8192 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage1_config.json --report_to tensorboard
[2025-02-21 15:58:45,181] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740153530.347317 11210 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740153530.353577 11210 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
[2025-02-21 15:58:52,771] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-02-21 15:58:52,771] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
02/21/2025 15:58:52 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
02/21/2025 15:58:52 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage1_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=4e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora/runs/Feb21_15-58-52_a33d935af6d7,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.COSINE,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v2_5/internvl2_5_1b_dynamic_res_2nd_finetune_lora,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=200,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=1,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.01,
)
02/21/2025 15:58:52 - INFO - __main__ - Loading Tokenizer: OpenGVLab/InternVL2_5-1B
[INFO|tokenization_utils_base.py:2027] 2025-02-21 15:58:53,096 >> loading file vocab.json from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/vocab.json
[INFO|tokenization_utils_base.py:2027] 2025-02-21 15:58:53,096 >> loading file merges.txt from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/merges.txt
[INFO|tokenization_utils_base.py:2027] 2025-02-21 15:58:53,096 >> loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/added_tokens.json
[INFO|tokenization_utils_base.py:2027] 2025-02-21 15:58:53,096 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/special_tokens_map.json
[INFO|tokenization_utils_base.py:2027] 2025-02-21 15:58:53,096 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/tokenizer_config.json
[INFO|tokenization_utils_base.py:2027] 2025-02-21 15:58:53,096 >> loading file tokenizer.json from cache at None
[WARNING|logging.py:314] 2025-02-21 15:58:53,382 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
02/21/2025 15:58:53 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:729] 2025-02-21 15:58:53,497 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/config.json
[INFO|configuration_utils.py:792] 2025-02-21 15:58:53,498 >> Model config InternVLChatConfig {
"_commit_hash": "f27984381d99c1f2da11989d3216ca7b5bb51721",
"architectures": [
"InternVLChatModel"
],
"auto_map": {
"AutoConfig": "OpenGVLab/InternVL2_5-1B--configuration_internvl_chat.InternVLChatConfig",
"AutoModel": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel",
"AutoModelForCausalLM": "OpenGVLab/InternVL2_5-1B--modeling_internvl_chat.InternVLChatModel"
},
"downsample_ratio": 0.5,
"dynamic_image_size": true,
"force_image_size": 448,
"hidden_size": 896,
"llm_config": {
"_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
"add_cross_attention": false,
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": 151643,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": 151645,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "silu",
"hidden_size": 896,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 4864,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 32768,
"max_window_layers": 21,
"min_length": 0,
"model_type": "qwen2",
"no_repeat_ngram_size": 0,
"num_attention_heads": 14,
"num_beam_groups": 1,
"num_beams": 1,
"num_hidden_layers": 24,
"num_key_value_heads": 2,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sep_token_id": null,
"sliding_window": 32768,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": false,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "bfloat16",
"torchscript": false,
"transformers_version": "4.37.2",
"typical_p": 1.0,
"use_bfloat16": true,
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151674
},
"max_dynamic_patch": 12,
"min_dynamic_patch": 1,
"model_type": "internvl_chat",
"pad2square": false,
"ps_version": "v2",
"select_layer": -1,
"template": "internvl2_5",
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": null,
"use_backbone_lora": 0,
"use_llm_lora": 0,
"use_thumbnail": true,
"vision_config": {
"_name_or_path": "",
"add_cross_attention": false,
"architectures": [
"InternVisionModel"
],
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"drop_path_rate": 0.0,
"dropout": 0.0,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "gelu",
"hidden_size": 1024,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"image_size": 448,
"initializer_factor": 1.0,
"initializer_range": 0.02,
"intermediate_size": 4096,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-06,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "intern_vit_6b",
"no_repeat_ngram_size": 0,
"norm_type": "layer_norm",
"num_attention_heads": 16,
"num_beam_groups": 1,
"num_beams": 1,
"num_channels": 3,
"num_hidden_layers": 24,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"patch_size": 14,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"qk_normalization": false,
"qkv_bias": true,
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "bfloat16",
"torchscript": false,
"transformers_version": "4.37.2",
"typical_p": 1.0,
"use_bfloat16": true,
"use_flash_attn": true
}
}
02/21/2025 15:58:53 - INFO - __main__ - Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3476] 2025-02-21 15:58:53,500 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/model.safetensors
[INFO|modeling_utils.py:1426] 2025-02-21 15:58:53,519 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2025-02-21 15:58:53,520 >> Generate config GenerationConfig {}
[INFO|configuration_utils.py:826] 2025-02-21 15:58:53,576 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
[INFO|modeling_utils.py:4350] 2025-02-21 15:58:56,056 >> All model checkpoint weights were used when initializing InternVLChatModel.
[INFO|modeling_utils.py:4358] 2025-02-21 15:58:56,056 >> All the weights of InternVLChatModel were initialized from the model checkpoint at OpenGVLab/InternVL2_5-1B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:781] 2025-02-21 15:58:56,168 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--OpenGVLab--InternVL2_5-1B/snapshots/f27984381d99c1f2da11989d3216ca7b5bb51721/generation_config.json
[INFO|configuration_utils.py:826] 2025-02-21 15:58:56,169 >> Generate config GenerationConfig {
"eos_token_id": [
151644,
151645,
151643
]
}
02/21/2025 15:58:56 - INFO - __main__ - Finished
02/21/2025 15:58:56 - INFO - __main__ - model.config.force_image_size: 448
02/21/2025 15:58:56 - INFO - __main__ - data_args.force_image_size: 448
02/21/2025 15:58:56 - INFO - __main__ - model.config.vision_config.image_size: 448
02/21/2025 15:58:56 - INFO - __main__ - [Dataset] num_image_token: 256
02/21/2025 15:58:56 - INFO - __main__ - [Dataset] dynamic_image_size: True
02/21/2025 15:58:56 - INFO - __main__ - [Dataset] use_thumbnail: True
02/21/2025 15:58:56 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
02/21/2025 15:58:56 - INFO - __main__ - Formatting inputs...Skip in lazy mode
02/21/2025 15:58:56 - INFO - __main__ - Add dataset: data with length: 100
trainable params: 8,798,208 || all params: 638,496,128 || trainable%: 1.3779579255334184
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.gate_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.up_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.k_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.k_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.o_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.1.self_attn.o_proj.lora_B.default.weight
...........
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.up_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.down_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.20.mlp.down_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.q_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.q_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.k_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.k_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.v_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.v_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.o_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.self_attn.o_proj.lora_B.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.21.mlp.gate_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_A.default.weight
02/21/2025 15:58:57 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_B.default.weight
[INFO|trainer.py:571] 2025-02-21 15:58:57,196 >> Using auto half precision backend
[2025-02-21 15:58:57,725] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.4, git-hash=unknown, git-branch=unknown
[2025-02-21 15:58:57,726] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 1
[2025-02-21 15:58:58,800] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.05335521697998047 seconds
[2025-02-21 15:58:58,858] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-02-21 15:58:58,859] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-02-21 15:58:58,888] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-02-21 15:58:58,888] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2025-02-21 15:58:58,888] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2025-02-21 15:58:58,889] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2025-02-21 15:58:58,889] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2025-02-21 15:58:58,889] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-02-21 15:58:58,889] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2025-02-21 15:58:59,291] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-02-21 15:58:59,291] [INFO] [utils.py:782:see_memory_usage] MA 1.99 GB Max_MA 2.01 GB CA 2.13 GB Max_CA 2 GB
[2025-02-21 15:58:59,292] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 5.43 GB, percent = 42.9%
[2025-02-21 15:58:59,653] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-02-21 15:58:59,654] [INFO] [utils.py:782:see_memory_usage] MA 1.99 GB Max_MA 2.03 GB CA 2.16 GB Max_CA 2 GB
[2025-02-21 15:58:59,654] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 5.43 GB, percent = 42.9%
[2025-02-21 15:58:59,654] [INFO] [stage_1_and_2.py:550:__init__] optimizer state initialized
[2025-02-21 15:59:00,007] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-02-21 15:59:00,008] [INFO] [utils.py:782:see_memory_usage] MA 1.99 GB Max_MA 1.99 GB CA 2.16 GB Max_CA 2 GB
[2025-02-21 15:59:00,008] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 5.43 GB, percent = 42.9%
[2025-02-21 15:59:00,014] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-02-21 15:59:00,015] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2025-02-21 15:59:00,015] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7aff197db090>
[2025-02-21 15:59:00,015] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2025-02-21 15:59:00,020] [INFO] [config.py:1001:print] DeepSpeedEngine configuration:
[2025-02-21 15:59:00,021] [INFO] [config.py:1005:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2025-02-21 15:59:00,021] [INFO] [config.py:1005:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-02-21 15:59:00,021] [INFO] [config.py:1005:print] amp_enabled .................. False
[2025-02-21 15:59:00,021] [INFO] [config.py:1005:print] amp_params ................... False
[2025-02-21 15:59:00,021] [INFO] [config.py:1005:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2025-02-21 15:59:00,021] [INFO] [config.py:1005:print] bfloat16_enabled ............. True
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] bfloat16_immediate_grad_update False
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] checkpoint_parallel_write_pipeline False
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] checkpoint_tag_validation_enabled True
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] checkpoint_tag_validation_fail False
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7aff197ee2d0>
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] communication_data_type ...... None
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] curriculum_enabled_legacy .... False
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] curriculum_params_legacy ..... False
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] data_efficiency_enabled ...... False
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] dataloader_drop_last ......... False
[2025-02-21 15:59:00,022] [INFO] [config.py:1005:print] disable_allgather ............ False
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] dump_state ................... False
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] dynamic_loss_scale_args ...... None
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_enabled ........... False
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_gas_boundary_resolution 1
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_layer_name ........ bert.encoder.layer
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_layer_num ......... 0
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_max_iter .......... 100
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_stability ......... 1e-06
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_tol ............... 0.01
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] eigenvalue_verbose ........... False
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] elasticity_enabled ........... False
[2025-02-21 15:59:00,023] [INFO] [config.py:1005:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] fp16_auto_cast ............... None
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] fp16_enabled ................. False
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] fp16_master_weights_and_gradients False
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] global_rank .................. 0
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] grad_accum_dtype ............. None
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] gradient_accumulation_steps .. 16
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] gradient_clipping ............ 1.0
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] gradient_predivide_factor .... 1.0
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] graph_harvesting ............. False
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] initial_dynamic_scale ........ 1
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] load_universal_checkpoint .... False
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] loss_scale ................... 1.0
[2025-02-21 15:59:00,024] [INFO] [config.py:1005:print] memory_breakdown ............. False
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] mics_hierarchial_params_gather False
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] mics_shard_size .............. -1
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] optimizer_legacy_fusion ...... False
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] optimizer_name ............... adamw
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] pld_enabled .................. False
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] pld_params ................... False
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] prescale_gradients ........... False
[2025-02-21 15:59:00,025] [INFO] [config.py:1005:print] scheduler_name ............... None
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] scheduler_params ............. None
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] seq_parallel_communication_data_type torch.float32
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] sparse_attention ............. None
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] sparse_gradients_enabled ..... False
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] steps_per_print .............. inf
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] timers_config ................ enabled=True synchronized=True
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] train_batch_size ............. 16
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] train_micro_batch_size_per_gpu 1
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] use_data_before_expert_parallel_ False
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] use_node_local_storage ....... False
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] wall_clock_breakdown ......... True
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] weight_quantization_config ... None
[2025-02-21 15:59:00,026] [INFO] [config.py:1005:print] world_size ................... 1
[2025-02-21 15:59:00,027] [INFO] [config.py:1005:print] zero_allow_untested_optimizer False
[2025-02-21 15:59:00,027] [INFO] [config.py:1005:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-02-21 15:59:00,027] [INFO] [config.py:1005:print] zero_enabled ................. True
[2025-02-21 15:59:00,027] [INFO] [config.py:1005:print] zero_force_ds_cpu_optimizer .. True
[2025-02-21 15:59:00,027] [INFO] [config.py:1005:print] zero_optimization_stage ...... 1
[2025-02-21 15:59:00,027] [INFO] [config.py:991:print_user_config] json = {
"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 1.000000e+09,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+09,
"contiguous_gradients": true
},
"fp16": {
"enabled": false,
"auto_cast": true,
"loss_scale": 0,
"initial_scale_power": 32,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 4e-05,
"betas": [0.9, 0.999],
"eps": 1e-08,
"weight_decay": 0.01
}
},
"gradient_accumulation_steps": 16,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": true
}
[INFO|trainer.py:1721] 2025-02-21 15:59:00,027 >> ***** Running training *****
[INFO|trainer.py:1722] 2025-02-21 15:59:00,027 >> Num examples = 100
[INFO|trainer.py:1723] 2025-02-21 15:59:00,027 >> Num Epochs = 1
[INFO|trainer.py:1724] 2025-02-21 15:59:00,027 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2025-02-21 15:59:00,028 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1728] 2025-02-21 15:59:00,028 >> Gradient Accumulation steps = 16
[INFO|trainer.py:1729] 2025-02-21 15:59:00,028 >> Total optimization steps = 6
[INFO|trainer.py:1730] 2025-02-21 15:59:00,031 >> Number of trainable parameters = 8,798,208
0%|| 0/6 [00:00<?, ?it/s]
[2025-02-21 15:59:02,577] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740153547.073111 11342 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740153547.079620 11342 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2025-02-21 15:59:12,953] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740153557.378199 11422 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740153557.384515 11422 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2025-02-21 15:59:23,219] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740153567.963827 11502 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740153567.970943 11502 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2025-02-21 15:59:33,099] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740153579.027746 11584 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740153579.034483 11584 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
petrel_client is not installed. Using PIL to load images.
[rank0]: Traceback (most recent call last):
[rank0]: File "/content/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1072, in<module>
[rank0]: main()
[rank0]: File "/content/InternVL/internvl_chat/internvl/train/internvl_chat_finetune.py", line 1057, in main
[rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 1539, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 1869, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2772, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2795, in compute_loss
[rank0]: outputs = model(**inputs)
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1987, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py", line 164, in forward
[rank0]: vit_embeds = self.extract_feature(pixel_values)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py", line 274, in extract_feature
[rank0]: vit_embeds = self.vision_model(
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py", line 414, in forward
[rank0]: encoder_outputs = self.encoder(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py", line 345, in forward
[rank0]: layer_outputs = torch.utils.checkpoint.checkpoint(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_compile.py", line 32, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/checkpoint.py", line 489, in checkpoint
[rank0]: return CheckpointFunction.apply(function, preserve, *args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/autograd/function.py", line 575, in apply
[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/utils/checkpoint.py", line 264, in forward
[rank0]: outputs = run_function(*args)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py", line 291, in forward
[rank0]: hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states).to(hidden_states.dtype)) * self.ls1)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py", line 247, in forward
[rank0]: x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py", line 239, in _flash_attn
[rank0]: context, _ = self.inner_attn(
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/InternVL/internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py", line 72, in forward
[rank0]: output = flash_attn_varlen_qkvpacked_func(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/flash_attn/flash_attn_interface.py", line 1267, in flash_attn_varlen_qkvpacked_func
[rank0]: return FlashAttnVarlenQKVPackedFunc.apply(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/autograd/function.py", line 575, in apply
[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/flash_attn/flash_attn_interface.py", line 553, in forward
[rank0]: out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 1116, in __call__
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_library/autograd.py", line 113, in autograd_impl
[rank0]: result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_library/autograd.py", line 40, in forward_no_grad
[rank0]: result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_ops.py", line 721, in redispatch
[rank0]: return self._handle.redispatch_boxed(keyset, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_library/custom_ops.py", line 324, in backend_impl
[rank0]: result = self._backend_fns[device_type](*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_compile.py", line 32, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/flash_attn/flash_attn_interface.py", line 170, in _flash_attn_varlen_forward
[rank0]: out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
0%|| 0/6 [00:50<?, ?it/s]
[rank0]:[W221 15:59:50.953384675 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0221 15:59:51.817000 11198 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 11210) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 10, in<module>sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
internvl/train/internvl_chat_finetune.py FAILED
I have modified the configuration following others' suggestions, but I noticed that flash_attn is still imported in the script. How should I change this?
Traceback (most recent call last):
File "myworkdir/InternVL-main/internvl_chat/internvl/train/internvl_chat_finetune.py", line 35, in
from internvl.patch import (concat_pad_data_collator,
File "myworkdir/InternVL-main/internvl_chat/internvl/patch/init.py", line 7, in
from .internlm2_packed_training_patch import replace_internlm2_attention_class
File "myworkdir/InternVL-main/internvl_chat/internvl/patch/internlm2_packed_training_patch.py", line 8, in
from flash_attn.flash_attn_interface import flash_attn_varlen_func
ModuleNotFoundError: No module named 'flash_attn'
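One common workaround for this import failure (a sketch, not the repo's official fix) is to guard the flash_attn import in the patch module so the package still loads on machines without FlashAttention:

# Hypothetical guard for internvl/patch/internlm2_packed_training_patch.py:
# degrade gracefully when flash_attn is not installed.
try:
    from flash_attn.flash_attn_interface import flash_attn_varlen_func
    HAS_FLASH_ATTN = True
except ImportError:
    flash_attn_varlen_func = None
    HAS_FLASH_ATTN = False

Any code path that actually applies the packed-training patch would then need to check HAS_FLASH_ATTN (a name introduced here purely for illustration) before using it.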
I looked at the code: they pass "flash_attention_2" directly instead of reading the attention implementation from the config, so this repo won't help you here. I haven't tried it myself, but you could try /~https://github.com/InternLM/xtuner instead; maybe that one will help you.
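If you still want to try forcing eager attention with this repo, note that the traceback above fails inside the vision tower (modeling_intern_vit.py), which picks its path via self.use_flash_attn (see the frame "x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)" in the traceback), and the config dump shows "use_flash_attn": true in vision_config. Below is a hedged sketch of overriding that flag at load time; the field name is taken from the config printed above, and treating it as a writable override is an assumption, not a documented API:

from transformers import AutoConfig, AutoModel

# "use_flash_attn" appears in the vision_config dump earlier in this issue;
# setting it to False should route the ViT through _naive_attn instead.
config = AutoConfig.from_pretrained("OpenGVLab/InternVL2_5-1B", trust_remote_code=True)
config.vision_config.use_flash_attn = False

model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL2_5-1B",
    config=config,
    trust_remote_code=True,
)

The LLM side is separate: as noted above, the training script passes "flash_attention_2" directly, so that string would still need to be changed to "eager" in internvl_chat_finetune.py.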