Hi, the XFeat model is a great idea, but I encountered some difficulties while training it.
My graphics card is an RTX 4090, and my training hyperparameters are as follows:
parser.add_argument('--megadepth_root_path', type=str, default='/sdb1/zyh/DataSet/MegaDepth_v1_0/phoenix/S6/zl548',
help='Path to the MegaDepth dataset root directory.')
parser.add_argument('--synthetic_root_path', type=str, default='/home/zyh/zyh_project/accelerated_features/modules/dataset/coco_20k',
help='Path to the synthetic dataset root directory.')
parser.add_argument('--ckpt_save_path', type=str, default='/home/zyh/zyh_project/accelerated_features/models/default_ckpts',
help='Path to save the checkpoints.')
parser.add_argument('--training_type', type=str, default='xfeat_default',
choices=['xfeat_default', 'xfeat_synthetic', 'xfeat_megadepth'],
help='Training scheme. xfeat_default uses both megadepth & synthetic warps.')
parser.add_argument('--batch_size', type=int, default=20,
help='Batch size for training. Default is 20.')
parser.add_argument('--n_steps', type=int, default=80000,
help='Number of training steps. Default is 80000.')
parser.add_argument('--lr', type=float, default=3e-4,
help='Learning rate. Default is 0.0003.')
parser.add_argument('--gamma_steplr', type=float, default=0.5,
help='Gamma value for StepLR scheduler. Default is 0.5.')
parser.add_argument('--training_res', type=lambda s: tuple(map(int, s.split(','))),
default=(800, 608), help='Training resolution as width,height. Default is (800, 608).')
parser.add_argument('--device_num', type=str, default='0',
help='Device number to use for training. Default is "0".')
parser.add_argument('--dry_run', action='store_true',
help='If set, perform a dry run training with a mini-batch for sanity check.')
parser.add_argument('--save_ckpt_every', type=int, default=1000,
help='Save checkpoints every N steps. Default is 1000.')
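As an aside, the `--training_res` flag above relies on a lambda to turn a comma-separated string into an integer tuple. A minimal standalone sketch of how that flag parses (only the flag name and default are taken from the parser above; everything else is illustrative):

```python
import argparse

# Minimal reproduction of the --training_res flag from the parser above.
parser = argparse.ArgumentParser()
parser.add_argument('--training_res',
                    type=lambda s: tuple(map(int, s.split(','))),
                    default=(800, 608),
                    help='Training resolution as width,height.')

# Passing "1024,768" on the command line yields the tuple (1024, 768);
# omitting the flag falls back to the default (800, 608).
args = parser.parse_args(['--training_res', '1024,768'])
print(args.training_res)                    # (1024, 768)
print(parser.parse_args([]).training_res)   # (800, 608)
```

Note that the default is already a tuple, so the lambda only runs when the flag is actually passed on the command line.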
I use the default training mode to train on the COCO synthetic dataset and the MegaDepth dataset, but my training time is particularly long:
/home/zyh/anaconda3/envs/patchcore/bin/python /home/zyh/zyh_project/accelerated_features/modules/training/train.py
/home/zyh/anaconda3/envs/patchcore/lib/python3.8/site-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3549.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
[Synthetic] Found a total of 20000 images for training..
loading train: 100%|██████████| 3000/3000 [00:50<00:00, 59.57it/s]
loading test: 100%|██████████| 5/5 [00:00<00:00, 47.28it/s]
[MegaDepth] Loading metadata: 0%| | 0/441 [00:00<?, ?it/s]NPZ Path: /sdb1/zyh/DataSet/MegaDepth_v1_0/phoenix/S6/zl548/train_data/megadepth_indices/scene_info_0.1_0.7
Found npz files: 441
[MegaDepth] Loading metadata: 100%|██████████| 441/441 [02:03<00:00, 3.58it/s]
0%| | 0/80000 [00:00<?, ?it/s]/home/zyh/zyh_project/accelerated_features/modules/training/losses.py:105: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
kpts = F.log_softmax(kpts)
Loss: 7.6785 acc_c0 0.048 acc_c1 0.158 acc_f: 0.000 loss_c: 7.767 loss_f: 7.874 loss_kp: 0.175 #matches_c: 1749 loss_kp_pos: 12.227 acc_kp_pos: 0.213: 0%| | 359/80000 [37:17<137:05:34, 6.20s/it]
As you can see, the estimated training time has reached nearly 137 hours, which is clearly unacceptable. My graphics card's memory usage is 9006 MiB.
Your paper mentions that the model can converge within 36 hours on a 4090 graphics card, but I don't know how to achieve this. You also mention that disk I/O is the main speed bottleneck and that it can be addressed with more careful data preparation. Could you give me some guidance on the code side? I would greatly appreciate a reply.
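Since the question is about hiding disk I/O behind GPU compute, here is a minimal, hypothetical sketch (not the repository's actual loader) of the background-prefetch pattern, using only the standard library: a worker thread reads and decodes samples ahead of time into a bounded queue, so the training loop rarely blocks on disk.

```python
import queue
import threading

def prefetch(load_fn, indices, depth=8):
    """Yield samples loaded by a background thread.

    load_fn -- function that reads/decodes one sample (stands in for disk I/O)
    indices -- iterable of sample indices
    depth   -- max number of samples buffered ahead of the consumer
    """
    buf = queue.Queue(maxsize=depth)
    sentinel = object()

    def worker():
        for i in indices:
            buf.put(load_fn(i))   # blocks when the buffer is full
        buf.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            break
        yield item

# Toy usage: "loading" is just squaring the index here; in real training,
# load_fn would read and pre-resize one image from disk.
samples = list(prefetch(lambda i: i * i, range(5)))
print(samples)  # [0, 1, 4, 9, 16]
```

In a PyTorch codebase the same effect is usually achieved by raising `num_workers` (and enabling `pin_memory`) on the `DataLoader`, and by pre-resizing the MegaDepth images offline so each read is small; both are assumptions about this setup rather than confirmed fixes from the authors.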