Speed is low when running train_imagenet.py #488

chinakook · 2018-11-28T12:41:19Z

CASE 1:

MXNET_ENABLE_GPU_P2P=0 python3 train_imagenet.py \
  --rec-train /media/bluews/WSData/jiaojie3/imagenet2012/train_rec/train_rec.rec \
  --rec-train-idx /media/bluews/WSData/jiaojie3/imagenet2012/train_rec/train_rec.idx \
  --rec-val /media/bluews/WSData/jiaojie3/imagenet2012/val/val.rec \
  --rec-val-idx /media/bluews/WSData/jiaojie3/imagenet2012/val/val.idx \
  --input-size 224 \
  --model resnet18_v1 --mode hybrid \
  --lr 0.1 --lr-mode cosine --num-epochs 120 --batch-size 48 --num-gpus 4 -j 2 \
  --use-rec --dtype float16 --warmup-epochs 5 --last-gamma --no-wd --label-smoothing \
  --save-dir params_imagenet \
  --logging-file imagenet_best.log

speed log:

Epoch[0] Batch [49]	Speed: 801.173036 samples/sec	accuracy=0.001146	lr=0.000147
Epoch[0] Batch [99]	Speed: 800.704017 samples/sec	accuracy=0.001198	lr=0.000297
Epoch[0] Batch [149]	Speed: 800.938578 samples/sec	accuracy=0.001250	lr=0.000447
Epoch[0] Batch [199]	Speed: 796.856581 samples/sec	accuracy=0.001302	lr=0.000597
Epoch[0] Batch [249]	Speed: 403.190169 samples/sec	accuracy=0.001250	lr=0.000746
Epoch[0] Batch [299]	Speed: 108.597701 samples/sec	accuracy=0.001267	lr=0.000896

CASE 2:

Same command as CASE 1, but with -j 4

speed log:

Epoch[0] Batch [49]	Speed: 106.327262 samples/sec	accuracy=0.001146	lr=0.000147

CASE 3:

Same command as CASE 1, but with -j 20

speed log:

Epoch[0] Batch [49]	Speed: 2299.076671 samples/sec	accuracy=0.001250	lr=0.000147
Epoch[0] Batch [99]	Speed: 257.348066 samples/sec	accuracy=0.001302	lr=0.000297
Epoch[0] Batch [149]	Speed: 106.823927 samples/sec	accuracy=0.001250	lr=0.000447
Epoch[0] Batch [199]	Speed: 107.819103 samples/sec	accuracy=0.001406	lr=0.000597
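
For longer logs, the point where throughput collapses can be located with a small script. The following is only a sketch based on the log format shown above; the function names and the 0.5 threshold are illustrative assumptions, not part of train_imagenet.py.

```python
import re

# Matches the speed field of log lines such as
# "Epoch[0] Batch [49]\tSpeed: 801.173036 samples/sec\t..."
LINE_RE = re.compile(
    r"Epoch\[(\d+)\]\s+Batch\s+\[(\d+)\]\s+Speed:\s+([\d.]+)\s+samples/sec"
)


def parse_speeds(log_text):
    """Return a list of (epoch, batch, samples_per_sec) tuples."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(log_text)
    ]


def find_drops(speeds, ratio=0.5):
    """Return entries slower than `ratio` times the peak speed seen so far.

    The default ratio of 0.5 is an arbitrary, hypothetical threshold.
    """
    drops = []
    peak = 0.0
    for epoch, batch, speed in speeds:
        peak = max(peak, speed)
        if speed < ratio * peak:
            drops.append((epoch, batch, speed))
    return drops


# The CASE 3 log from above, used as sample input.
sample_log = (
    "Epoch[0] Batch [49]\tSpeed: 2299.076671 samples/sec\taccuracy=0.001250\tlr=0.000147\n"
    "Epoch[0] Batch [99]\tSpeed: 257.348066 samples/sec\taccuracy=0.001302\tlr=0.000297\n"
    "Epoch[0] Batch [149]\tSpeed: 106.823927 samples/sec\taccuracy=0.001250\tlr=0.000447\n"
    "Epoch[0] Batch [199]\tSpeed: 107.819103 samples/sec\taccuracy=0.001406\tlr=0.000597\n"
)

if __name__ == "__main__":
    for epoch, batch, speed in find_drops(parse_speeds(sample_log)):
        print(f"epoch {epoch} batch {batch}: {speed:.1f} samples/sec")
```

On the CASE 3 log this flags batches 99, 149, and 199, which is where the speed falls from about 2299 to about 107 samples/sec.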


zhreshold · 2018-11-29T00:24:38Z

This looks like a thread contention problem caused by OpenCV. I remember it was fixed in apache/mxnet#12025.

It may be due to something else, though, since it's using RecordIO.

zhreshold · 2018-11-29T00:25:12Z

Which version of MXNet are you using? Which commit?

chinakook · 2018-11-29T04:15:01Z

MXNet master, compiled by myself. The problem was solved when I copied the dataset from a Western Digital Gold HDD to a Samsung SSD.

zhreshold · 2018-11-29T17:51:36Z

@chinakook Okay, I see: it's stressing the random-access performance of the disk.
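
One rough way to check whether a disk is the limiting factor is to compare sequential and random read throughput on the same file. The snippet below is a minimal sketch, not part of MXNet or GluonCV: the function name, block size, and read count are hypothetical choices, and the OS page cache can inflate the numbers, so results are only indicative for files much larger than RAM or on a cold cache.

```python
import os
import random
import time


def read_throughput(path, block_size=4096, num_reads=256,
                    random_access=False, seed=0):
    """Read `num_reads` blocks from `path` and return throughput in MB/s.

    With random_access=False the blocks are read in file order;
    with random_access=True they are read at shuffled offsets.
    All parameter defaults are illustrative assumptions.
    """
    num_blocks = os.path.getsize(path) // block_size
    if num_blocks == 0:
        raise ValueError("file is smaller than one block")

    if random_access:
        rng = random.Random(seed)
        offsets = [rng.randrange(num_blocks) * block_size
                   for _ in range(num_reads)]
    else:
        offsets = [(i % num_blocks) * block_size for i in range(num_reads)]

    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        for offset in offsets:
            f.seek(offset)
            total_bytes += len(f.read(block_size))
    elapsed = time.perf_counter() - start

    return total_bytes / max(elapsed, 1e-9) / 1e6
```

For example, calling `read_throughput(rec_path, random_access=True)` and `read_throughput(rec_path, random_access=False)` on the training .rec file (with `rec_path` being a placeholder for its location) gives two numbers to compare. A large gap between them on an HDD would be consistent with the seek-bound behavior described above.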

yiwuyao3863 · 2019-03-01T03:39:28Z

@zhreshold
I have this problem too.
When I use Gluon and Horovod to train resnet101 on a single server with 4 GPUs, the training speed drops after 100 steps, from 400 samples/s to 80 samples/s. The training set contains 4819499 samples, and the batch size per GPU is 60.
How can I solve this problem if I still use a disk as storage? Thanks~

github-actions · 2021-05-23T06:41:06Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label May 23, 2021

github-actions bot closed this as completed May 30, 2021
