Speed is low when running train_imagenet.py #488

Closed
chinakook opened this issue Nov 28, 2018 · 6 comments

@chinakook
Member

CASE 1:

MXNET_ENABLE_GPU_P2P=0 python3 train_imagenet.py \
  --rec-train /media/bluews/WSData/jiaojie3/imagenet2012/train_rec/train_rec.rec \
  --rec-train-idx /media/bluews/WSData/jiaojie3/imagenet2012/train_rec/train_rec.idx \
  --rec-val /media/bluews/WSData/jiaojie3/imagenet2012/val/val.rec \
  --rec-val-idx /media/bluews/WSData/jiaojie3/imagenet2012/val/val.idx \
  --input-size 224 \
  --model resnet18_v1 --mode hybrid \
  --lr 0.1 --lr-mode cosine --num-epochs 120 --batch-size 48 --num-gpus 4 -j 2 \
  --use-rec --dtype float16 --warmup-epochs 5 --last-gamma --no-wd --label-smoothing \
  --save-dir params_imagenet \
  --logging-file imagenet_best.log

speed log:

Epoch[0] Batch [49]	Speed: 801.173036 samples/sec	accuracy=0.001146	lr=0.000147
Epoch[0] Batch [99]	Speed: 800.704017 samples/sec	accuracy=0.001198	lr=0.000297
Epoch[0] Batch [149]	Speed: 800.938578 samples/sec	accuracy=0.001250	lr=0.000447
Epoch[0] Batch [199]	Speed: 796.856581 samples/sec	accuracy=0.001302	lr=0.000597
Epoch[0] Batch [249]	Speed: 403.190169 samples/sec	accuracy=0.001250	lr=0.000746
Epoch[0] Batch [299]	Speed: 108.597701 samples/sec	accuracy=0.001267	lr=0.000896

CASE 2:

Same command as CASE 1, but with -j 4

speed log:

Epoch[0] Batch [49]	Speed: 106.327262 samples/sec	accuracy=0.001146	lr=0.000147

CASE 3:

Same command as CASE 1, but with -j 20

speed log:

Epoch[0] Batch [49]	Speed: 2299.076671 samples/sec	accuracy=0.001250	lr=0.000147
Epoch[0] Batch [99]	Speed: 257.348066 samples/sec	accuracy=0.001302	lr=0.000297
Epoch[0] Batch [149]	Speed: 106.823927 samples/sec	accuracy=0.001250	lr=0.000447
Epoch[0] Batch [199]	Speed: 107.819103 samples/sec	accuracy=0.001406	lr=0.000597
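
To pinpoint where throughput collapses in logs like the ones above, a small parser can help. This is only a sketch: it assumes the log line format shown in this thread, and the names `parse_speeds` and `first_drop` and the 0.5 threshold are illustrative choices, not part of the training script.

```python
import re

# Matches lines in the format shown in the logs above, e.g.
# "Epoch[0] Batch [49]  Speed: 801.173036 samples/sec  accuracy=..."
LINE_RE = re.compile(
    r"Epoch\[(\d+)\]\s+Batch\s+\[(\d+)\]\s+Speed:\s+([\d.]+)\s+samples/sec"
)


def parse_speeds(log_text):
    """Return a list of (epoch, batch, samples_per_sec) tuples."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            entries.append(
                (int(match.group(1)), int(match.group(2)), float(match.group(3)))
            )
    return entries


def first_drop(entries, ratio=0.5):
    """Return the first entry slower than `ratio` times the first logged
    speed, or None if throughput never falls that far."""
    if not entries:
        return None
    baseline = entries[0][2]
    for entry in entries:
        if entry[2] < ratio * baseline:
            return entry
    return None
```

Calling `first_drop(parse_speeds(open("imagenet_best.log").read()))` on the log file named in CASE 1 would return the first batch at which speed fell below half of the initial rate.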
@zhreshold
Member

This looks like a thread contention problem caused by OpenCV. I remember it was fixed in apache/mxnet#12025.

It may be due to something else, though, since it's using RecordIO.

@zhreshold
Member

Which version of MXNet are you using? What's the commit number?

@chinakook
Member Author

chinakook commented Nov 29, 2018

MXNet master, compiled by myself. The problem was solved when I copied the dataset from a Western Digital Gold HDD to a Samsung SSD.

@zhreshold
Member

@chinakook Okay, I get it: the training is stressing the random-access performance of the disk.
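
One way to check whether a disk is the bottleneck is to compare sequential and random read throughput on the dataset file. The following is a minimal sketch using only the Python standard library. The function name `read_throughput`, the block size, and the read count are assumptions made for illustration; this does not reproduce MXNet's actual access pattern, and the OS page cache can inflate the numbers for data that was read recently.

```python
import os
import random
import time


def read_throughput(path, block_size=128 * 1024, num_reads=512,
                    sequential=True, seed=0):
    """Read up to `num_reads` blocks of `block_size` bytes from `path`
    and return the observed throughput in MiB/s."""
    size = os.path.getsize(path)
    max_blocks = max(size // block_size, 1)
    num_reads = min(num_reads, max_blocks)
    if sequential:
        # Consecutive blocks from the start of the file.
        offsets = [i * block_size for i in range(num_reads)]
    else:
        # Block-aligned offsets scattered across the whole file.
        rng = random.Random(seed)
        offsets = [rng.randrange(max_blocks) * block_size
                   for _ in range(num_reads)]
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 disables Python-level buffering (not the OS page cache).
    with open(path, "rb", buffering=0) as f:
        for offset in offsets:
            f.seek(offset)
            total_bytes += len(f.read(block_size))
    elapsed = time.perf_counter() - start
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)
```

Running it twice on the `.rec` file, once with `sequential=True` and once with `sequential=False`, gives a rough comparison. A much lower random-read figure on an HDD than on an SSD would be consistent with the behaviour reported in this thread.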

@yiwuyao3863

@zhreshold
I ran into this problem too.
When I use Gluon and Horovod to train resnet101 on a single server with 4 GPUs, the training speed drops after 100 steps, from 400 samples/s to 80 samples/s. The size of the training set is 4819499, and the batch size for each GPU is 60.
How can I solve this problem if I still use a disk as storage? Thanks~

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
