[Dependency Update] Upgrade cuDNN & NCCL #14884

stu1130 · 2019-05-05T23:52:30Z

Description

Upgrade the CUDA 9.0/9.2/10.0 with latest cuDNN 7.5.1 & NCCL 2.4.2

Checklist

Run three models ResNet50 with ImageNet & LSTM with PTB & MLP with MNIST
Performance shown below
Environment: P3.16xlarge Deep Learning Base AMI
Codebase: commit 1540a84
I also applied the #14837 PR change
The unit of thoughput is samples/per second
Each throughput is calcuated by average of 5 runs

ResNet

model: Resnet50
dataset: Imagenet
number of gpu: 8
epochs: 3 (only to test throughput)
preprocess command: sudo pip install gluoncv==0.2.0b20180625
command: python mxnet_benchmark/train_imagenet.py --use-rec --batch-size 128 --dtype float32 —num-data-workers 40 —num-epochs 3 —gpus 0,1,2,3,4,5,6,7 --lr 0.05 --last-gamma —mode symbolic —model resnet50_v1b —rec-train /home/ubuntu/data/train-passthrough.rec —rec-train-idx /home/ubuntu/data/train-passthrough.idx —rec-val /home/ubuntu/data/val-passthrough.rec —rec-val-idx /home/ubuntu/data/val-passthrough.idx
github repo: /~https://github.com/rahul003/deep-learning-benchmark-mirror.git*

Throughput Tables	cuDNN 7.5.1/NCCL 2.4.2	cuDNN 7.3.1/NCCL 2.3.4	Perforamnce Difference
CUDA 10	2831.54405	2821.9832	0.339%
CUDA 9.2	2832.36803	2843.28968	-0.384%
CUDA 9.0	2815.83939	2851.92915	-1.265%

**There is another performance regression with --batch-size 256 --dtype float16 --mode hybrid, please find more details on #14838

LSTM

model: LSTM
dataset: PTB(Penn Treebank)
number of gpu: 1
epochs: 10
command:
python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 10 --metrics-suffix test --kvstore local
python word_language_model/lstm_bucketing.py —num-hidden 650 —num-embed 650 —gpus 0 --epochs 10 --kv-store local

Throughput Tables	cuDNN 7.5.1/NCCL 2.4.2	cuDNN 7.3.1/NCCL 2.3.4	Perforamnce Difference
CUDA 10	847.98222	868.28966	-2.339%
CUDA 9.2	1005.25185	1051.06692	-4.359%
CUDA 9.0	1002.59081	1028.46962	-1.265%

The CUDA 10 have a performance regression issue, please see #14725 to find more details.

MLP

model: 3 dense layers with num_hidden=64 and relu as activation
dataset: MNIST
number of gpu: 1
epochs: 10
command:
python2 benchmark_runner.py —framework mxnet —metrics-policy mlp —task-name mlp —metrics-suffix test —num-gpus 1 —command-to-execute 'python3 mlp.py' —data-set mnist

Throughput Tables	cuDNN 7.5.1/NCCL 2.4.2	cuDNN 7.3.1/NCCL 2.3.4	Perforamnce Difference
CUDA 10	4192.20685	4094.76838	2.38%
CUDA 9.2	4212.68214	4280.69164	-1.589%
CUDA 9.0	4232.10159	4273.43268	-0.967%

Comments

@szha @lanking520 @eric-haibin-lin

anirudhacharya · 2019-05-06T17:17:54Z

@mxnet-label-bot add [pr-awaiting-review]

This reverts commit 0255dd6.

* upgrade cuDNN & NCCL * retrigger CI

…che#14910) This reverts commit 0255dd6.

* upgrade cuDNN & NCCL * retrigger CI

…che#14910) This reverts commit 0255dd6.

stu1130 requested a review from szha as a code owner May 5, 2019 23:52

szha approved these changes May 6, 2019

View reviewed changes

upgrade cuDNN & NCCL

28af771

stu1130 force-pushed the upgrade_cudnn_nccl branch from 94dac09 to 28af771 Compare May 6, 2019 17:12

marcoabreu added the pr-awaiting-review PR is waiting for code review label May 6, 2019

retrigger CI

41c29f5

szha merged commit 0255dd6 into apache:master May 6, 2019

perdasilva mentioned this pull request May 7, 2019

[v1.4.x] Backport of pip package scripts #14906

Merged

perdasilva added a commit to perdasilva/incubator-mxnet that referenced this pull request May 7, 2019

Revert "[Dependency Update] Upgrade cuDNN & NCCL (apache#14884)"

aba740b

This reverts commit 0255dd6.

perdasilva mentioned this pull request May 7, 2019

Revert "[Dependency Update] Upgrade cuDNN & NCCL (#14884)" #14910

Merged

5 tasks

szha pushed a commit that referenced this pull request May 8, 2019

Revert "[Dependency Update] Upgrade cuDNN & NCCL (#14884)" (#14910)

0ddef13

This reverts commit 0255dd6.

access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019

[Dependency Update] Upgrade cuDNN & NCCL (apache#14884)

cb9e987

* upgrade cuDNN & NCCL * retrigger CI

access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019

Revert "[Dependency Update] Upgrade cuDNN & NCCL (apache#14884)" (apa…

69b99c5

…che#14910) This reverts commit 0255dd6.

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

[Dependency Update] Upgrade cuDNN & NCCL (apache#14884)

e99064b

* upgrade cuDNN & NCCL * retrigger CI

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

Revert "[Dependency Update] Upgrade cuDNN & NCCL (apache#14884)" (apa…

48af343

…che#14910) This reverts commit 0255dd6.

stu1130 deleted the upgrade_cudnn_nccl branch January 12, 2020 01:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Dependency Update] Upgrade cuDNN & NCCL #14884

[Dependency Update] Upgrade cuDNN & NCCL #14884

stu1130 commented May 5, 2019 •

edited

Loading

anirudhacharya commented May 6, 2019

[Dependency Update] Upgrade cuDNN & NCCL #14884

[Dependency Update] Upgrade cuDNN & NCCL #14884

Conversation

stu1130 commented May 5, 2019 • edited Loading

Description

Checklist

ResNet

LSTM

MLP

Comments

anirudhacharya commented May 6, 2019

stu1130 commented May 5, 2019 •

edited

Loading