Fix Tree Reduction on new instance type p3dn.24xlarge #13852

ctcyang · 2019-01-11T20:17:49Z

Description

Fixes tree reduction failing on new instance type p3dn.24xlarge, which was raised here: dmlc/gluon-nlp#520

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Adds a fallback for cudaDeviceGetP2PAttribute topology information to use cudaDeviceEnablePeerAccess when the former is inconsistent across instance types.
Fixes tree reduction breaking when cudaDeviceGetP2PAttribute provides wrong information.

Comments

On p3dn.24xlarge using CUDA 9.0, we observe that the cudaDeviceGetP2PAttribute topology looks different compared to when we use CUDA 9.0 with p3.16xlarge:

  // Check that all P2P connections are detected by GetP2PAttribute
  // If yes, then continue as before
  // If not, then fallback to using cudaDeviceEnablePeerAccess saved in p2p_matrix 
  //   (from EnableP2P() in src/kvstore/comm_tree.h)
  //
  // We have observed that with CUDA 9.0 p3.16xlarge:
  //
  //   0 2 2 3 3 1 1 1                . v v v v . . .
  //   2 0 3 2 1 3 1 1                v . v v . v . .
  //   2 3 0 3 1 1 2 1                v v . v . . v .
  //   3 2 3 0 1 1 1 2                v v v . . . . v
  //   3 1 1 1 0 2 2 3                v . . . . v v v
  //   1 3 1 1 2 0 3 2                . v . . v . v v
  //   1 1 2 1 2 3 0 3                . . v . v v . v
  //   1 1 1 2 3 2 3 0                . . . v v v v .
  //
  //        matrix                       p2p_matrix
  // cudaDeviceGetP2PAttribute   cudaDeviceEnablePeerAccess
  //
  // Here, they are correctly detected, because the 2s and 3s correspond to
  // links that have P2P connections between them. However for CUDA 9.0 p3dn.24xlarge:
  //
  //   0 2 2 1 1 1 1 1                . v v v v . . .
  //   2 0 1 2 1 1 1 1                v . v v . v . .
  //   2 1 0 1 1 1 2 1                v v . v . . v .
  //   1 2 1 0 1 1 1 2                v v v . . . . v
  //   1 1 1 1 0 2 2 1                v . . . . v v v
  //   1 1 1 1 2 0 1 2                . v . . v . v v
  //   1 1 2 1 2 1 0 1                . . v . v v . v
  //   1 1 1 2 1 2 1 0                . . . v v v v .
  //  
  //        matrix                      p2p_matrix
  // cudaDeviceGetP2PAttribute   cudaDeviceEnablePeerAccess
  //
  // The fastest connections (3 i.e. double NVLink) are not recognized as being a
  //   connection

…net-1 into fix_gputopo

src/kvstore/gpu_topology.h

codecov-io · 2019-01-11T22:09:43Z

Codecov Report

Merging #13852 into master will decrease coverage by 9.59%.
The diff coverage is 0%.

@@            Coverage Diff            @@
##           master   #13852     +/-   ##
=========================================
- Coverage   81.64%   72.04%   -9.6%     
=========================================
  Files         721      717      -4     
  Lines       80158    81483   +1325     
  Branches     3233     3233             
=========================================
- Hits        65441    58704   -6737     
- Misses      13847    21898   +8051     
- Partials      870      881     +11

Impacted Files	Coverage Δ
src/kvstore/comm_tree.h	`0% <0%> (-95.89%)`	⬇️
src/kvstore/gpu_topology.h	`84.9% <0%> (-6.21%)`	⬇️
src/operator/nn/mkldnn/mkldnn_concat-inl.h	`0% <0%> (-100%)`	⬇️
src/operator/subgraph/mkldnn/mkldnn_conv-inl.h	`0% <0%> (-100%)`	⬇️
src/operator/cudnn_spatial_transformer-inl.h	`0% <0%> (-100%)`	⬇️
src/operator/cudnn_bilinear_sampler-inl.h	`0% <0%> (-100%)`	⬇️
...erator/tensor/elemwise_binary_broadcast_op-inl.cuh	`0% <0%> (-100%)`	⬇️
...erator/quantization/mkldnn/mkldnn_requantize-inl.h	`0% <0%> (-98.58%)`	⬇️
src/operator/nn/mkldnn/mkldnn_lrn-inl.h	`0% <0%> (-98.49%)`	⬇️
src/operator/nn/mkldnn/mkldnn_concat.cc	`1.56% <0%> (-98.44%)`	⬇️
... and 386 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6ed619...1ca7aab. Read the comment docs.

anirudhacharya · 2019-01-11T22:56:24Z

@mxnet-label-bot add [pr-awaiting-review]

* add fallback for gpu topology detection using CUDA 9.2 * add fallback for gpu topology detection using CUDA 9.2 * add log * update 3rdparty to master * add fallback for gpu topology detection using CUDA 9.2 * add log * update 3rdparty to master * bring 3rdparty packages to upstream/master * rebase to master * Update gpu_topology.h

Carl Yang added 11 commits August 7, 2018 23:38

add fallback for gpu topology detection using CUDA 9.2

0ed8991

add fallback for gpu topology detection using CUDA 9.2

dad855f

Merge branch 'fix_gputopo' of /~https://github.com/ctcyang/incubator-mx…

103dcab

…net-1 into fix_gputopo

add log

ad2baee

update 3rdparty to master

497efac

add fallback for gpu topology detection using CUDA 9.2

df2262e

add log

4ba1a83

update 3rdparty to master

5135309

bring 3rdparty packages to upstream/master

9ae5723

Merge branch 'fix_gputopo' of /~https://github.com/ctcyang/incubator-mx…

d415d22

…net-1 into fix_gputopo

rebase to master

50d6630

eric-haibin-lin reviewed Jan 11, 2019

View reviewed changes

src/kvstore/gpu_topology.h Show resolved Hide resolved

Update gpu_topology.h

1ca7aab

marcoabreu added the pr-awaiting-review PR is waiting for code review label Jan 11, 2019

eric-haibin-lin approved these changes Jan 12, 2019

View reviewed changes

eric-haibin-lin merged commit 0d76675 into apache:master Jan 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Tree Reduction on new instance type p3dn.24xlarge #13852

Fix Tree Reduction on new instance type p3dn.24xlarge #13852

ctcyang commented Jan 11, 2019 •

edited

Loading

codecov-io commented Jan 11, 2019

anirudhacharya commented Jan 11, 2019

Fix Tree Reduction on new instance type p3dn.24xlarge #13852

Fix Tree Reduction on new instance type p3dn.24xlarge #13852

Conversation

ctcyang commented Jan 11, 2019 • edited Loading

Description

Checklist

Essentials

Changes

Comments

codecov-io commented Jan 11, 2019

Codecov Report

anirudhacharya commented Jan 11, 2019

ctcyang commented Jan 11, 2019 •

edited

Loading