Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Dual stream cudnn Convolution backward() with MXNET_GPU_WORKER_NSTREAMS=2. #14006

Merged
merged 11 commits into from
Feb 24, 2019

Conversation

DickJC123
Copy link
Contributor

Description

This PR adds a 2nd 'auxiliary' stream to a RunContext and makes it available to operators. This PR includes a modification of the Backward() operation of the cudnn implementation of Convolution to run both the dgrad and wgrad kernels in parallel, rather than in series. For large batchsizes (e.g. 256), each of these kernels consumes all of a GPU and the training performance improvement is negligible. However, when the per-GPU batchsize is small (e.g. 32, as is desirable for fast 'scale out' training that maintains accuracy), the performance improvement is in the neighborhood of 2-3%. More details in a follow-up comment.

By default, this PR does not affect the behavior of the framework. However, by setting the environment variable MXNET_GPU_WORKER_NSTREAMS=2, the cudnn Convolution backward dgrad and wgrad will be run in separate streams. The resulting speed-up comes with a modest downside- now the kernel workspace areas cannot be shared, so the model global memory footprint size grows by 2-3% in the case of Resnet50. This is of no consequence for the main application area of this new feature- small batchsize (per GPU) training.

The bottom line is this can be a useful optional knob for users, particularly those attempting to duplicate published MLPERF results. This PR includes a test of cudnn vs. no_cudnn Convolution with MXNET_GPU_WORKER_NSTREAMS set to both 1 and 2.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [ X] Changes are complete (i.e. I finished coding on this PR)
  • [C ] All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • [ C] Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [ X] To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@marcoabreu
Copy link
Contributor

Instead of introducing another "knob". couldn't we make the system "smart" enough to figure out the right option on its own? I'm afraid that a high amount of configuration options might fear off users or they'll simply never hear about them. This could lead to first experiments to go worse than MXNet is capable of because the default settings were not optimal.

@DickJC123 DickJC123 requested a review from szha as a code owner January 29, 2019 03:10
@DickJC123
Copy link
Contributor Author

I think we're not at the point of having the framework be smart enough to make this trade-off, which involves a potential performance increase at the expense of a larger model global memory footprint. I think users would be upset if suddenly they have to drop their batchsize due to an out-of-memory error (and lose perf) because the framework was ill-advisedly stretching for the ultimate efficiency. On the other hand, allowing experts to fine-tune performance for production uses and demonstrators is important. I think a next step toward what you're asking is a storage allocation that is allowed to return a null pointer (rather than immediately exiting). That would allow operators to take sensible fall-back approaches if memory is exhausted. This would be a different PR after some discussion.

In the meantime, I still recommend adding this knob in the form of the MXNET_GPU_WORKER_NSTREAMS environment variable. I think it natural to first introduce a facility with a manual control knob, then evolve to a point where the framework picks the best setting. The control knob could be retained quietly in the background to support testing and to prove that the automatic selection is performing correctly.

@szha
Copy link
Member

szha commented Jan 30, 2019

Will take a look soon. RNN can benefit from the additional stream too.

@marcoabreu
Copy link
Contributor

Good point, thanks

@vandanavk
Copy link
Contributor

@mxnet-label-bot update [CUDA, Operator, pr-work-in-progress]

@marcoabreu marcoabreu added pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review labels Feb 5, 2019
@DickJC123
Copy link
Contributor Author

@szha I reworked the implementation in response to your suggestions. Eager for your impression. I'm tracking down a CI issue with spawning test processes on CentOS 7 (don't see it on Ubuntu 16.04). I'm hoping that will result in only a minor tweek from the current state of the PR.

@DickJC123
Copy link
Contributor Author

DickJC123 commented Feb 16, 2019

I've rerun some perf analysis of this PR, which I'll remind everyone changes nothing in the default. However, when I set MXNET_GPU_WORKER_NSTREAMS=2, I see higher performance for all batchsizes. The perf gains I measured on a run across 8 Volta GPUs of Resnet50 v1b (also with horovod and DALI in NVIDIA's MXNet container) were:

batchsize  32: 6.0% speedup
batchsize  64: 0.8% speedup
batchsize 128: 1.6% speedup
batchsize 256: 0.4% speedup

The primary application area of the PR is one of scale-out training across multiple nodes, where a too-large global batchsize can impact final accuracy (thus driving per-GPU batchsize down). The RN50 global memory increase was from 1.4% (bs 32) to 2.6% (bs 256).

This work is no longer "in progress." Requesting final review thanks. @szha @marcoabreu

@szha szha removed the pr-work-in-progress PR is still work in progress label Feb 16, 2019
@szha szha requested a review from eric-haibin-lin February 16, 2019 04:21
@@ -174,6 +174,12 @@ When USE_PROFILER is enabled in Makefile or CMake, the following environments ca

## Other Environment Variables

* MXNET_GPU_WORKER_NSTREAMS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives me the impression that nstream may support more than 2 in the future. However, the c interface only returns an aux stream. Do you foresee that more than 1 aux stream is needed to help accelerating parallel computation in the future? If so, do we need to update the interface to reflect that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short answer is that 'yes', an operator with 3 inputs might make use of 3 streams in Backward(), so I did not want to propose an environment variable name like MXNET_GPU_WORKER_USE_DUAL_STREAM=0/1 that might soon become obsolete. On the other hand, Convolution only needs 2 streams, and I did not want to burden this enhancement with more complexity than is needed at this time. I propose that when we have a use-case for 3 or more streams, then we can expand the implementation and employ the use-case in our testing of it.

At the end of every kernel execution, there is a fall-off in GPU utilization leading up to the completion of the last grid block. When two streams are being used, these utilization gaps can be filled by work from the second stream. I would guess that having 3 streams would not enhance this effect. On the other hand, let's say you had 3 small independent kernels that each would occupy a third of the GPU. You could see how having 3 streams would be a win in this case over 2 streams.

So it's good that you ask, how might we expand this to 3 or more streams? The MXNET_GPU_WORKER_NSTREAMS environment variable would remain unchanged, though the documentation would indicate that the framework supports a value greater than 2. Legacy env-var uses would be preserved so I think this could happen as part of a minor release. At the RunContext level, a GPUAuxStream* would be replaced by a std::vector<GPUAuxStream*>. The RunContext method get_gpu_aux_stream() might then be changed to RunContex::get_gpu_aux_stream(int aux_stream_id = 0), which would not break operator code that started using the simpler aux_stream API proposed by this PR.

@DickJC123
Copy link
Contributor Author

After the rework of this PR to make it far simpler to use within operators, I went back and re-measured the 1 GPU training speeds. The perf gains I measured on a run across 1 32g Volta GPU of Resnet50 v1b (also with DALI in NVIDIA's MXNet container) were:

batchsize  32: 2.9% speedup
batchsize  64: 1.4% speedup
batchsize 128: 0.95% speedup
batchsize 256: 0.15% speedup

The speedup is based on a comparison of the 2nd epoch "time cost", where the 1st epoch time is not considered because of cuDNNFind() and DALI overheads that are unique to the 1st epoch.

Single GPU training is not really the target of this PR, but at least this shows there's still a 1% improvement at a typical batchsize of 128. I don't recommend enabling 2 streams by default however because the increased use of global memory might make some user's models too big to run.

Looking for any further reviewer input.

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. @szha could you check if the code change addresses your concerns?

@szha
Copy link
Member

szha commented Feb 24, 2019

yes my concerns are addressed with a clean solution. thanks @DickJC123 @ptrendx

@szha szha merged commit 5f32f32 into apache:master Feb 24, 2019
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Feb 28, 2019
…MS=2. (apache#14006)

* Dual stream conv backward().  Enable with MXNET_GPU_WORKER_NSTREAMS=2.

* Fix for MSVC compiler.

* Fix cpplint.

* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.

* Improve test function and commenting.

* Add description of proper aux stream use using events.

* RAII rework to simplify usage within operators.

* Fix cpplint.

* Expand testing to cover all engines.

* Fix NaiveEngine shutdown segfault on CentOS7.
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Feb 28, 2019
…MS=2. (apache#14006)

* Dual stream conv backward().  Enable with MXNET_GPU_WORKER_NSTREAMS=2.

* Fix for MSVC compiler.

* Fix cpplint.

* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.

* Improve test function and commenting.

* Add description of proper aux stream use using events.

* RAII rework to simplify usage within operators.

* Fix cpplint.

* Expand testing to cover all engines.

* Fix NaiveEngine shutdown segfault on CentOS7.
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Feb 28, 2019
…MS=2. (apache#14006)

* Dual stream conv backward().  Enable with MXNET_GPU_WORKER_NSTREAMS=2.

* Fix for MSVC compiler.

* Fix cpplint.

* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.

* Improve test function and commenting.

* Add description of proper aux stream use using events.

* RAII rework to simplify usage within operators.

* Fix cpplint.

* Expand testing to cover all engines.

* Fix NaiveEngine shutdown segfault on CentOS7.
drivanov pushed a commit to drivanov/incubator-mxnet that referenced this pull request Mar 4, 2019
…MS=2. (apache#14006)

* Dual stream conv backward().  Enable with MXNET_GPU_WORKER_NSTREAMS=2.

* Fix for MSVC compiler.

* Fix cpplint.

* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.

* Improve test function and commenting.

* Add description of proper aux stream use using events.

* RAII rework to simplify usage within operators.

* Fix cpplint.

* Expand testing to cover all engines.

* Fix NaiveEngine shutdown segfault on CentOS7.
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
…MS=2. (apache#14006)

* Dual stream conv backward().  Enable with MXNET_GPU_WORKER_NSTREAMS=2.

* Fix for MSVC compiler.

* Fix cpplint.

* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.

* Improve test function and commenting.

* Add description of proper aux stream use using events.

* RAII rework to simplify usage within operators.

* Fix cpplint.

* Expand testing to cover all engines.

* Fix NaiveEngine shutdown segfault on CentOS7.
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…MS=2. (apache#14006)

* Dual stream conv backward().  Enable with MXNET_GPU_WORKER_NSTREAMS=2.

* Fix for MSVC compiler.

* Fix cpplint.

* Add MXNET_GPU_WORKER_NSTREAMS env var documentation.

* Improve test function and commenting.

* Add description of proper aux stream use using events.

* RAII rework to simplify usage within operators.

* Fix cpplint.

* Expand testing to cover all engines.

* Fix NaiveEngine shutdown segfault on CentOS7.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants