-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Conversation
Hey @ptrendx , Thanks for submitting the PR
CI supported jobs: [clang, centos-cpu, windows-gpu, windows-cpu, unix-cpu, website, miscellaneous, sanity, edge, unix-gpu, centos-gpu] Note: |
Forgot to add in the description that most of the work here was done by @DickJC123. |
.set_attr<FCompute>("FCompute<gpu>", EigvalsOpForward<gpu>); | ||
|
||
#if MXNET_USE_CUSOLVER == 1 | ||
|
||
NNVM_REGISTER_OP(_npi_eigvalsh) | ||
.set_attr<FIsCUDAGraphsCompatible>("FIsCUDAGraphsCompatible", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
istead of setting false everywhere, can we just check if hasAttr("FIsCUDAGraphsCompatible")
so that by default its false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the "everywhere" here is actually way less than if we went the other way around (only a bunch of operators with FCompute
has synchronization that is not allowed under graphs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the long term goal here is actually to make those excluded operators compatible too.
static auto& fgraphcompatible = Op::GetAttr<FIsCUDAGraphsCompatible>("FIsCUDAGraphsCompatible"); | ||
const auto& attrs = exec->attrs; | ||
if (attrs.op != nullptr) { | ||
const auto f = fgraphcompatible.get(attrs.op, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand - I do want to have the function call per Op
here, not just a simple true
/false
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats the point of registering a lambda function that just returns false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess look here: /~https://github.com/apache/incubator-mxnet/pull/19142/files#diff-789523bf443903e74acfa010a5d6b572R33-R37 - this is for dropout, which uses random resource for training (and thus is excluded), but is just a passthrough for inference (and so we want to include it there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya that one is fine, but "_npi_eig" is always false...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then I would need to add that function for almost all operators for it to return true...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We went with the default to be include the operator, but instead have the functionality itself be non-default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the majority of ops supported to be included in cuda graphs? If a new user comes in to write an op, do they need to be aware of how to handle cuda graph support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the first question: yes - nearly all of the FCompute functions that do not use random resource are compatible with graphs.
For the second question - for FCompute operators to not be compatible you need to do synchronization inside the operator - either via stream synchronize or allocation. You generally do not want to do either one of those as it really hurts performance (and the operators that in this PR I marked incompatible in this PR do just that). If you just launch a kernel (or multiple), which is the case for vast majority of the operators, then you are good and do not even need to think about graphs - it will just work.
I'm still exploring the ways of automatically testing newly added operators in order for the feature to be able to be on by default, but I do not consider this the scope of this PR, as v1.x branch is not really supposed to get many more operators (I will do that in the PR to master). Generally this would involve testing operators with hybridize(static_alloc=True, static_shape=True)
(which generally should be tested much more as right now testing of this functionality is really limited, even though it is widely used).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be ok since its disabled by default, and only enabled explicitly by setting the env var.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @ptrendx mentioned, I coded much of this facility, as embodied in the cuda_graphs.h file. I have reviewed that file carefully and found it to be true to the version we have been using in NVIDIA's container version of MXNET. Thanks @ptrendx for adding the FIsCUDAGraphsCompatible attribute, as is needed for a robust CUDA Graphs support, and other improvements.
LGTM.
* Initial cherry-pick * Store NodeAttrs in OpExecutor * Do not allow stateful operations in CUDA graphs and provide mechanism for marking ops as safe * Guard against using ops with synchronization * Cleaning * Properly guard graphs * Limit graphs to CUDA 10.2+ * Fix the compilation when graphs are not available * Guarding the libcuda.so usage behind RTC compilation flag * Document the env variables * Add test * Fix the test * Use with_environment
* Initial cherry-pick * Store NodeAttrs in OpExecutor * Do not allow stateful operations in CUDA graphs and provide mechanism for marking ops as safe * Guard against using ops with synchronization * Cleaning * Properly guard graphs * Limit graphs to CUDA 10.2+ * Fix the compilation when graphs are not available * Guarding the libcuda.so usage behind RTC compilation flag * Document the env variables * Add test * Fix the test * Use with_environment
* Initial cherry-pick * Store NodeAttrs in OpExecutor * Do not allow stateful operations in CUDA graphs and provide mechanism for marking ops as safe * Guard against using ops with synchronization * Cleaning * Properly guard graphs * Limit graphs to CUDA 10.2+ * Fix the compilation when graphs are not available * Guarding the libcuda.so usage behind RTC compilation flag * Document the env variables * Add test * Fix the test * Use with_environment
* Initial cherry-pick * Store NodeAttrs in OpExecutor * Do not allow stateful operations in CUDA graphs and provide mechanism for marking ops as safe * Guard against using ops with synchronization * Cleaning * Properly guard graphs * Limit graphs to CUDA 10.2+ * Fix the compilation when graphs are not available * Guarding the libcuda.so usage behind RTC compilation flag * Document the env variables * Add test * Fix the test * Use with_environment
* [1.x][FEATURE] CUDA graphs support (#19142) * Initial cherry-pick * Store NodeAttrs in OpExecutor * Do not allow stateful operations in CUDA graphs and provide mechanism for marking ops as safe * Guard against using ops with synchronization * Cleaning * Properly guard graphs * Limit graphs to CUDA 10.2+ * Fix the compilation when graphs are not available * Guarding the libcuda.so usage behind RTC compilation flag * Document the env variables * Add test * Fix the test * Use with_environment * Fix compile and test_cuda_graphs * Fix lint * Mark more ops as not CUDA Graphs compatible * Mark some linalg ops as not CUDA Graphs compatible * Marked 2 ops CUDA Graphs incompatible due to cpu->gpu copy * Mark cuDNN Dropout as fully CUDA Graphs compatible. Reenable tests. * clang-tidy fixes * More clang-tidy fixes * Avoid CUDA_CALL(e): improper macro expansion * Add compile guard to Dropout's FIsCUDAGraphsCompatible def * Temporarily add '-s' to pytest serial tests * Fix DropoutOp.dropout_passthrough_ handling for CUDA Graphs * Adapt test_gluon_gpu.py::test_cuda_graphs for gluon2.0 * Create CUDA Graph 'dot' files if MXNET_CUDA_GRAPHS_DBG_FILE=<file_prefix> * Fix clang-tidy * Fix more clang-tidy * Skip test_np_standard_binary_funcs test of 0-dim array broadcast * Improve test_rnn_layers_fp{16,32} invocation * Run test_rnn_layers_fp32 only when cuDNN is present * Fix potential out-of-bounds write in count_sketch.cu * Add temp output to debug centos crash * Mark InstanceNorm and LeakyRELU as not CUDA Graphs compatible * Ops calling FStatefulCompute* are not CUDA Graphs compatible by default * Fix clang-tidy * Revert "Add temp output to debug centos crash" This reverts commit e013a85. * Quiet 'unused variable' compilation warning * Trigger CI * Check of FCreateOpState removed given new check for FStatefulCompute* * Revert "Temporarily add '-s' to pytest serial tests" This reverts commit 5a2f847. Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
Description
CUDA graphs is a feature of CUDA 10, which enables lowering CPU overhead by bundling multiple kernel launches together.
The main limitation of CUDA graphs is that they do require the graph to be static - no parameters to the kernels can change, otherwise the graph needs to be recaptured. That is why this feature is currently only enabled for the symbolic models and Gluon models hybridized with
hybridize(static_alloc=True, static_shape=True)
.The feature is not enabled by default and requires environment variable
MXNET_ENABLE_CUDA_GRAPHS
to be set. In order to not capture the operations, the execution of which may change during the course of the training job, stateful operators and operators relying on resources other than workspace are not included in the graph.Since the feature lowers the CPU overhead of the execution, the impact is most visible in the inference or scale-out workloads with small batch size. Let us consider following script comparing fp16, batch size 1 inference of RN50v2 model from GluonCV with and without graphs:
The result obtained on V100 16GB:
so over 17% increase in performance. The same script but with 128 batch size gives much smaller improvement:
so 0.3%.
Checklist
Essentials