Add NHWC layout support to Pooling (cpu, gpu cuda, gpu cuDNN) #13749
Conversation
@mxnet-label-bot add [pr-awaiting-review]
src/operator/nn/pool_utils.h
Outdated
  }
};

template<typename DType>
struct lp_grad<DType, 3> {
  static MSHADOW_XINLINE DType Map(const DType grad, const DType in_data, const DType out_data) {
-    return grad * in_data * in_data / (out_data * out_data);
+    // Avoid nan result if both grad and out_data are 0.
+    return (grad == DType(0.0)) ? DType(0.0) : grad * in_data * in_data / (out_data * out_data);
Why not also check out_data == DType(0.0) here if it can be zero, to avoid overflow?
I introduced this to quiet some nans that were causing test failures. I'm reviewing the specifics to see if a better solution exists as suggested in your comment, so stay tuned.
I've pushed my solution to your comment in commit 098bc49.
The grad == 0.0 check that you highlighted only worked because of a quirk of our check_consistency() routine in test_utils.py, which uses the symbol's forward() output as the gradient. Per your suggestion, I'm now checking out_data == 0 as the more general way of quieting the test failures.
The test failures I was seeing often occurred in float16 lp-3 pooling. For example, take the case where a pooling window of 2 covers identical inputs 2^-9 and 2^-9. The forward output for this case is the cube root of (2^-9)^3 + (2^-9)^3. If this is calculated in float16, the 2^-27 terms underflow to 0 and the output is 0. The backward output is then grad * 2^-9 * 2^-9 / (0 * 0) = +inf (or nan if grad is also 0). When performed in float32, no underflow occurs in the forward op, and the +infs are avoided in the backward op.
My conclusion: float16 is ill-equipped to perform the forward pooling calculation for lp-2 and lp-3. Part of my solution here thus involves promoting the calculation to float32 for the cpu and mxnet cuda implementations of float16-i/o pooling. This is consistent with other operators like float16-i/o Convolution and BatchNorm, which perform their internal calculations in float32. I've run test_pooling_versions() thousands of times with no failures in this mode.
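To make the arithmetic above concrete, here is a minimal numpy sketch of the same underflow (illustrative only, not the actual MXNet kernel code; the float32 promotion mirrors what the fix does internally):

import numpy as np

# lp-3 pooling over a window of two identical inputs 2**-9, computed in float16.
x = np.float16(2.0 ** -9)
cube_sum = x * x * x + x * x * x                  # each 2**-27 term underflows to 0.0
fwd_fp16 = np.cbrt(cube_sum)                      # forward output: 0.0
grad = np.float16(1.0)
bwd_fp16 = grad * x * x / (fwd_fp16 * fwd_fp16)   # inf (nan if grad were also 0)

# Promoting the accumulation to float32 avoids the underflow entirely.
x32 = np.float32(x)
fwd_fp32 = np.cbrt(x32 ** 3 + x32 ** 3)           # ~0.00246, finite
bwd_fp32 = np.float32(grad) * x32 * x32 / (fwd_fp32 * fwd_fp32)   # finite, ~0.63

print(fwd_fp16, bwd_fp16)   # 0.0 inf
print(fwd_fp32, bwd_fp32)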
src/operator/nn/pooling.cc
Outdated
-  return std::vector<std::pair<int, int> >();
-#else
-  return std::vector<std::pair<int, int> >{{1, 0}};
+#if MXNET_USE_MKLDNN == 1 && MXNET_USE_CUDA == 0 && MXNET_USE_CUDNN == 0
Ummm, do you have a case to validate this scenario? If MXNet is built with both MKL-DNN and cuDNN enabled (e.g. the mxnet-cu90mkl package) and the user gives ctx=cpu, is there any functionality or performance regression?
So unfortunately, all enabled backend operator implementations must live with the same FInplaceOptions. I voiced concern about this restriction to @eric-haibin-lin and he graciously put some thought into #13598. The issue has seen no activity in a month, but I'd like to see it reawakened.
Until then, the code you highlight enforces that the most conservative "least common denominator" FInplaceOptions (i.e. none) are in effect for the enabled Pooling implementations. Since an operator cannot rely on the in-place sharing actually occurring, there should be no functional regression. And performance-wise, a user must understand that, at this point, enabling multiple backends may have some performance consequences due to this FInplaceOption implementation.
Sorry for any confusion. Perhaps my comments in the code for FInplaceOption speak to the issue more directly:
// Since this routine is not provided the cpu/gpu context info, only in the case
// where CUDA and CUDNN implementations are not available can we be sure the MKLDNN
// implementation will be employed. The MKLDNN FInplaceOptions are not compatible
// with the other (i.e. cpu, cuda and cudnn) implementations.
Thanks for the explanation. I understand the current limitation here.
My concern is a performance regression on the CPU side when the mxnet-cuXXmkl binary is used, which is actually the widely used binary (e.g. by Sockeye, GluonNLP, GluonCV, etc.).
Until issue #13598 is resolved (btw, we're working on this now), I think we need to keep this line as-is.
mxnet-cuXX-mkl pip binary statically links cudnn and will not be affected. @szha correct me if I am wrong.
The potentially affected cases are:
- build mxnet with mkldnn and cuda, but without cudnn
- build mxnet with neither mkldnn nor cuda.
As this in-place option only matters for the backward pass during training, I see the two cases above as rare. Therefore, I do not think this is a blocking issue for this PR.
Nonetheless, I suggest @DickJC123 create a GitHub issue for this change so that people are aware and it can be tracked later, and preferably run some performance testing to get a rough idea how big an impact it has for the above two cases.
Let's continue discussions on #13598 to avoid such problems in the future.
I originally set out to create a test case (and separate issue) to demonstrate the need for the restrictive #if clause I put around the MKLDNN FInplaceOption. I was unable to do so, and in the process I came to realize that I had misunderstood the meaning of the {1,0} option. In fact, it allows the framework to share the backward '1' input (the unused workspace gradient) with the backward '0' output (the pooling input gradient) when the additional MKLDNN workspace is present. This sharing is compatible with the CUDA and cuDNN Pooling implementations, which do not use a workspace or its gradient! I'm therefore reverting back to a simple code guard around the MKLDNN FInplaceOption, one whose logic mirrors the other MKLDNN/no-MKLDNN switches of behavior within nn/pooling.cc. No new issue is required. With my last commit I've also expanded the pooling tests to cover cases that have the in-place tensor sharing, which requires the pooling input to be the same size as the output (and also then the workspace).
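As an illustration of the "input same size as output" precondition (a hedged sketch with hypothetical shapes, using standard Pooling parameters):

import mxnet as mx

# A 3x3 kernel with stride 1 and pad 1 keeps H and W unchanged, so the pooled
# output (and the MKLDNN workspace, when present) matches the input shape --
# the case in which the in-place sharing described above can actually occur.
x = mx.nd.random.uniform(shape=(1, 8, 16, 16))    # hypothetical NCHW input
y = mx.nd.Pooling(data=x, kernel=(3, 3), stride=(1, 1), pad=(1, 1),
                  pool_type='max')
print(x.shape, y.shape)    # both (1, 8, 16, 16)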
Good to know MKLDNN impl is not affected
@@ -104,7 +104,8 @@ class MKLDNNPoolingBwd {
 inline bool SupportMKLDNNPooling(const PoolingParam &param) {
   return param.kernel.ndim() == 2 &&
          (param.pool_type == pool_enum::kMaxPooling ||
-          param.pool_type == pool_enum::kAvgPooling);
+          param.pool_type == pool_enum::kAvgPooling) &&
+         (!param.layout.has_value() || param.layout.value() == mshadow::kNCHW);
Thank you for changing this. Just to clarify, the MKL-DNN pooling primitive does support the NHWC input format. But when we integrated the MKL-DNN backend into MXNet, we assumed that user input data would always be NCHW for 4D tensors. We need to re-evaluate the workflow and integration to see if we need to support NHWC here.
@TaoLv are you tracking this separately?
Yes. This change looks good to me for now. Two possible follow-ups:
- if the layout parameter is removed in the next major release, possibly we will not support NHWC.
- otherwise, if needed, we will extend the MKL-DNN integration to support NHWC. Currently I don't see any necessity for it.
Do you think I can put the suggestion for removing layout into #9686?
Maybe it's out of the scope of this PR, but I would suggest removing layout parameters from all operator APIs in the next major release. It's the framework that should take care of the performance of the model. We cannot ask users to complicate their models to get a performance benefit from one specific hardware or backend. The change may be a nightmare if they want to switch back to other hardware or backends.
I agree with @TaoLv that a new approach to layout must wait until the next major (API-breaking) release. I look forward to the discussions of how to free users from coding layout considerations into their models and how to allow the framework to make the best backend-specific layout choices. I've attempted to address reviewer comments to date. Requesting further input, e.g. from the additionally listed reviewers @zheng-da @szha @zhreshold. Thanks in advance!
I still want to see some performance numbers for different flag combinations. If there's no visible regression, I think this PR is good to go for me. Then the community should propose a clear solution to resolve the conflict and allow different backends to have different options in the future. We also need to discuss the API change for the next major release.
Thank you @DickJC123. It's approved now.
    // For reference see Fixed Issues section in
    // https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_721.html#rel_721
#if CUDNN_VERSION == 7104
    is_supported = kernel_height <= 8 && kernel_width <= 8;
a CHECK here is probably more informative.
We don't want to error-out here via CHECK() in this Supports() function, since there is a graceful fallback to the CUDA implementation.
@@ -81,7 +81,8 @@ class CuDNNPoolingOp {
   CHECK_EQ(s->dnn_handle_ownership_, mshadow::Stream<gpu>::OwnHandle);
   typename DataType<DType>::ScaleType alpha = 1.0f;
   typename DataType<DType>::ScaleType beta = 0.0f;
-  this->Init(s, in_data, out_data);
+  if (!this->Init(s, in_data, out_data))
+    LOG(FATAL) << "cuDNN Pooling invoked with unsupported parameters.";
e.g. otherwise it's hard to figure out where the problem is from this error message alone.
Thanks for this tip. Updated per your suggestion.
Any other issues you'd like me to look at?
…#13749)
* Adds layout support: mx.sym.Pooling(..., layout='NHWC',...) with tests.
* Docs changes
* Trigger
* Skip NHWC pooling tests on non-cuDNN platforms
* Fix pylint NHWC pooling
* Fixes from review
* Add CuDNNPoolingOp::Supports() in place of Forward()/Backward() bool return.
* Add layout support to cpu implementation of Pooling, with tests.
* Fix cpplint.
* Fix bug in cpu nhwc impl.
* Add MXNet CUDA pooling in NWC, NHWC and NDHWC. Turn on 3D cuDNN pooling. Tests.
* Add PoolingParam::GetLayout() for better default layout handling.
* Fix cpplint.
* Throw exception for quantization pooling not NCHW.
* Expand nhwc pooling test coverage.
* SupportMKLDNNPooling() to examine layout param.
* Compare 'std' and 'v1' pooling versions only when op definitions permit.
* Add pooling test diagnostic output.
* Fix syntax.
* Fix pooling FInplaceOption so it can be shared by all implementations.
* Add missing param definition.
* Fix #if logic.
* Temp switch to DickJC123/mshadow: shows effect of half round-to-nearest on cpu.
* Move back to dmlc/mshadow.git, now with float->half rounding.
* Avoid underflow of lp pooling calc for dtype=float16.
* Remove redundant pooling test.
* Minor variable naming fixes.
* Modify FInplaceOption handling per reviewer comments. Expand testing.
* Correct gluon Pooling layout param description.
* Correct Symbol Pooling description.
* Use 'CHECK(x)' rather than 'if (x) LOG(FATAL)'.
* Empty commit to trigger CI.
Description
This is a continuation of PR #13362 started by @ptrendx titled "Add NHWC layout support to Pooling (cuDNN only)". We're in agreement that I'm the best one to push this work forward to a point of final acceptance.
Based on reviewer comments, I've expanded the support of the newly introduced 'layout' parameter to include the cpu and cuda implementations, not just cudnn. I've also added 3D support in cuDNN and a graceful fallback for MKLDNN. I've re-enabled and expanded the testing of test_operator_gpu.py:test_pooling_versions. I understand and avoid the earlier flakiness of that test, which was due to inclusion of the pooling_v1 operator in some problem domains where its definition differs. The initial commit here is identical to the last one made by @ptrendx for PR #13362. Follow-up commits will add the new functionality and additional description.
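A short usage sketch of the new parameter (shapes are illustrative; on GPU the NHWC path is served by the cuDNN/CUDA implementations added here):

import mxnet as mx

# Pooling with the new layout parameter: the channel axis is last (NHWC).
data = mx.sym.Variable('data')                    # (batch, height, width, channel)
pool = mx.sym.Pooling(data=data, kernel=(2, 2), stride=(2, 2),
                      pool_type='max', layout='NHWC')

# Shape inference keeps channels in the last axis.
_, out_shapes, _ = pool.infer_shape(data=(1, 32, 32, 16))
print(out_shapes[0])    # expected: (1, 16, 16, 16) with this PR applied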