Add NHWC layout support to Pooling (cpu, gpu cuda, gpu cuDNN) #13749
Conversation
@mxnet-label-bot add [pr-awaiting-review]
src/operator/nn/pool_utils.h
Outdated
  }
};

template<typename DType>
struct lp_grad<DType, 3> {
  static MSHADOW_XINLINE DType Map(const DType grad, const DType in_data, const DType out_data) {
-    return grad * in_data * in_data / (out_data * out_data);
+    // Avoid nan result if both grad and out_data are 0.
+    return (grad == DType(0.0)) ? DType(0.0) : grad * in_data * in_data / (out_data * out_data);
Why not also check out_data == DType(0.0) here if it can be zero, to avoid overflow?
I introduced this to quiet some nans that were causing test failures. I'm reviewing the specifics to see if a better solution exists as suggested in your comment, so stay tuned.
I've pushed my solution to your comment in commit 098bc49.
The grad == 0.0 check that you highlighted only worked because of a quirk of our check_consistency() routine in test_utils.py, which uses the symbol's forward() output as the gradient. Per your suggestion, I'm now checking out_data == 0 as the more general way of quieting the test failures.
The test failures I was seeing often occurred in float16 lp-3 pooling. For example, take the case where a pooling window of 2 covers identical inputs 2^-9 and 2^-9. The forward output for this case is the cube root of (2^-9)^3 + (2^-9)^3. If this is calculated in float16, the 2^-27 terms underflow to 0 and the output is 0. The backward output is then grad * 2^-9 * 2^-9 / (0 * 0) = +inf (or nan if grad is also 0). When performed in float32, no underflow occurs in the forward op, and the +infs are avoided in the backward op.
My conclusion: float16 is ill-equipped to perform the forward pooling calculation for lp-2 and lp-3. Part of my solution here thus involves promoting the calculation to float32 for the cpu and mxnet cuda implementations of float16-i/o pooling. This is consistent with other operators like float16-i/o Convolution and BatchNorm, which perform their internal calculations in float32. I've run test_pooling_versions() thousands of times with no failures in this mode.
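To make the arithmetic above concrete, here is a minimal numpy sketch of the same underflow (illustrative only, not the actual MXNet kernel code; the float32 promotion mirrors what the fix does internally):

import numpy as np

# lp-3 pooling over a window of two identical inputs 2**-9, computed in float16.
x = np.float16(2.0 ** -9)
cube_sum = x * x * x + x * x * x                  # each 2**-27 term underflows to 0.0
fwd_fp16 = np.cbrt(cube_sum)                      # forward output: 0.0
grad = np.float16(1.0)
bwd_fp16 = grad * x * x / (fwd_fp16 * fwd_fp16)   # inf (nan if grad were also 0)

# Promoting the accumulation to float32 avoids the underflow entirely.
x32 = np.float32(x)
fwd_fp32 = np.cbrt(x32 ** 3 + x32 ** 3)           # ~0.00246, finite
bwd_fp32 = np.float32(grad) * x32 * x32 / (fwd_fp32 * fwd_fp32)   # finite, ~0.63

print(fwd_fp16, bwd_fp16)   # 0.0 inf
print(fwd_fp32, bwd_fp32)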
src/operator/nn/pooling.cc
Outdated
-  return std::vector<std::pair<int, int> >();
-#else
-  return std::vector<std::pair<int, int> >{{1, 0}};
+#if MXNET_USE_MKLDNN == 1 && MXNET_USE_CUDA == 0 && MXNET_USE_CUDNN == 0
Ummm, do you have a case to validate this scenario? If MXNet is built with both MKL-DNN and cuDNN enabled (e.g. the mxnet-cu90mkl package) and the user gives ctx=cpu, is there any functionality or performance regression?
So unfortunately, all enabled backend operator implementations must live with the same FInplaceOptions. I voiced concern about this restriction to @eric-haibin-lin and he graciously put some thought into #13598. The issue has seen no activity in a month, but I'd like to see it reawakened.
Until then, the code you highlight enforces that the most conservative "least common denominator" FInplaceOptions (i.e. none) are in effect for the enabled Pooling implementations. Since an operator cannot rely on the in-place sharing actually occurring, there should be no functional regression. And performance-wise, a user must understand that, at this point, enabling multiple backends may have some performance consequences due to this FInplaceOption implementation.
Sorry for any confusion. Perhaps my comments in the code for FInplaceOption speak to the issue more directly:
// Since this routine is not provided the cpu/gpu context info, only in the case
// where CUDA and CUDNN implementations are not available can we be sure the MKLDNN
// implementation will be employed. The MKLDNN FInplaceOptions are not compatible
// with the other (i.e. cpu, cuda and cudnn) implementations.
Thanks for the explanation. I understand the current limitation here.
My concern is a performance regression on the CPU side when the mxnet-cuXXmkl binary is used, which is actually the widely used binary (e.g. by Sockeye, GluonNLP, GluonCV, etc.).
Until issue #13598 is resolved (btw, we're working on this now), I think we need to keep this line as-is.
mxnet-cuXX-mkl pip binary statically links cudnn and will not be affected. @szha correct me if I am wrong.
The potentially affected cases are:
- build mxnet with mkldnn and cuda, but without cudnn
- build mxnet with neither mkldnn nor cuda.
As this in-place option only matters for the backward pass during training, I see the two cases above as rare. Therefore, I do not think this is a blocking issue for this PR.
Nonetheless, I suggest @DickJC123 create a GitHub issue for this change so that people are aware and it can be tracked later, and preferably run some performance testing to get a rough idea how big an impact it has for the above two cases.
Let's continue discussions on #13598 to avoid such problems in the future.
I originally set out to create a test case (and separate issue) to demonstrate the need for the restrictive #if clause I put around the MKLDNN FInplaceOption. I was unable to do so, and in the process I came to realize that I had misunderstood the meaning of the {1,0} option. In fact, it allows the framework to share the backward '1' input (the unused workspace gradient) with the backward '0' output (the pooling input gradient) when the additional MKLDNN workspace is present. This sharing is compatible with the CUDA and cuDNN Pooling implementations, which do not use a workspace or its gradient! I'm therefore reverting back to a simple code guard around the MKLDNN FInplaceOption, one whose logic mirrors the other MKLDNN/no-MKLDNN switches of behavior within nn/pooling.cc. No new issue is required. With my last commit I've also expanded the pooling tests to cover cases that have the in-place tensor sharing, which requires the pooling input to be the same size as the output (and also then the workspace).
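As an illustration of the "input same size as output" precondition (a hedged sketch with hypothetical shapes, using standard Pooling parameters):

import mxnet as mx

# A 3x3 kernel with stride 1 and pad 1 keeps H and W unchanged, so the pooled
# output (and the MKLDNN workspace, when present) matches the input shape --
# the case in which the in-place sharing described above can actually occur.
x = mx.nd.random.uniform(shape=(1, 8, 16, 16))    # hypothetical NCHW input
y = mx.nd.Pooling(data=x, kernel=(3, 3), stride=(1, 1), pad=(1, 1),
                  pool_type='max')
print(x.shape, y.shape)    # both (1, 8, 16, 16)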
Good to know MKLDNN impl is not affected
@@ -104,7 +104,8 @@ class MKLDNNPoolingBwd {
 inline bool SupportMKLDNNPooling(const PoolingParam &param) {
   return param.kernel.ndim() == 2 &&
          (param.pool_type == pool_enum::kMaxPooling ||
-          param.pool_type == pool_enum::kAvgPooling);
+          param.pool_type == pool_enum::kAvgPooling) &&
+         (!param.layout.has_value() || param.layout.value() == mshadow::kNCHW);
Thank you for changing this. Just to clarify, the MKL-DNN pooling primitive does support the NHWC input format. But when we integrated the MKL-DNN backend into MXNet, we assumed that user input data would always be NCHW for 4D tensors. We need to re-evaluate the workflow and integration to see if we need to support NHWC here.
@TaoLv are you tracking this separately?
Yes. This change looks good to me for now. Two possible follow-ups:
- if the layout parameter is removed in the next major release, possibly we will not support NHWC.
- otherwise, if needed, we will extend the MKL-DNN integration to support NHWC. Currently I don't see any necessity for it.
Do you think I can put the suggestion for removing layout into #9686?
Maybe it's out of the scope of this PR, but I would suggest removing layout parameters from all operator APIs in the next major release. It's the framework that should take care of the performance of the model. We cannot ask users to complicate their models to get a performance benefit from one specific hardware or backend. The change may be a nightmare if they want to switch back to other hardware or backends.
I agree with @TaoLv that a new approach to layout must wait until the next major (API-breaking) release. I look forward to the discussions of how to free users from coding layout considerations into their models and how to allow the framework to make the best backend-specific layout choices. I've attempted to address reviewer comments to date. Requesting further input, e.g. from the additionally listed reviewers @zheng-da @szha @zhreshold. Thanks in advance!
I still want to see some performance numbers for different flag combinations. If there's no visible regression, I think this PR is good to go for me. Then the community should propose a clear solution to resolve the conflict and allow different backends to have different options in the future. We also need to discuss the API change for the next major release.
Thank you @DickJC123. It's approved now.
    // For reference see Fixed Issues section in
    // https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_721.html#rel_721
#if CUDNN_VERSION == 7104
    is_supported = kernel_height <= 8 && kernel_width <= 8;
a CHECK here is probably more informative.
We don't want to error-out here via CHECK() in this Supports() function, since there is a graceful fallback to the CUDA implementation.
@@ -81,7 +81,8 @@ class CuDNNPoolingOp {
   CHECK_EQ(s->dnn_handle_ownership_, mshadow::Stream<gpu>::OwnHandle);
   typename DataType<DType>::ScaleType alpha = 1.0f;
   typename DataType<DType>::ScaleType beta = 0.0f;
-  this->Init(s, in_data, out_data);
+  if (!this->Init(s, in_data, out_data))
+    LOG(FATAL) << "cuDNN Pooling invoked with unsupported parameters.";
e.g. otherwise it's hard to figure out where the problem is from this error message alone.
Thanks for this tip. Updated per your suggestion.
Any other issues you'd like me to look at?
…#13749)
* Adds layout support: mx.sym.Pooling(..., layout='NHWC',...) with tests.
* Docs changes
* Trigger
* Skip NHWC pooling tests on non-cuDNN platforms
* Fix pylint NHWC pooling
* Fixes from review
* Add CuDNNPoolingOp::Supports() in place of Forward()/Backward() bool return.
* Add layout support to cpu implementation of Pooling, with tests.
* Fix cpplint.
* Fix bug in cpu nhwc impl.
* Add MXNet CUDA pooling in NWC, NHWC and NDHWC. Turn on 3D cuDNN pooling. Tests.
* Add PoolingParam::GetLayout() for better default layout handling.
* Fix cpplint.
* Throw exception for quantization pooling not NCHW.
* Expand nhwc pooling test coverage.
* SupportMKLDNNPooling() to examine layout param.
* Compare 'std' and 'v1' pooling versions only when op definitions permit.
* Add pooling test diagnostic output.
* Fix syntax.
* Fix pooling FInplaceOption so it can be shared by all implementations.
* Add missing param definition.
* Fix #if logic.
* Temp switch to DickJC123/mshadow: shows effect of half round-to-nearest on cpu.
* Move back to dmlc/mshadow.git, now with float->half rounding.
* Avoid underflow of lp pooling calc for dtype=float16.
* Remove redundant pooling test.
* Minor variable naming fixes.
* Modify FInplaceOption handling per reviewer comments. Expand testing.
* Correct gluon Pooling layout param description.
* Correct Symbol Pooling description.
* Use 'CHECK(x)' rather than 'if (x) LOG(FATAL)'.
* Empty commit to trigger CI.
Description
This is a continuation of PR #13362 started by @ptrendx titled "Add NHWC layout support to Pooling (cuDNN only)". We're in agreement that I'm the best one to push this work forward to a point of final acceptance.
Based on reviewer comments, I've expanded the support of the newly introduced 'layout' parameter to include the cpu and cuda implementations, not just cudnn. I've also added 3D support in cuDNN and a graceful fallback for MKLDNN. I've re-enabled and expanded the testing of test_operator_gpu.py:test_pooling_versions. I understand and avoid the earlier flakiness of that test, which was due to inclusion of the pooling_v1 operator in some problem domains where its definition differs. The initial commit here is identical to the last one made by @ptrendx for PR #13362. Follow-up commits will add the new functionality and additional description.
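A short usage sketch of the new parameter (shapes are illustrative; on GPU the NHWC path is served by the cuDNN/CUDA implementations added here):

import mxnet as mx

# Pooling with the new layout parameter: the channel axis is last (NHWC).
data = mx.sym.Variable('data')                    # (batch, height, width, channel)
pool = mx.sym.Pooling(data=data, kernel=(2, 2), stride=(2, 2),
                      pool_type='max', layout='NHWC')

# Shape inference keeps channels in the last axis.
_, out_shapes, _ = pool.infer_shape(data=(1, 32, 32, 16))
print(out_shapes[0])    # expected: (1, 16, 16, 16) with this PR applied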