Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add bfloat16 floating-point format support based on AMP #17265

Merged
merged 61 commits into from
Feb 16, 2020
Merged

Add bfloat16 floating-point format support based on AMP #17265

merged 61 commits into from
Feb 16, 2020

Conversation

rongzha1
Copy link
Contributor

Description

Bfloat16 is wildly used in Deep Learning especially on training to get a better performance.

This PR is to add bf16 support based on mxnet AMP(Automatic Mixed Precision) module .

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • [done ] Changes are complete (i.e. I finished coding on this PR)
  • [done ] All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [ done] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

This PR has passed unitest and preci test in local machine.
Unit tests are added for this PR.

@ZhennanQin @ElaineBao @xinyu-intel @TaoLv @PatricZhao

.gitmodules Outdated
@@ -6,7 +6,7 @@
url = /~https://github.com/dmlc/ps-lite
[submodule "3rdparty/dlpack"]
path = 3rdparty/dlpack
url = /~https://github.com/dmlc/dlpack
url = /~https://github.com/ElaineBao/dlpack.git
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to be changed once the changes are merged to upstream dlpack. Leaving this comment as a reminder ;-).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely :) We're working on PR the related code in dlpack dmlc/dlpack#45

Comment on lines +341 to +344
kInt16 = 8,
kUint16 = 9,
kUint32 = 10,
kUint64 = 11,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding those additional types here? No operator supports them anyway, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to align the definition with DLPack. Otherwise we have to preserve those numbers. Even through we don't use them currently, it's no harm to add them.

Comment on lines +29 to +55
MSHADOW_BF16_OPERATOR_TYPE(float, float, OP) \
MSHADOW_BF16_OPERATOR_TYPE(double, double, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int8_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint8_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int32_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint32_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int64_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint64_t, OP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning float or double, while understandable, is a different behavior to the one currently done for half_t type. Could we discuss this and make them consistent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Any suggestion here?

Comment on lines +105 to +122
if model_name.find('imagenet1k-resnet-152') != -1:
excluded_sym_names += ['conv0']
elif model_name.find('imagenet1k-inception-bn') != -1:
excluded_sym_names += ['conv_1']
elif model_name.find('resnet') != -1 and model_name.find('v1') != -1:
excluded_sym_names += ['resnetv10_conv0_fwd']
elif model_name.find('resnet') != -1 and model_name.find('v2') != -1:
excluded_sym_names += ['resnetv20_conv0_fwd']
elif model_name.find('vgg') != -1:
excluded_sym_names += ['vgg0_conv0_fwd']
elif model_name.find('squeezenet1') != -1:
excluded_sym_names += ['squeezenet0_conv0_fwd']
elif model_name.find('mobilenet') != -1 and model_name.find('v2') == -1:
excluded_sym_names += ['mobilenet0_conv0_fwd']
elif model_name.find('mobilenet') != -1 and model_name.find('v2') != -1:
excluded_sym_names += ['mobilenetv20_conv0_fwd']
elif model_name.find('inceptionv3') != -1:
excluded_sym_names += ['inception30_conv0_fwd']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Is there an accuracy issue without those exclusions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for accuracy, but for performance purpose. This could be removed once more bfloat16 hardware added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment for this temp performance solution and will convert all conv layers later.

@@ -43,14 +44,17 @@
from ... import optimizer as opt
from .loss_scaler import LossScaler

bfloat16 = np.dtype([('bfloat16', np.uint16)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this dtype accessible (as mx.bfloat16 or something similar)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good topic, and I want to have a discussion for this.
Currently, MXNet doesn't have its own type system. It's simply using Numpy.dtype. Numpy doesn't natively support bfloat16, so we define bfloat16 as a numpy customized type.
Pros: compatible with current design, isinstance(bfloat16, np.dtype) could return True.
cons: bfloat16.name doesn't work, have to use bfloat16.names[0] instead.
Another solution is, creating MXNet's own data type system, just like pytorch and tf. This is a big API change, so we wish this can be done when upgrading to MXNet 2.0.

Currently, we prefer this approach to enable bfloat16 in MXNet 1.x, and refactor it in MXNet 2.0.

Comment on lines +292 to +302
if data.dtype == np.dtype([('bfloat16', np.uint16)]):
assert np.dtype(self.dtype) == data.dtype, \
"Failed loading Parameter '%s' from saved params: " \
"dtype incompatible expected %s vs saved %s. " \
"Set cast_dtype=True to cast the dtype of saved params."%(
self.name, str(self.dtype), str(data.dtype))
else:
assert np.dtype(self.dtype).type == data.dtype, \
"Failed loading Parameter '%s' from saved params: " \
"dtype incompatible expected %s vs saved %s. " \
"Set cast_dtype=True to cast the dtype of saved params."%(
self.name, str(self.dtype), str(data.dtype))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't those 2 codepaths the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* This creates a new NDArray using f32 with the reordered data.
* It doesn't affect the data of the original NDArray.
*/
NDArray Reorder2DefaultFp32() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding dtype specific interface looks very ad-hoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will change it to Reorder2DefaultFloatFormat()

@@ -83,6 +84,7 @@
5: np.int8,
6: np.int64,
7: np.bool_,
12: np.dtype([('bfloat16', np.uint16)]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 12?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to align the TypeFlag defined in mshadow

src/engine/naive_engine.cc Show resolved Hide resolved
Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For bfloat16 training, which loss scalar is recommended? Do we also need to perform NaN checks?

@ElaineBao
Copy link
Contributor

For bfloat16 training, which loss scalar is recommended? Do we also need to perform NaN checks?

Bfloat16 has the same dynamic range as float32, since they have the same exponent bits. So it can represent gradients directly, it doesn't require loss scaling like fp16.

@@ -988,6 +1034,7 @@ struct minimum {
};
} // namespace red

#ifndef __NVCC__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like it - can we make a similar thing as in the case of fp16 for CPUs that did not support F16C instructions (i.e. code that runs but may be slower than the code for hardware natively supporting bfloat16)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can implement atomicAdd (which seems to be the problem you are facing)with atomicCAS like this: /~https://github.com/apache/incubator-mxnet/blob/master/src/common/cuda_utils.h#L702-L721

Copy link
Contributor

@ZhennanQin ZhennanQin Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have enough background / knowledge to enable Bfloat16 on GPU side. So probably we can't make the change you proposed. Alternately, any code refactoring on GPU side is welcome. you may change this as you want in following PR.

@eric-haibin-lin
Copy link
Member

@ElaineBao thanks for the explanation

.gitmodules Outdated
@@ -6,7 +6,7 @@
url = /~https://github.com/dmlc/ps-lite
[submodule "3rdparty/dlpack"]
path = 3rdparty/dlpack
url = /~https://github.com/dmlc/dlpack
url = /~https://github.com/dmlc/dlpack.git
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@@ -237,7 +238,7 @@ static bool BackwardFCStorageType(const nnvm::NodeAttrs& attrs,
bool dispatched = false;
if (!dispatched && common::ContainsOnlyStorage(*in_attrs, mxnet::kDefaultStorage)) {
dispatched = storage_type_assign(out_attrs, mxnet::kDefaultStorage,
dispatch_mode, DispatchMode::kFCompute);
dispatch_mode, DispatchMode::kFComputeEx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may need to enable dnnl fc bwd in another PR since there is an known issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reminder

@pengzhao-intel
Copy link
Contributor

@ptrendx thanks for your review. Feel free to let me know if you have other concerns, we are going to merge PR recently.

@TaoLv
Copy link
Member

TaoLv commented Feb 14, 2020

Hi @leezu, @larroy, this PR can pass the builds for ARM but always hit the time out of the test.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fedge/detail/PR-17265/29/pipeline

We don't have any environment to reproduce. Could you please take a look or have any suggestion for further debugging?

@@ -0,0 +1,167 @@
/*!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szha Do we need Apache license header for this new file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done thanks

@zhreshold
Copy link
Member

@larroy Do you have idea how to display more logs for the edge tests? It consistently fail at this stage.

Copy link
Contributor

@pengzhao-intel pengzhao-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TaoLv @ZhennanQin @ciyongch @eric-haibin-lin @szha please take a final review for this PR.

If no further comments, I will merge the PR in tomorrow :)

@pengzhao-intel
Copy link
Contributor

I am merging now. If there're any other comments, we can resolve by new PR.

@pengzhao-intel pengzhao-intel merged commit 065d48e into apache:master Feb 16, 2020
zheyuye pushed a commit to zheyuye/incubator-mxnet that referenced this pull request Feb 19, 2020
* Add Bfloat16

* mshadow support bf16

* rebase bf16 mkldnn1.0

* support bf16 gemm

* resolve fp32 ip bwd bug

* add other bf16 ops

* change func name from fp16 to lp16 (low precision 16), to include bf16

* add amp_cast bf16 support for ndarray

* fix executor copy_params

* add test case for bf16

* remove numpy dtype hook for bf16

* add bf16 type support

* rebase to mxnet master

* add single conv test

* fix symbolic inference

* add dtype check when copy

* add single conv and bn test

* skip fp16 amp_cast test in cpu

* Fix resnet50 first convolution

* Skip first convolution for bfloat16

* support bf16 fallback compute

* recover origin test

* add some bf16 unittests

* fix bf16 bn test, enhance assert_almost_equal_with_err

* using assert_almost_equal_with_err for fallback bn test

* add relu6 bf16 support

* fix lint

* fix subgraph conv with data=0

* mkldnn doesn't support 0 dim tensor

* rm dtype check when copy

* using bf16 tvm

* rm bf16 mnist demo

* use official tvm

* change function name; fix lint error

* fix clang check error:conditional expression is ambiguous; 'float' can be converted to 'mshadow::bfloat::bf16_t' and vice versa

* nvcc compiler build pass

* fix gpu amp cast symbol error

* fix mnist training error

* fix cpp test: Engine.VarVersion error

* workaround cpp failed test mkldnn fc bwd

* to fix mkldnn test_mkldnn_ndarray_slice error

* 1. move some code from to np_broadcast_reduce_op_value.cc to np_broadcast_reduce_op_value_part2.cc to pass Win CPU/GPU build (fatal error C1002: compiler is out of heap space in pass 2)
2. rm debug code

* use official dlpack

* rename np_broadcast_reduce_op_value_part2.cc and add some description

* 1. update dlpack url in .gitmodule
2. disable mkldnn fc bwd

* fix remaining NodePtr due to tvm update

* mv some code from mxnet_op.h to mxnet_op_kernel_assign.h to avoid WIN compiler error 'fatal error C1002: compiler is out of heap space in pass 2'

* fix WIN CPU build fail:compiler is out of heap space in pass 2

* fix WIN build fail

* fix lint

* add print for test bf16_concat

* fix bf16 test fail

* disable bf16 concat test

* tmp skip to root cause edge test halt

* fix bf16_bn test error

* enable test_bulk

* tmp rm bf16 to locate edge error

* Revert "tmp rm bf16 to locate edge error"

This reverts commit 7360246.

* add Apache license header

* trigger CI

* add robust for test bf16 bn

Co-authored-by: Zhennan Qin <zhennan.qin@intel.com>
Co-authored-by: YixinBao <yixin.bao@intel.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: Wuxun Zhang <wuxun.zhang@intel.com>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
* Add Bfloat16

* mshadow support bf16

* rebase bf16 mkldnn1.0

* support bf16 gemm

* resolve fp32 ip bwd bug

* add other bf16 ops

* change func name from fp16 to lp16 (low precision 16), to include bf16

* add amp_cast bf16 support for ndarray

* fix executor copy_params

* add test case for bf16

* remove numpy dtype hook for bf16

* add bf16 type support

* rebase to mxnet master

* add single conv test

* fix symbolic inference

* add dtype check when copy

* add single conv and bn test

* skip fp16 amp_cast test in cpu

* Fix resnet50 first convolution

* Skip first convolution for bfloat16

* support bf16 fallback compute

* recover origin test

* add some bf16 unittests

* fix bf16 bn test, enhance assert_almost_equal_with_err

* using assert_almost_equal_with_err for fallback bn test

* add relu6 bf16 support

* fix lint

* fix subgraph conv with data=0

* mkldnn doesn't support 0 dim tensor

* rm dtype check when copy

* using bf16 tvm

* rm bf16 mnist demo

* use official tvm

* change function name; fix lint error

* fix clang check error:conditional expression is ambiguous; 'float' can be converted to 'mshadow::bfloat::bf16_t' and vice versa

* nvcc compiler build pass

* fix gpu amp cast symbol error

* fix mnist training error

* fix cpp test: Engine.VarVersion error

* workaround cpp failed test mkldnn fc bwd

* to fix mkldnn test_mkldnn_ndarray_slice error

* 1. move some code from to np_broadcast_reduce_op_value.cc to np_broadcast_reduce_op_value_part2.cc to pass Win CPU/GPU build (fatal error C1002: compiler is out of heap space in pass 2)
2. rm debug code

* use official dlpack

* rename np_broadcast_reduce_op_value_part2.cc and add some description

* 1. update dlpack url in .gitmodule
2. disable mkldnn fc bwd

* fix remaining NodePtr due to tvm update

* mv some code from mxnet_op.h to mxnet_op_kernel_assign.h to avoid WIN compiler error 'fatal error C1002: compiler is out of heap space in pass 2'

* fix WIN CPU build fail:compiler is out of heap space in pass 2

* fix WIN build fail

* fix lint

* add print for test bf16_concat

* fix bf16 test fail

* disable bf16 concat test

* tmp skip to root cause edge test halt

* fix bf16_bn test error

* enable test_bulk

* tmp rm bf16 to locate edge error

* Revert "tmp rm bf16 to locate edge error"

This reverts commit 7360246.

* add Apache license header

* trigger CI

* add robust for test bf16 bn

Co-authored-by: Zhennan Qin <zhennan.qin@intel.com>
Co-authored-by: YixinBao <yixin.bao@intel.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: Wuxun Zhang <wuxun.zhang@intel.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.