Add bfloat16 floating-point format support based on AMP #17265
Conversation
.gitmodules
Outdated
@@ -6,7 +6,7 @@
 url = /~https://github.com/dmlc/ps-lite
 [submodule "3rdparty/dlpack"]
 path = 3rdparty/dlpack
-url = /~https://github.com/dmlc/dlpack
+url = /~https://github.com/ElaineBao/dlpack.git
Will need to be changed once the changes are merged to upstream dlpack. Leaving this comment as a reminder ;-).
Definitely :) We're working on a PR for the related code in dlpack: dmlc/dlpack#45
kInt16 = 8,
kUint16 = 9,
kUint32 = 10,
kUint64 = 11,
Why add those additional types here? No operator supports them anyway, right?
This is to align the definition with DLPack. Otherwise we would have to reserve those numbers anyway. Even though we don't use them currently, there's no harm in adding them.
MSHADOW_BF16_OPERATOR_TYPE(float, float, OP) \
MSHADOW_BF16_OPERATOR_TYPE(double, double, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int8_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint8_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int32_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint32_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, int64_t, OP) \
MSHADOW_BF16_OPERATOR_TYPE(float, uint64_t, OP)
Returning float or double, while understandable, is different behavior from what is currently done for the half_t type. Could we discuss this and make them consistent?
Sure. Any suggestion here?
if model_name.find('imagenet1k-resnet-152') != -1:
    excluded_sym_names += ['conv0']
elif model_name.find('imagenet1k-inception-bn') != -1:
    excluded_sym_names += ['conv_1']
elif model_name.find('resnet') != -1 and model_name.find('v1') != -1:
    excluded_sym_names += ['resnetv10_conv0_fwd']
elif model_name.find('resnet') != -1 and model_name.find('v2') != -1:
    excluded_sym_names += ['resnetv20_conv0_fwd']
elif model_name.find('vgg') != -1:
    excluded_sym_names += ['vgg0_conv0_fwd']
elif model_name.find('squeezenet1') != -1:
    excluded_sym_names += ['squeezenet0_conv0_fwd']
elif model_name.find('mobilenet') != -1 and model_name.find('v2') == -1:
    excluded_sym_names += ['mobilenet0_conv0_fwd']
elif model_name.find('mobilenet') != -1 and model_name.find('v2') != -1:
    excluded_sym_names += ['mobilenetv20_conv0_fwd']
elif model_name.find('inceptionv3') != -1:
    excluded_sym_names += ['inception30_conv0_fwd']
Why? Is there an accuracy issue without those exclusions?
Not for accuracy, but for performance purposes. This could be removed once more bfloat16-capable hardware becomes available.
Please add a comment explaining this temporary performance workaround; we will convert all conv layers later.
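For readers following the thread, here is a minimal sketch of how such an exclusion list is typically consumed. It assumes the mxnet.contrib.amp conversion API (convert_model with target_dtype and excluded_sym_names) and a hypothetical pre-loaded checkpoint, so treat it as illustrative rather than the script's actual code:

import mxnet as mx
from mxnet.contrib import amp

# Hypothetical: a symbolic checkpoint loaded beforehand, e.g.
# sym, arg_params, aux_params = mx.model.load_checkpoint('imagenet1k-resnet-152', 0)

# Keep the first convolution in fp32 as a temporary performance workaround
# (to be removed once more bfloat16-capable hardware is available).
excluded_sym_names = ['conv0']

bf16_sym, bf16_args, bf16_aux = amp.convert_model(
    sym, arg_params, aux_params,
    target_dtype='bfloat16',
    excluded_sym_names=excluded_sym_names)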
@@ -43,14 +44,17 @@
 from ... import optimizer as opt
 from .loss_scaler import LossScaler

+bfloat16 = np.dtype([('bfloat16', np.uint16)])
Can we have this dtype accessible (as mx.bfloat16 or something similar)?
This is a good topic, and I'd like to have a discussion about it.
Currently, MXNet doesn't have its own type system; it simply uses numpy.dtype. NumPy doesn't natively support bfloat16, so we define bfloat16 as a custom numpy type.
Pros: compatible with the current design; isinstance(bfloat16, np.dtype) returns True.
Cons: bfloat16.name doesn't work; we have to use bfloat16.names[0] instead.
Another solution is to create MXNet's own data type system, like PyTorch and TensorFlow. That is a big API change, so we would prefer to do it when upgrading to MXNet 2.0.
For now, we prefer this approach to enable bfloat16 in MXNet 1.x, and to refactor it in MXNet 2.0.
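For concreteness, a short sketch of the trade-off described above, using nothing beyond the structured dtype already shown in the diff:

import numpy as np

# bfloat16 defined as a custom numpy structured dtype, as in this PR
bfloat16 = np.dtype([('bfloat16', np.uint16)])

# Pro: it is a genuine np.dtype instance, so existing dtype checks keep working.
assert isinstance(bfloat16, np.dtype)

# Con: .name does not report 'bfloat16' (structured dtypes get a generic name),
# so the field name has to be read instead.
print(bfloat16.name)      # generic structured-type name, not 'bfloat16'
print(bfloat16.names[0])  # 'bfloat16'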
if data.dtype == np.dtype([('bfloat16', np.uint16)]):
    assert np.dtype(self.dtype) == data.dtype, \
        "Failed loading Parameter '%s' from saved params: " \
        "dtype incompatible expected %s vs saved %s. " \
        "Set cast_dtype=True to cast the dtype of saved params."%(
            self.name, str(self.dtype), str(data.dtype))
else:
    assert np.dtype(self.dtype).type == data.dtype, \
        "Failed loading Parameter '%s' from saved params: " \
        "dtype incompatible expected %s vs saved %s. " \
        "Set cast_dtype=True to cast the dtype of saved params."%(
            self.name, str(self.dtype), str(data.dtype))
Aren't those 2 codepaths the same?
np.dtype(self.dtype) is different from np.dtype(self.dtype).type
https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.type.html
https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html#numpy.dtype
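To illustrate the distinction being referenced, a plain NumPy example (independent of the Parameter code above):

import numpy as np

# np.dtype(x) returns a dtype object; np.dtype(x).type returns the scalar class.
print(np.dtype('float32'))        # float32  (a numpy.dtype instance)
print(np.dtype('float32').type)   # <class 'numpy.float32'>

# The structured bfloat16 dtype's .type is just numpy.void, which would not
# identify bfloat16, so that branch compares dtype objects directly instead.
bf16 = np.dtype([('bfloat16', np.uint16)])
print(bf16.type)                  # <class 'numpy.void'>
print(np.dtype(bf16) == bf16)     # True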
include/mxnet/ndarray.h
Outdated
 * This creates a new NDArray using f32 with the reordered data.
 * It doesn't affect the data of the original NDArray.
 */
NDArray Reorder2DefaultFp32() const;
Adding a dtype-specific interface looks very ad hoc.
OK, will change it to Reorder2DefaultFloatFormat()
@@ -83,6 +84,7 @@
     5: np.int8,
     6: np.int64,
     7: np.bool_,
+    12: np.dtype([('bfloat16', np.uint16)]),
why 12?
This is to align with the TypeFlag enum defined in mshadow.
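Putting the pieces from this PR together, a sketch of the resulting flag-to-dtype table: entries 5-7 and 12 come from the diff above, 8-11 from the mshadow enum shown earlier, while the 0-4 entries and the dict name follow MXNet's existing mapping and are reproduced here as assumptions:

import numpy as np

# TypeFlag values 8-11 are reserved for the kInt16/kUint16/kUint32/kUint64
# entries added to stay aligned with DLPack, so bfloat16 lands on 12.
_DTYPE_MX_TO_NP = {
    0: np.float32,
    1: np.float64,
    2: np.float16,
    3: np.uint8,
    4: np.int32,
    5: np.int8,
    6: np.int64,
    7: np.bool_,
    12: np.dtype([('bfloat16', np.uint16)]),
}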
For bfloat16 training, what loss scale is recommended? Do we also need to perform NaN checks?
Bfloat16 has the same dynamic range as float32, since they have the same number of exponent bits, so it can represent gradients directly; it doesn't require loss scaling like fp16 does.
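A small NumPy illustration of that point: bfloat16 keeps the upper 16 bits of an IEEE float32 (1 sign, 8 exponent, 7 mantissa bits), so even a plain truncation preserves the magnitude of values that would overflow or underflow in float16. (Real conversions typically round to nearest even; truncation is enough to show the range, and no MXNet API is assumed here.)

import numpy as np

def fp32_to_bf16_bits(x):
    # Keep the upper 16 bits of the float32 representation.
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b):
    # Expand back to float32; the lower mantissa bits are zeroed.
    return (b.astype(np.uint32) << 16).view(np.float32)

grads = np.array([1e-30, 3e-5, 1.0, 1e5, 1e30], dtype=np.float32)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(grads)))
# Magnitudes are preserved across the full float32 range, whereas float16
# would flush 1e-30 to zero and overflow 1e5 past its ~65504 maximum.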
@@ -988,6 +1034,7 @@ struct minimum {
 };
 } // namespace red

+#ifndef __NVCC__
I don't like this. Can we do something similar to what was done for fp16 on CPUs that do not support F16C instructions (i.e., code that runs but may be slower than code for hardware natively supporting bfloat16)?
You can implement atomicAdd (which seems to be the problem you are facing) with atomicCAS, like this: /~https://github.com/apache/incubator-mxnet/blob/master/src/common/cuda_utils.h#L702-L721
We don't have enough background knowledge to enable bfloat16 on the GPU side, so we probably can't make the change you proposed. Alternatively, any code refactoring on the GPU side is welcome; you may change this as you want in a follow-up PR.
@ElaineBao thanks for the explanation
.gitmodules
Outdated
@@ -6,7 +6,7 @@
 url = /~https://github.com/dmlc/ps-lite
 [submodule "3rdparty/dlpack"]
 path = 3rdparty/dlpack
-url = /~https://github.com/dmlc/dlpack
+url = /~https://github.com/dmlc/dlpack.git
Keep it the same.
OK
src/operator/nn/fully_connected.cc
Outdated
@@ -237,7 +238,7 @@ static bool BackwardFCStorageType(const nnvm::NodeAttrs& attrs,
   bool dispatched = false;
   if (!dispatched && common::ContainsOnlyStorage(*in_attrs, mxnet::kDefaultStorage)) {
     dispatched = storage_type_assign(out_attrs, mxnet::kDefaultStorage,
-                                     dispatch_mode, DispatchMode::kFCompute);
+                                     dispatch_mode, DispatchMode::kFComputeEx);
We may need to enable DNNL FC backward in another PR since there is a known issue.
Thanks for the reminder.
@ptrendx thanks for your review. Feel free to let me know if you have any other concerns; we are going to merge this PR soon.
Hi @leezu, @larroy, this PR passes the ARM builds but always hits the test timeout, and we don't have an environment to reproduce it. Could you please take a look, or do you have any suggestions for further debugging?
@@ -0,0 +1,167 @@
/*!
@szha Do we need an Apache license header for this new file?
yes
Done, thanks.
@larroy Do you have any idea how to display more logs for the edge tests? They consistently fail at this stage.
@TaoLv @ZhennanQin @ciyongch @eric-haibin-lin @szha please take a final review of this PR.
If there are no further comments, I will merge the PR tomorrow :)
I am merging now. If there are any other comments, we can resolve them in a new PR.
* Add Bfloat16
* mshadow support bf16
* rebase bf16 mkldnn1.0
* support bf16 gemm
* resolve fp32 ip bwd bug
* add other bf16 ops
* change func name from fp16 to lp16 (low precision 16), to include bf16
* add amp_cast bf16 support for ndarray
* fix executor copy_params
* add test case for bf16
* remove numpy dtype hook for bf16
* add bf16 type support
* rebase to mxnet master
* add single conv test
* fix symbolic inference
* add dtype check when copy
* add single conv and bn test
* skip fp16 amp_cast test in cpu
* Fix resnet50 first convolution
* Skip first convolution for bfloat16
* support bf16 fallback compute
* recover origin test
* add some bf16 unittests
* fix bf16 bn test, enhance assert_almost_equal_with_err
* using assert_almost_equal_with_err for fallback bn test
* add relu6 bf16 support
* fix lint
* fix subgraph conv with data=0
* mkldnn doesn't support 0 dim tensor
* rm dtype check when copy
* using bf16 tvm
* rm bf16 mnist demo
* use official tvm
* change function name; fix lint error
* fix clang check error:conditional expression is ambiguous; 'float' can be converted to 'mshadow::bfloat::bf16_t' and vice versa
* nvcc compiler build pass
* fix gpu amp cast symbol error
* fix mnist training error
* fix cpp test: Engine.VarVersion error
* workaround cpp failed test mkldnn fc bwd
* to fix mkldnn test_mkldnn_ndarray_slice error
* 1. move some code from to np_broadcast_reduce_op_value.cc to np_broadcast_reduce_op_value_part2.cc to pass Win CPU/GPU build (fatal error C1002: compiler is out of heap space in pass 2) 2. rm debug code
* use official dlpack
* rename np_broadcast_reduce_op_value_part2.cc and add some description
* 1. update dlpack url in .gitmodule 2. disable mkldnn fc bwd
* fix remaining NodePtr due to tvm update
* mv some code from mxnet_op.h to mxnet_op_kernel_assign.h to avoid WIN compiler error 'fatal error C1002: compiler is out of heap space in pass 2'
* fix WIN CPU build fail:compiler is out of heap space in pass 2
* fix WIN build fail
* fix lint
* add print for test bf16_concat
* fix bf16 test fail
* disable bf16 concat test
* tmp skip to root cause edge test halt
* fix bf16_bn test error
* enable test_bulk
* tmp rm bf16 to locate edge error
* Revert "tmp rm bf16 to locate edge error" (this reverts commit 7360246)
* add Apache license header
* trigger CI
* add robust for test bf16 bn

Co-authored-by: Zhennan Qin <zhennan.qin@intel.com>
Co-authored-by: YixinBao <yixin.bao@intel.com>
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com>
Co-authored-by: Wuxun Zhang <wuxun.zhang@intel.com>
Description
Bfloat16 is widely used in deep learning, especially in training, to get better performance.
This PR adds bf16 support based on the MXNet AMP (Automatic Mixed Precision) module.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Comments
This PR has passed unit tests and pre-CI tests on a local machine.
Unit tests are added for this PR.
@ZhennanQin @ElaineBao @xinyu-intel @TaoLv @PatricZhao