[BUGFIX]try avoid the error in operator/tensor/amp_cast.h #20188
I'm trying to avoid the error generated by AMP when using bfloat16. The error is:

```
/me/prog/prog-amp.py:77: UserWarning: All children of this Sequential layer 'compose1_' are HybridBlocks. Consider using HybridSequential for the best performance.
  transform_test.hybridize(static_alloc=True,static_shape=True)
Traceback (most recent call last):
  File "/me/prog/prog-amp.py", line 359, in <module>
    loss0 = loss_fn(output, label)
  File "/me/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 314, in __mul__
    return multiply(self, other)
  File "/me/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 3757, in multiply
    return _ufunc_helper(
  File "/me/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 3576, in _ufunc_helper
    return fn_array(lhs, rhs)
  File "/me/incubator-mxnet/python/mxnet/contrib/amp/amp.py", line 109, in _new_fun
    return f(*args, **kwargs)
  File "<string>", line 52, in broadcast_mul
  File "/me/incubator-mxnet/python/mxnet/_ctypes/ndarray.py", line 82, in _imperative_invoke
    check_call(_LIB.MXImperativeInvokeEx(
  File "/me/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "/me/incubator-mxnet/src/io/../operator/elemwise_op_common.h", line 135
MXNetError: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node at 1-th input: expected bfloat16, got float32
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/me/incubator-mxnet/python/mxnet/base.py", line 587, in _notify_shutdown
    check_call(_LIB.MXNotifyShutdown())
  File "/me/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "/me/incubator-mxnet/src/operator/tensor/./amp_cast.h", line 136
MXNetError: Unknown type enum 12
```

This was tested on MXNet v1.x, but it seems to affect v2.0 as well. Since 30-series RTX cards support bfloat16, there is no need to disable it explicitly with `#ifndef __NVCC__`. I don't know whether this change works, but things could not be worse.
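To illustrate how a guarded branch can surface as `Unknown type enum 12`, here is a minimal Python model of a dtype switch. It is a sketch under stated assumptions, not MXNet's code: the flag values are illustrative (12 for bfloat16 is inferred from the traceback), and `build_type_switch` and `bf16_enabled` are hypothetical names standing in for a compile-time guard such as `#ifndef __NVCC__`.

```python
# Hypothetical dtype flags; 12 for bfloat16 is an assumption taken from the
# "Unknown type enum 12" message in the traceback.
FLOAT32, FLOAT16, BFLOAT16 = 0, 2, 12


def build_type_switch(bf16_enabled):
    """Return a dispatcher that maps a dtype flag to a type name.

    `bf16_enabled=False` models a build where the bfloat16 branch is
    compiled out (for example behind `#ifndef __NVCC__`).
    """
    branches = {FLOAT32: "float32", FLOAT16: "float16"}
    if bf16_enabled:
        branches[BFLOAT16] = "bfloat16"

    def dispatch(type_flag):
        if type_flag not in branches:
            # Plays the role of the default case in a C++ switch statement.
            raise ValueError(f"Unknown type enum {type_flag}")
        return branches[type_flag]

    return dispatch


# One switch with the bfloat16 branch removed, one with it present.
guarded_build = build_type_switch(bf16_enabled=False)
unguarded_build = build_type_switch(bf16_enabled=True)

print(unguarded_build(BFLOAT16))
try:
    guarded_build(BFLOAT16)
except ValueError as err:
    print(err)
```

In this model, removing the guard only makes the switch recognize the flag. It does not by itself give any operator a bfloat16 implementation.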
Hey @Neutron3529 , Thanks for submitting the PR
CI supported jobs: [centos-gpu, unix-cpu, unix-gpu, clang, website, windows-cpu, edge, sanity, centos-cpu, windows-gpu, miscellaneous] Note:
@mxnet-bot run ci [unix-gpu]
@mxnet-bot run ci [unix-gpu, unix-cpu, centos-cpu]
Jenkins CI successfully triggered : [unix-gpu, unix-cpu, centos-cpu]
@mxnet-bot run ci [unix-gpu, centos-cpu]
Jenkins CI successfully triggered : [centos-cpu, unix-gpu]
@Neutron3529 thank you!
You're welcome. Since I am in China, it is not very convenient for me to visit GitHub. This commit does not work, at least not with convolution layers.
* try avoid the error in operator/tensor/amp_cast.h
* forgive my garbage coding, I'm not a computer scientist
* revert all the modification of base.h

Co-authored-by: Neutron3529 <qweytr1@mail.ustc.edu.cn>
Description
Code that uses AMP with bfloat16 fails in previous versions of MXNet; here I provide a workaround.
Further changes for bf16 support are needed.
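The first failure in the traceback comes from a dtype consistency check on an elementwise operator (`broadcast_mul`). A rough Python model of that kind of check, assuming every input must share one dtype, is sketched below. `infer_elemwise_dtype` is a hypothetical helper for illustration, not an MXNet function.

```python
def infer_elemwise_dtype(input_dtypes):
    """Return the common dtype of all inputs, or raise on a mismatch.

    `None` stands for a dtype that is not yet known.
    """
    expected = None
    for i, dtype in enumerate(input_dtypes):
        if dtype is None:
            continue  # Unknown dtypes impose no constraint.
        if expected is None:
            expected = dtype  # First known dtype sets the expectation.
        elif dtype != expected:
            raise TypeError(
                f"Incompatible attr in node at {i}-th input: "
                f"expected {expected}, got {dtype}"
            )
    return expected


# One bfloat16 operand and one float32 operand fail the check.
try:
    infer_elemwise_dtype(["bfloat16", "float32"])
except TypeError as err:
    print(err)

# When both operands share a dtype, the check passes.
print(infer_elemwise_dtype(["bfloat16", "bfloat16"]))
```

Under this model, the mismatch arises whenever only one operand of the multiplication has been cast to bfloat16 while the other stays float32.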
Checklist
Essentials
Changes
Comments
Actually, this PR does nothing. Further bf16 support (including for the very important `convolution` operator) is required, but I know nothing about cuDNN.