Flakey test: test_operator_gpu.py:test_rnntanh_bidirectional #15034
Comments
The refactor of legacy code is not easy, and we have taken on these tasks to move MXNet forward.
@pengzhao-intel It seems there is another refactor in #14713. Will it cause more behavior similar to this issue?
@DickJC123 I can see many discussions about resources and storage in #14476. Could you please take a look at them? I hope they address part of your questions. Thanks.
I experimented with reverting the workspace and dropout state handling changes of commit 1c49e40 to the original approach. This eliminated the failures seen on the P40. I will be making a PR of this tomorrow.
cuDNN's cudnnSetDropoutDescriptor call is very slow if the dropout state space is not reused.
I discovered a new issue with commit #14476: its use of kCuDNNDropoutDesc and the cuDNN API that supports it is only possible starting with cuDNN 7.0. From this commit onward, MXNet master no longer compiles against cuDNN 6.0. I will work up a PR that uses the new dropout state handling approach only when it is available, thus eliminating the inadvertent 'breaking change.'
cuDNN support for specific versions has never been promised in MXNet. The only CUDA version that required compiling with cuDNN 6.0 was CUDA 7.5, which is no longer supported in MXNet.
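For reference, here is a standalone sketch (not MXNet's code) of the cuDNN dropout-descriptor calls discussed above; the dropout rate and seed are arbitrary. cudnnSetDropoutDescriptor seeds and fills the RNG state buffer (the expensive step), while cudnnRestoreDropoutDescriptor attaches a descriptor to an already-initialized buffer and only exists from cuDNN 7.0 onward:

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK_CUDNN(call)                                          \
  do {                                                             \
    cudnnStatus_t st = (call);                                      \
    if (st != CUDNN_STATUS_SUCCESS) {                               \
      std::printf("cuDNN error %d at line %d\n", st, __LINE__);     \
      return 1;                                                     \
    }                                                               \
  } while (0)

int main() {
  cudnnHandle_t handle;
  CHECK_CUDNN(cudnnCreate(&handle));

  // Query how large the dropout RNG state buffer must be and allocate it once.
  size_t state_bytes = 0;
  CHECK_CUDNN(cudnnDropoutGetStatesSize(handle, &state_bytes));
  void* states = nullptr;
  cudaMalloc(&states, state_bytes);  // error check omitted in this sketch

  cudnnDropoutDescriptor_t desc;
  CHECK_CUDNN(cudnnCreateDropoutDescriptor(&desc));

  // Expensive path: initializes the RNG states in 'states' from the seed.
  // Doing this whenever the state buffer is not reused is the slowness
  // referred to above.
  CHECK_CUDNN(cudnnSetDropoutDescriptor(desc, handle, 0.5f,
                                        states, state_bytes, /*seed=*/1234ULL));

#if CUDNN_VERSION >= 7000
  // Cheap path (cuDNN >= 7.0 only): point a descriptor at an
  // already-initialized state buffer without re-running the RNG setup.
  CHECK_CUDNN(cudnnRestoreDropoutDescriptor(desc, handle, 0.5f,
                                            states, state_bytes, 1234ULL));
#endif

  cudnnDestroyDropoutDescriptor(desc);
  cudaFree(states);
  cudnnDestroy(handle);
  return 0;
}
```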
I found that moving the RNN workspace from a permanent per-instance allocation back to the prior (and typical) approach of using the TempSpace resource corrected the test flakiness on the P40. I won't touch the dropout descriptor handling, nor the fact that it implicitly forces MXNet users to cuDNN v7.0. PR shortly.
Description
The failures are seen infrequently and nondeterministically (but within 1000 trials) when the test is run on an NVIDIA P40 GPU. Based on initial investigation, the problem is introduced by commit 1c49e40, which is not too surprising given the sizeable refactoring of the RNN code in that commit.
Because the P40 has far fewer compute resources than the P100 and V100, I suspect a timing-related issue. No failures are seen on P100 or V100, nor on the P40 with a checkout of the commit prior to 1c49e40. Looking over that commit, I see changes in how the various 'spaces' are handled in the GPU case. Maybe the commit author @lihaofd can chime in on the need/motivation for these changes:
Prior to the commit, the 'workspace' (as sized by cudnnGetRNNWorkspaceSize) was allocated from MXNet's TempSpace. With the commit, it became a permanent per-instance allocation.
Also, prior to the commit, the dropout state space was a permanent per-instance allocation, while with the commit it became managed by the MXNet context resources (and swapped in/out across the various instance uses). While I understand that MXNet is set up to manage the dropout state, is there any other motivation for making this switch?
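To make the contrast concrete, here is a schematic sketch in the style of MXNet's C++ backend (not the actual rnn-inl.h code) of the two allocation patterns being compared; the names GetTempWorkspace and PermanentWorkspace, the float element type, and the use of requested[0] are illustrative assumptions:

```cpp
#include <mxnet/base.h>           // mxnet::Context, mxnet::gpu
#include <mxnet/op_attr_types.h>  // mxnet::OpContext
#include <mxnet/resource.h>       // mxnet::Resource (TempSpace)
#include <mxnet/storage.h>        // mxnet::Storage

// Style A (before 1c49e40): scratch space is requested from the shared
// TempSpace resource on every Forward() call.  The operator must have
// declared ResourceRequest::kTempSpace so that ctx.requested[0] is populated;
// the buffer is owned and recycled by the context, nothing persists on the op.
inline mshadow::Tensor<mxnet::gpu, 1, float>
GetTempWorkspace(const mxnet::OpContext& ctx, size_t num_floats,
                 mshadow::Stream<mxnet::gpu>* s) {
  return ctx.requested[0].get_space_typed<mxnet::gpu, 1, float>(
      mshadow::Shape1(num_floats), s);
}

// Style B (after 1c49e40): the buffer is allocated once from the storage
// manager and kept for the lifetime of the operator instance.
struct PermanentWorkspace {
  mxnet::Storage::Handle handle;
  PermanentWorkspace(size_t bytes, mxnet::Context dev) {
    handle = mxnet::Storage::Get()->Alloc(bytes, dev);
  }
  ~PermanentWorkspace() { mxnet::Storage::Get()->Free(handle); }
  void* dptr() const { return handle.dptr; }
};
```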
When the test fails, the output of the non-fused model is random garbage. Supporting the notion that a race condition exists, the test failures go away when a waitall() is inserted in the test_operator.py function check_rnn_consistency.
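For reference, the waitall() used in that experiment is a global barrier: the Python call wraps the C API shown in this sketch, which blocks the calling thread until every operation queued on MXNet's asynchronous engine has completed (MXNDArrayWaitAll is the real C API entry point; the wrapper function name is illustrative):

```cpp
#include <mxnet/c_api.h>

// mx.nd.waitall() in the Python test wraps MXNDArrayWaitAll(), which blocks
// until all asynchronously scheduled engine work (from both RNN models) has
// finished.  Returns 0 on success, like the rest of the C API.
int wait_for_all_engine_work() {
  return MXNDArrayWaitAll();
}
```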
@ptrendx @eric-haibin-lin, I'd like to see this resolved by the 1.5 code freeze.
Environment info (Required)
Error Message:
Two example outputs shown:
Second failure: note the difference in the second (unfused) RNN model only.
Steps to reproduce
MXNET_TEST_COUNT=1000 MXNET_TEST_SEED=42 nosetests --verbose -s --logging-level=DEBUG tests/python/gpu/test_operator_gpu.py:test_rnntanh_bidirectional
What have you tried to solve it?
See above discussion.