This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

C++ test Core dump DROPOUT_PERF.TimingGPU #9857

Open
marcoabreu opened this issue Feb 22, 2018 · 14 comments

@marcoabreu
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/399/pipeline

Timing: 50 iterations of 10 calls, shape = [50,3,18,32] = 86,400 items
Dropout Operator CPU:  Timing [Forward] 28.45100 ms, avg: 0.05690 ms X 500 passes

[       OK ] DROPOUT_PERF.TimingCPU (2745 ms)
[ RUN      ] DROPOUT_PERF.TimingGPU
Timing: 50 iterations of 10 calls, shape = [1,1,28,28] = 784 items
terminate called after throwing an instance of 'dmlc::Error'
  what():  [22:22:55] ../mshadow/mshadow/./stream_gpu-inl.h:182: Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered

Stack trace returned 10 entries:
[bt] (0) build/tests/mxnet_unit_tests(dmlc::StackTrace[abi:cxx11]()+0x56) [0xe15776]
[bt] (1) build/tests/mxnet_unit_tests(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0xe15d78]
[bt] (2) build/tests/mxnet_unit_tests(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0xb9) [0xe183f9]
[bt] (3) build/tests/mxnet_unit_tests(mxnet::test::op::CoreOpExecutor<float, float>::~CoreOpExecutor()+0x151) [0xe5f321]
[bt] (4) build/tests/mxnet_unit_tests(std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()+0x46) [0xe1b316]
[bt] (5) build/tests/mxnet_unit_tests() [0xf0435e]
[bt] (6) build/tests/mxnet_unit_tests(DROPOUT_PERF_TimingGPU_Test::TestBody()+0x5e1) [0xf06401]
[bt] (7) build/tests/mxnet_unit_tests(void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x43) [0xf36f03]
[bt] (8) build/tests/mxnet_unit_tests(testing::Test::Run()+0xba) [0xf27f3a]
[bt] (9) build/tests/mxnet_unit_tests(testing::TestInfo::Run()+0x118) [0xf28088]

/workspace/tests/ci_build/with_the_same_user: line 47: 83345 Aborted                 (core dumped) sudo -u "#${CI_BUILD_UID}" --preserve-env "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" "HOME=${CI_BUILD_HOME}" ${COMMAND[@]}

script returned exit code 134
@marcoabreu added the C++, Test, and Flaky labels on Feb 22, 2018
@marcoabreu
Contributor Author

@cjolivier01

@cjolivier01
Member

looks like someone broke it, eh?

@marcoabreu
Contributor Author

Yep :(

@cjolivier01
Member

Then it's not "flaky", right? I mean, it has never broken before, right? I also noticed it breaking in another build today.
I wonder if it ever broke before the mkldnn merge?

@marcoabreu
Contributor Author

It's flaky because otherwise it would have been reported during the PR stage.

Well, the number of test failures has increased drastically since the mkldnn merge. But that's why we merged it, right? To get some data early on.

@anirudh2290
Member

Just a side note: should we add a try/catch around the TESTs and mark them as failed, so that the subsequent tests continue to execute?

@marcoabreu
Contributor Author

No, this would waste too many resources. Generally, multiple failures can be fixed by one change. If we didn't follow fail-fast, a lot of time would be wasted. Since PRs should be small, the chance of multiple independent bugs causing multiple test failures is quite small compared to the total number of tests executed for every PR.

The right approach is to fix one test failure, create a new commit and then fix the next test failure. While this sounds pretty wasteful as well, past experience has shown that it is rarely required.

Additionally, we're currently working on something that allows you to run tests locally in exactly the same way the CI does. This will allow you to test yourself before pushing it to the CI.
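
The trade-off discussed above (catching exceptions per test so that later tests still run, versus failing fast) can be sketched in plain C++ as follows. This is a minimal illustration, not MXNet's or GoogleTest's implementation: `TestCase`, `RunTests`, and `ExampleTests` are hypothetical names introduced here.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical test case type for illustration only;
// not part of GoogleTest or MXNet.
struct TestCase {
  std::string name;
  std::function<void()> body;
};

// Runs each test body inside try/catch and returns the number of failures.
// With fail_fast == true the loop stops at the first failing test.
inline int RunTests(const std::vector<TestCase>& tests, bool fail_fast) {
  int failures = 0;
  for (const auto& t : tests) {
    try {
      t.body();
      std::cout << "[ OK ] " << t.name << "\n";
    } catch (const std::exception& e) {
      ++failures;
      std::cout << "[ FAILED ] " << t.name << ": " << e.what() << "\n";
      if (fail_fast) break;  // skip the remaining tests
    }
  }
  return failures;
}

// Three simulated tests: one passes, two throw.
inline std::vector<TestCase> ExampleTests() {
  return {
      {"passes", [] {}},
      {"throws_first", [] { throw std::runtime_error("simulated failure"); }},
      {"throws_second", [] { throw std::runtime_error("another simulated failure"); }},
  };
}
```

Calling `RunTests(ExampleTests(), false)` reports both failing tests and returns 2, while `RunTests(ExampleTests(), true)` stops after the first failure and returns 1. Note that a try/catch only helps when the exception can actually propagate to it. Since C++11, destructors are `noexcept` by default, so an exception escaping one calls `std::terminate`. The stack trace in this issue passes through `~CoreOpExecutor()`, so that may be why the process aborted even though the test body ran under GoogleTest's `HandleExceptionsInMethodIfSupported`.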

@marcoabreu changed the title from "Core dump DROPOUT_PERF.TimingGPU" to "C++ test Core dump DROPOUT_PERF.TimingGPU" on Mar 17, 2018
@rajanksin
Contributor

@marcoabreu I ran this test ~1000 times and couldn't replicate the failure. Can we close this issue?

@KellenSunderland
Contributor

@spidydev @marcoabreu I think there were some crossed wires on this one. This test is actually enabled. Was it at one time disabled?

@rajanksin
Contributor

@KellenSunderland From the history, I don't think the test was ever disabled.

@KellenSunderland
Contributor

@spidydev Agreed. @marcoabreu Do you think we could close this one?

@marcoabreu
Contributor Author

Sure, thanks for following up. I think this was due to the MKLDNN change.

@eric-haibin-lin
Member

eric-haibin-lin commented Jan 19, 2019

@leleamol
Contributor

The issue is not related to C++ API. Removing the C++ label.

@mxnet-label-bot remove [C++]

@marcoabreu removed the C++ label on Feb 14, 2019