resnet cpp-package test is broken #14406
Comments
Hey, this is the MXNet Label Bot. |
I wonder how much GPU memory is available in CI. |
In addition, the model in cpp-package does not seem to converge. |
I think it's running on a p3.8xlarge, which should be sufficient to run this test. @marcoabreu, can you confirm? |
Yes, I observed that too. |
Since the input shape of ResNet is (3, 224, 224), I resized the MNIST images from (1, 28, 28) to (3, 224, 224). |
We run on a g3.8xlarge |
Changing batch size to a smaller value will address the OOM issue. |
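To put rough numbers on the resize and batch-size points above, here is a minimal sketch of the input-tensor arithmetic. It assumes float32 inputs with a 4-byte `float`, and `InputBatchBytes` is a hypothetical helper, not part of the cpp-package. It counts only the input batch; activations and gradients, which usually dominate and also grow with batch size, are not included.

```cpp
#include <cstddef>

// Bytes for one float32 input batch of shape (batch, channels, height, width).
// Only the input tensor is counted; weights, activations and gradients are extra.
// Hypothetical helper for illustration, assuming sizeof(float) == 4.
inline std::size_t InputBatchBytes(std::size_t batch, std::size_t channels,
                                   std::size_t height, std::size_t width) {
  return batch * channels * height * width * sizeof(float);
}
```

Under that assumption, `InputBatchBytes(1, 1, 28, 28)` is 3136 bytes and `InputBatchBytes(1, 3, 224, 224)` is 602112 bytes, so each resized image is 192 times larger than the native MNIST image. The result is linear in `batch`, which is why a smaller batch size lowers the footprint.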
@marcoabreu There have been no recent changes to alexnet.cpp, resnet.cpp, or the cpp-package. These tests were part of CI and were passing before. We could change the examples so that they pass on lower-capacity instances, but in my opinion that would not be the right solution. |
Has the infrastructure these tests run on changed recently? It seems the tests would run fine on a p3.8xlarge but fail on a g3.8xlarge (legacy hardware)... @marcoabreu |
As I said, this surfaced with the WaitAll change. WaitAll used to hide exceptions, but with PR #14397 they are thrown. These problems would have existed before but are only surfacing now. |
I tried these examples with the recent code change in "WaitAll()" on p2.16x and c5.18x instances and did not see the crash. However, we still need to add the missing exception handling to the examples to prevent crashes from unhandled exceptions. |
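The missing exception handling mentioned above could look roughly like the sketch below. It assumes the error raised by the example derives from `std::exception`. `RunGuarded` and `FakeTrainingThatRunsOutOfMemory` are hypothetical names used only for illustration; the second one stands in for the real training loop and is not MXNet code.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Runs `body` and turns any escaping exception into a non-zero return code,
// so the example reports the failure instead of ending in std::terminate.
int RunGuarded(const std::function<void()>& body) {
  try {
    body();
    return 0;
  } catch (const std::exception& e) {
    std::cerr << "example failed: " << e.what() << std::endl;
    return 1;
  } catch (...) {
    std::cerr << "example failed: unknown exception" << std::endl;
    return 1;
  }
}

// Hypothetical stand-in for a training loop that hits an out-of-memory error.
void FakeTrainingThatRunsOutOfMemory() {
  throw std::runtime_error("cudaMalloc failed: out of memory");
}
```

An example's `main` could then end with `return RunGuarded(FakeTrainingThatRunsOutOfMemory);`, so a CI run sees a clean non-zero exit code and the error message rather than an abort.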
Hi @leleamol. To reproduce, you will have to use a g3.8xlarge; I was able to reproduce it there. |
Could someone please look at the GPU memory used by the model? |
The last time I checked, it was around 11GB. For now I am going to use a smaller batch_size for the tests, and later @leleamol will revisit and improve the cpp tests. |
@anirudh2290 |
This issue can be closed since the PR is merged. @lanking520 |
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14397/5/pipeline
After adding WaitAll support, the resnet example fails with a cudaMalloc out-of-memory error.