[CI][NightlyTestsForBinaries] Test Large Tensor: CPU killing node instance #14980
Comments
Hey, this is the MXNet Label Bot.
@mxnet-label-bot add [test]
@ChaiBapchya @apeforest @access2rohit @anirudh2290 @larroy The discussion has come up that this test needs to be refactored so that several gigs of memory are not needed to run the tests. Pedro brought this up in an email with the subject "Tests with large inputs and rationalize resource usage. Better testing strategies..." Could someone please take the lead on this? The test is disabled in the Jenkins steps for now.
@larroy @Chancebair Any suggestions for updating the tests? We still need to test large tensors on CPU. Please advise what the best practice is.
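One possible direction for the "best practice" question, sketched here only as an illustration: guard memory-hungry tests so they are skipped on nodes that are too small, instead of letting them take the node down. The helper names `total_ram_bytes` and `requires_ram`, the test class, and the 64 GiB threshold are all hypothetical and not part of MXNet; the check looks at total physical RAM rather than what is currently free, so it is a coarse guard at best.

```python
import os
import unittest

GIB = 1024 ** 3


def total_ram_bytes():
    """Return total physical RAM in bytes, or None if it cannot be determined."""
    try:
        # Available on Linux and most Unix systems; not on Windows.
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None


def requires_ram(min_gib):
    """Hypothetical decorator: skip a test unless the host has min_gib GiB of RAM."""
    total = total_ram_bytes()
    too_small = total is None or total < min_gib * GIB
    return unittest.skipIf(too_small, f"needs at least {min_gib} GiB of RAM")


class LargeTensorTests(unittest.TestCase):
    # The threshold is an assumed value for illustration only.
    @requires_ram(64)
    def test_large_tensor_placeholder(self):
        # A real large-tensor test body would go here.
        pass
```

A guard like this only decides whether to run; it does not reduce how much memory the test itself needs, which is the refactoring asked for above.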
@apeforest I'm talking to @access2rohit to understand how we can test this better. I will update this.
The discussion was moved to devlist.
@Chancebair @perdasilva Can we rerun the linked CI run? Is the failure related to the test or to something else? (What's the root cause?)
I think the tests were causing the machine to run out of RAM and crash. Should I re-run it?
How much RAM do we have in that machine? If you are confident it is not going to cause problems in your fleet, I'd say let's run it to see if the problem persists.
I think the CPU instances are c5.4xlarge - so 32GB. There's been a PR to disable the CPU tests. I would suggest re-enabling it in a PR, copying the nightly tests for binaries Jenkinsfile content into the PR's Jenkinsfile, and seeing if it works. I'm sorry I can't do it atm as I'm on sick leave =S
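For a sense of scale on the 32GB figure, here is a rough back-of-the-envelope sketch. It assumes a "large" tensor is one whose element count no longer fits in a signed 32-bit index and that elements are 4-byte floats; the helper names `tensor_bytes` and `fits_in_ram` are hypothetical, and the actual sizes used by the nightly test are not stated in this thread.

```python
INT32_MAX = 2 ** 31 - 1  # largest value a signed 32-bit index can hold
GIB = 1024 ** 3


def tensor_bytes(num_elements, bytes_per_element=4):
    """Memory needed for one dense tensor (4 bytes per element assumes float32)."""
    return num_elements * bytes_per_element


def fits_in_ram(num_tensors, num_elements, ram_gib, bytes_per_element=4):
    """Check whether num_tensors tensors of the given size fit in ram_gib GiB."""
    needed = num_tensors * tensor_bytes(num_elements, bytes_per_element)
    return needed <= ram_gib * GIB


# Smallest element count that overflows a signed 32-bit index.
smallest_large = INT32_MAX + 1

print(tensor_bytes(smallest_large) / GIB)   # 8.0
print(fits_in_ram(3, smallest_large, 32))   # True
print(fits_in_ram(5, smallest_large, 32))   # False
```

Under these assumptions a single such tensor already takes 8 GiB, so an operator that holds a couple of inputs, an output, and a temporary at the same time can exhaust a 32GB node before accounting for the OS or other processes.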
@szha The link shared above has the flag -DMSHADOW_INT64_TENSOR_SIZE=0, which is not a Large Tensor build. Currently the Large Tensor tests run only in nightly (not in the CI that runs on every PR). Is it possible that you copied a different link?
Description
It seems "Test Large Tensor: CPU" is killing the underlying CI node somehow.
Complete log:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/312/pipeline/