enabling build stage gpu_int64 to enable large tensor nightly runs #17546
@mxnet-label-bot add [pr-awaiting-review]
@apeforest can you review this?
tests/nightly/JenkinsfileForBinaries
@@ -49,7 +49,7 @@ core_logic: {
utils.pack_lib('cpu_int64', mx_cmake_lib)
}
}
}
},
}*/,
I think this is a little confusing. We are actually testing the CPU context on GPU nodes here. The only reason we don't use a CPU node is that the CPU node type does not have as much memory as the GPU node type.
We should either switch the CPU node type to a memory-optimized one such as R5 (this is the ideal solution),
or rename the pipeline stage so that it is clear these tests run on CPUs, and remove the commented-out CPU test here.
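To make the memory point concrete, here is a rough back-of-the-envelope sketch. The element count and the float32 dtype are illustrative assumptions, not figures taken from the nightly tests; the helper names are hypothetical.

```python
# Rough memory estimate for a tensor large enough to need int64 indexing.
# Assumptions (not from the PR): dense float32 data, smallest size past int32.

INT32_MAX = 2**31 - 1  # largest element count addressable with int32 indices


def tensor_bytes(num_elements: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to hold one dense tensor (float32 by default)."""
    return num_elements * bytes_per_element


def gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


# Smallest element count that no longer fits int32 indexing.
large = INT32_MAX + 1

one_tensor = tensor_bytes(large)
# An elementwise op holds at least one input and one output at the same time.
in_and_out = 2 * one_tensor

print(f"one tensor:     {gib(one_tensor):.1f} GiB")
print(f"input + output: {gib(in_and_out):.1f} GiB")
```

Under these assumptions a single operator already needs about 16 GiB before counting any temporary workspace, which is why node memory, rather than the device context, decides where these tests can run.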
I can change “GPU: USE_INT64_TENSOR_SIZE” -> “USE_INT64_TENSOR_SIZE” to enable this first.
Changing the instance type is a temporary fix. Theoretically, everything should be able to run on CPU, so fixing MXNet memory management would be the ideal solution, but that would take a lot of time. Switching to R5 would also bump up our costs a bit.
Let me know if "USE_INT64_TENSOR_SIZE" makes sense as an interim solution to enable large tensor (LT) tests on nightly first.
2972ed5 to 1238839
1238839 to 3886b96
@apeforest updated the PR after incorporating your suggestions as discussed offline.
@mxnet-label-bot update [pr-awaiting-merge]
Description
Fixes the nightly build failure caused by the absence of the large tensor build artifact, which is required for testing large tensor support on a nightly basis.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.