This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

enabling build stage gpu_int64 to enable large tensor nightly runs #17546

Merged
merged 1 commit into from
Feb 10, 2020

Conversation

access2rohit
Contributor

@access2rohit access2rohit commented Feb 7, 2020

Description

Fixes the nightly build failure caused by the missing large tensor build artifact, which is required to test large tensor support on a nightly basis.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@access2rohit
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 7, 2020
@access2rohit
Contributor Author

@apeforest can you review this?

@@ -49,7 +49,7 @@ core_logic: {
utils.pack_lib('cpu_int64', mx_cmake_lib)
}
}
},
}*/,
Contributor

@apeforest apeforest Feb 7, 2020

I think this is a little confusing. We are actually testing the CPU context on GPU platforms here. The only reason we don't use a CPU node is that the CPU node type does not have as much memory as the GPU node.

We should either replace the CPU node type with a memory-optimized one such as R5 (this is the ideal solution),

or rename the pipeline stage so that it is clear to people that these tests run on CPUs, and remove the commented-out CPU test here.

Contributor Author

As a first step, I can rename the stage from “GPU: USE_INT64_TENSOR_SIZE” to “USE_INT64_TENSOR_SIZE” so that it can be enabled.

Changing the instance type is a temporary fix. Theoretically, everything should be able to run on CPU, so fixing MXNet memory management would be the ideal solution, but that would take a lot of time. Switching to R5 would also bump up our costs a bit.

Let me know if "USE_INT64_TENSOR_SIZE" makes sense as an interim solution to enable large tensor (LT) tests on nightly first.
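
For readers following the thread, the rename proposed above might look roughly like the sketch below. Only the two stage labels come from this conversation. The node label, workspace path, Docker image name, build function name, and packed artifact name are assumptions made for illustration and may not match the actual Jenkinsfile.

```groovy
// Illustrative sketch only: every name except the stage label is an assumption.
stage('Build') {
  // Previously labelled 'GPU: USE_INT64_TENSOR_SIZE'. The new label drops the
  // 'GPU:' prefix because the tests exercise the CPU context.
  parallel 'USE_INT64_TENSOR_SIZE': {
    // Assumed node label: a GPU node is used only because it has more memory.
    node(NODE_LINUX_GPU) {
      ws('workspace/build-int64') {   // hypothetical workspace path
        utils.init_git()
        // Hypothetical Docker image and build function names.
        utils.docker_run('ubuntu_nightly_gpu', 'build_ubuntu_gpu_large_tensor', false)
        // Hypothetical artifact name, modelled on 'cpu_int64' in the hunk above.
        utils.pack_lib('gpu_int64', mx_cmake_lib)
      }
    }
  }
}
```

With a rename like this, the stage label no longer suggests that GPU code paths are being tested, which is the confusion raised in the review comment.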

@access2rohit access2rohit force-pushed the re-enable_large_tensor branch from 2972ed5 to 1238839 Compare February 7, 2020 22:05
@access2rohit access2rohit force-pushed the re-enable_large_tensor branch from 1238839 to 3886b96 Compare February 7, 2020 22:08
@access2rohit
Contributor Author

@apeforest updated the PR after incorporating your suggestions as discussed offline.

@access2rohit
Contributor Author

@mxnet-label-bot update [pr-awaiting-merge]

@lanking520 lanking520 added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Feb 10, 2020
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge
4 participants