Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NCCLBcast hang up bug in Parallel Executor #11377

Merged
merged 4 commits into from
Jun 14, 2018

Conversation

velconia
Copy link
Collaborator

@velconia velconia commented Jun 11, 2018

this close #11375

velconia added 2 commits June 11, 2018 23:04
2. Check the memory usage of ALL gpus rather than the first one
…clGroupEnd blocking the exception throwing

2. NOTE the usage of NCCLGroupGuard
@@ -63,7 +64,17 @@ ParallelExecutor::ParallelExecutor(
member_->global_scope_ = scope;
member_->use_cuda_ = exec_strategy.use_cuda_;

// Step 1. Bcast the params to devs.
// Step 1. check all memory usages of all places.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these code is not needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these messages should be print out also, so I didn't remove it.

@@ -41,6 +41,11 @@ inline ncclDataType_t ToNCCLDataType(std::type_index type) {
}
}

// NOTE(minqiyang): according to the ncclGroupEnd documentations:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think we can assume people who develop paddlepaddle with this file is familiar with the NCCL natives.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid this bug happens again, I leave these comments.

platform::dynload::ncclBcast(buffers[i], numel, data_type, 0,
nccl_ctx.comm_, nccl_ctx.stream());
}
member_->nccl_ctxs_->WaitAll();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line may not needed? since ncclgroupend will sync all group calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengduoZH Can you please help take a look at this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@typhoonzero line 185 is necessary,

  • ncclgroupend only ensures that there is only one thread can invoke nccl function, but not sync the calls.
  • the invoking of nccl functions is async, so we need use WaitAll() to ensure the invoking have completed on GPU side.

// This step will create BuddyAllocator for each place, which will
// 1. enforce that the place is avaliable and NOT used.
// 2. avoid ncclBcast hanging up for NOT enough memory to use.
for (size_t i = 0; i < member_->places_.size(); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does these lines was intended to let the process fail fast?

Copy link
Collaborator Author

@velconia velconia Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and print some useful messages

platform::dynload::ncclBcast(buffers[i], numel, data_type, 0,
nccl_ctx.comm_, nccl_ctx.stream());
}
member_->nccl_ctxs_->WaitAll();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@typhoonzero line 185 is necessary,

  • ncclgroupend only ensures that there is only one thread can invoke nccl function, but not sync the calls.
  • the invoking of nccl functions is async, so we need use WaitAll() to ensure the invoking have completed on GPU side.

@@ -145,9 +156,9 @@ void ParallelExecutor::BCastParamsToGPUs(
auto &dims = main_tensor.dims();
if (paddle::platform::is_gpu_place(main_tensor.place())) {
#ifdef PADDLE_WITH_CUDA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the modify of line159~167 is necessary.
Because if the memory of one Place is insufficient, the program will throw an exception in this line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the thrown exception will not be handled properly, this PR was submitted to fix this bug

size_t usage = memory::memory_usage(member_->places_[i]);
VLOG(4) << "Memory usage of device: " << member_->places_[i] << " is "
<< usage << " bytes";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check just should do one time. So maybe it is not appropriate for placing it here.

Copy link
Collaborator Author

@velconia velconia Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because BuddyAllocator's array was a static member in function in this line, so this check can do many times in the constructor in ParallelExecutor, without init duplicated BuddyAllocator

@@ -25,6 +25,7 @@ limitations under the License. */
#include "paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h"
#include "paddle/fluid/framework/details/ssa_graph_builder_factory.h"
#include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h"
#include "paddle/fluid/memory/malloc.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line.

@@ -63,7 +64,7 @@ ParallelExecutor::ParallelExecutor(
member_->global_scope_ = scope;
member_->use_cuda_ = exec_strategy.use_cuda_;

// Step 1. Bcast the params to devs.
// Step 2. Bcast the params to devs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

==> Step 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

当ParallelExecutor运行在已被占用的GPU上时, Trainer夯住
4 participants