forked from horovod/horovod
[DO NOT MERGE] Added error handling in MXNet #19
Open. apeforest wants to merge 47 commits into mxnet_feature_fp16 from develop/mxnet
Changes from 39 commits
d625394
Make mxnet build successful in CPU
apeforest 02ab771
update required mxnet version
apeforest 9abcc4e
remove outdated comment
apeforest cd096e4
remove commented line
apeforest f9b2083
Merge remote-tracking branch 'origin/mxnet_feature_fp16' into develop…
apeforest b0e2e58
fix test in CPU
apeforest dd4f9e2
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest b617e14
refactor
apeforest 84ed58e
Merge branch 'mxnet_feature_fp16' into develop/mxnet
yuxihu 2b902ae
link nccl to mpi_lib for mxnet
yuxihu ff57e51
Merge branch 'develop/mxnet' of /~https://github.com/ctcyang/horovod in…
apeforest 6013957
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest bc47aa9
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest 297e79a
make mxnet build process the same as tensorflow
apeforest f28ba01
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest ab78201
compute allreduce average in C++ to avoid perf deg
apeforest dc62625
rename variable
apeforest c56322f
add mxnet mnist example
apeforest 4eb787e
fix lint
apeforest 3e5491a
reduce epoch and acc check
apeforest 9589209
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest b42f0c5
broadcast initial parames
apeforest 13adbb3
Update README
apeforest b4aa9f2
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest f9c9d73
remove unused handle manager
apeforest dc96acc
renaming variable type
apeforest aaf3d7f
return non empty op name
apeforest 0797570
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest 89ba103
scale learning rate by workers
apeforest 60877b7
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest b3a24db
refactor test_mxnet to make it easier to read
apeforest 6e4b845
fix a bug in building on GPU
apeforest 710c703
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest 0112e6a
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest 4a1c010
polish imagenet example
apeforest 61741e8
add handle_manager
apeforest c24d0bd
error handling in MXNet
apeforest effd043
Merge branch 'mxnet_feature_fp16' into develop/mxnet
apeforest 1c9443f
add exception handling
apeforest 9b9bab1
rename c_api_common
apeforest 2d64e05
wrap MXNet C API with exception handling
apeforest 1cd08be
remove unused function declaration
apeforest 77cbb8b
fix a typo
apeforest 4f1a626
fix a bug
apeforest c1c476c
fix build error
apeforest 51f81d0
Merge branch 'mxnet_feature_fp16' into develop/mxnet
75c56f7
Merge remote-tracking branch 'origin/mxnet_feature_fp16' into develop…
@@ -0,0 +1,76 @@
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// =============================================================================

#include "handle_manager.h"

namespace horovod {
namespace mxnet {

typedef ::mxnet::Engine::CallbackOnComplete Callback;

int HandleManager::AllocateHandle() {
  int handle = last_handle_.fetch_add(1) + 1;  // atomically reserve the next id
  std::lock_guard<std::mutex> guard(mutex_);
  results_[handle] = nullptr;  // nullptr means the operation is still pending
  return handle;
}

void HandleManager::MarkDone(int handle, const Status& status) {
  std::lock_guard<std::mutex> guard(mutex_);
  results_[handle] = std::make_shared<Status>(status);
}

void HandleManager::AttachCallback(int handle, Callback cb) {
  std::lock_guard<std::mutex> guard(mutex_);
  if (callbacks_.find(handle) == callbacks_.end()) {
    callbacks_[handle] = std::make_shared<Callback>(cb);
  }
}

void HandleManager::ExecuteCallback(int handle) {
  std::unique_lock<std::mutex> lock(mutex_);
  if (callbacks_.find(handle) == callbacks_.end()) {
    return;
  }
  auto cb_ptr = callbacks_[handle];
  lock.unlock();  // release the mutex before invoking the callback
  if (cb_ptr != nullptr) {
    (*cb_ptr)();
  }
}

bool HandleManager::PollHandle(int handle) {
  std::lock_guard<std::mutex> guard(mutex_);
  if (results_.find(handle) == results_.end()) {
    throw std::invalid_argument("Handle " + std::to_string(handle) +
                                " was not created or has been cleared.");
  }
  return results_[handle] != nullptr;
}

std::shared_ptr<Status> HandleManager::ReleaseHandle(int handle) {
  std::lock_guard<std::mutex> guard(mutex_);
  if (results_.find(handle) == results_.end()) {
    throw std::invalid_argument("Handle " + std::to_string(handle) +
                                " was not created or has been cleared.");
  }
  auto status = results_[handle];
  results_.erase(handle);
  callbacks_.erase(handle);
  return status;
}

} // namespace mxnet
} // namespace horovod
@@ -0,0 +1,56 @@
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// =============================================================================

#ifndef HOROVOD_MXNET_HANDLE_MANAGER_H
#define HOROVOD_MXNET_HANDLE_MANAGER_H

#include <atomic>
#include <memory>
#include <mutex>
#include <unordered_map>

#include "../common/common.h"

#include <mxnet/engine.h>

namespace horovod {
namespace mxnet {

using namespace horovod::common;

typedef ::mxnet::Engine Engine;
typedef ::mxnet::NDArray NDArray;
typedef ::mxnet::Engine::CallbackOnComplete Callback;

class HandleManager {
public:
  int AllocateHandle();
  void AttachCallback(int handle, Callback cb);
  void MarkDone(int handle, const Status& status);
  void ExecuteCallback(int handle);
  bool PollHandle(int handle);
  std::shared_ptr<Status> ReleaseHandle(int handle);

private:
  std::atomic_int last_handle_{0};  // explicit init, so the first handle is 1
  std::unordered_map<int, std::shared_ptr<Status>> results_;
  std::unordered_map<int, std::shared_ptr<Callback>> callbacks_;
  std::mutex mutex_;
};

} // namespace mxnet
} // namespace horovod

#endif // HOROVOD_MXNET_HANDLE_MANAGER_H
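The handle lifecycle in the two files above (allocate, mark done, poll, release) can be tried without MXNet using a reduced stand-in. This is only a sketch under stated assumptions: `SketchStatus` and `SketchHandleManager` are hypothetical names that are not part of this PR, `SketchStatus` is not the real `horovod::common::Status`, and the callback half of the class is left out.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for horovod::common::Status (assumption, not the real type).
struct SketchStatus {
  bool ok;
  std::string reason;
};

// Reduced stand-in for HandleManager: result tracking only, no callbacks.
class SketchHandleManager {
public:
  // Reserves a new handle and records it as pending.
  int AllocateHandle() {
    int handle = last_handle_.fetch_add(1) + 1;
    std::lock_guard<std::mutex> guard(mutex_);
    results_[handle] = nullptr;  // nullptr marks the handle as pending
    return handle;
  }

  // Stores the final status, which makes PollHandle return true.
  void MarkDone(int handle, const SketchStatus& status) {
    std::lock_guard<std::mutex> guard(mutex_);
    results_[handle] = std::make_shared<SketchStatus>(status);
  }

  // Returns true once a status has been stored; throws for unknown handles.
  bool PollHandle(int handle) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = results_.find(handle);
    if (it == results_.end()) {
      throw std::invalid_argument("Handle " + std::to_string(handle) +
                                  " was not created or has been cleared.");
    }
    return it->second != nullptr;
  }

  // Hands back the stored status (possibly null) and forgets the handle.
  std::shared_ptr<SketchStatus> ReleaseHandle(int handle) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = results_.find(handle);
    if (it == results_.end()) {
      throw std::invalid_argument("Handle " + std::to_string(handle) +
                                  " was not created or has been cleared.");
    }
    auto status = it->second;
    results_.erase(it);
    return status;
  }

private:
  std::atomic_int last_handle_{0};
  std::unordered_map<int, std::shared_ptr<SketchStatus>> results_;
  std::mutex mutex_;
};
```

A caller would typically allocate a handle before enqueuing an asynchronous operation, let the completion path call `MarkDone`, poll until `PollHandle` returns true, and then call `ReleaseHandle` exactly once. Polling a released handle throws `std::invalid_argument`.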
we can use lock_guard here
updated
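For context on the lock_guard suggestion, here is a minimal sketch contrasting the two lock types. It assumes a hypothetical `SketchCallbackRegistry` that stores `std::function<void()>` in place of the MXNet `Callback` type, so it is not the code under review. `std::lock_guard` is enough when the mutex is held for the whole scope. `std::unique_lock` is needed only where the mutex must be released early, as when a callback should run outside the lock.

```cpp
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical callback registry, used only to contrast the two lock types.
class SketchCallbackRegistry {
public:
  // The mutex is held until the end of the scope, so std::lock_guard suffices.
  void Attach(int handle, std::function<void()> cb) {
    std::lock_guard<std::mutex> guard(mutex_);
    // emplace does not overwrite, so the first callback for a handle wins.
    callbacks_.emplace(handle, std::move(cb));
  }

  // The mutex is released before the callback runs. That requires
  // std::unique_lock, because std::lock_guard has no unlock().
  // Returns false when no callback is registered for the handle.
  bool Execute(int handle) {
    std::unique_lock<std::mutex> lock(mutex_);
    auto it = callbacks_.find(handle);
    if (it == callbacks_.end()) {
      return false;
    }
    std::function<void()> cb = it->second;  // copy while still locked
    lock.unlock();
    if (cb) {
      cb();  // may safely call Attach or Execute again
    }
    return true;
  }

private:
  std::unordered_map<int, std::function<void()>> callbacks_;
  std::mutex mutex_;
};
```

Because the callback runs after `unlock()`, it can call back into the registry without deadlocking on the non-recursive mutex.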