Use store for gloo process group #40629

Merged · 68 commits · Mar 18, 2022
Commits:
9ba08b1
rename TensorBase interface data_type() to dtype()
zyfncg Nov 16, 2021
3c1afc0
rename type to dtype of TensorMeta
zyfncg Nov 17, 2021
288f086
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
zyfncg Nov 17, 2021
701a0bd
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
zyfncg Nov 17, 2021
7bc3cbb
merge the code
zyfncg Nov 17, 2021
7b79b03
merge the code
zyfncg Nov 17, 2021
471a1bf
fix the problem caused by merge conflicts
zyfncg Nov 18, 2021
d39a1d9
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
zyfncg Nov 19, 2021
835e415
fix CI bug caused by the type of tensor_meta
zyfncg Nov 19, 2021
ab60a6d
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
Nov 19, 2021
471741f
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
Dec 20, 2021
691056a
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
Jan 20, 2022
9fc70fe
[Eager] Support eager grad interface, draft version
veyron95 Mar 4, 2022
ba8d79e
Support eager grad interface with allow_unused and multi startup_op
veyron95 Mar 7, 2022
137db9d
Fix code format
veyron95 Mar 8, 2022
1a18aa2
Fix allow_unused case, return PyNone if tensor is not initialized
veyron95 Mar 8, 2022
d09ec3b
Support output's stop_gradient related to create_graph
veyron95 Mar 8, 2022
f84f2be
Support grad exception case in eager mode, fix coverage CI
veyron95 Mar 8, 2022
733672e
Update ToPyObject, return PyNone if not initialized
veyron95 Mar 8, 2022
68b1991
AccumulationNode add FLAGS_retain_grad_for_all_tensor
veyron95 Mar 8, 2022
7665d63
Fix ci issue
veyron95 Mar 9, 2022
86393f5
Fix CI issue
veyron95 Mar 9, 2022
7dc697a
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
Mar 9, 2022
c653ec0
fix, use core.eager.Tensor
veyron95 Mar 9, 2022
9156cea
Add func SetBufferSlotRankZeros for GradTensorHolder
veyron95 Mar 9, 2022
6fd613d
Support retain_graph by using ClearTensorWrappers
veyron95 Mar 9, 2022
58731e9
Support retain_graph by using ClearTensorWrappers
veyron95 Mar 9, 2022
a88f9b1
Update retain_graph and no_grad_vars related test case
veyron95 Mar 9, 2022
778719b
Update code gen logic for ClearTensorWrappers
veyron95 Mar 10, 2022
65cf9e3
Fix by override statement
veyron95 Mar 10, 2022
af7b919
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
veyron95 Mar 10, 2022
4d3b57d
fix override func args
veyron95 Mar 10, 2022
415ff65
Support retain_graph, update unit tests
veyron95 Mar 10, 2022
bb283ce
Updated ClearTensorWrappers logic
veyron95 Mar 10, 2022
e548c22
fix grad python interface
veyron95 Mar 11, 2022
519c9a6
Use deep copy and update unit tests
veyron95 Mar 11, 2022
1fbc61b
Polish code
veyron95 Mar 11, 2022
c0a2b8b
Polish code
veyron95 Mar 11, 2022
536a28b
Fix CI issue, deep copy only used when user sets grad_tensors
veyron95 Mar 11, 2022
2417858
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
veyron95 Mar 11, 2022
34fa7c0
Fix CI, use Backward instead of RunBackward
veyron95 Mar 11, 2022
1b89072
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
veyron95 Mar 12, 2022
af7f058
Fix CI, Declare kernel explicitly in test file
veyron95 Mar 12, 2022
f397b8f
Polish, remove vector of TensorWrapper
veyron95 Mar 14, 2022
e3f9826
Refactor the logic of grad/backward, polish codes
veyron95 Mar 14, 2022
27830a9
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
veyron95 Mar 14, 2022
7ede919
Update code after merge upstream develop
veyron95 Mar 14, 2022
f9adf49
Polish after merge upstream develop
veyron95 Mar 14, 2022
2fe3b9f
Update to adapt new GradNodeBase superclass
veyron95 Mar 14, 2022
90e97d6
Fix error introduced during conflict resolution
veyron95 Mar 14, 2022
d18697a
Update purify potential_startup_nodes logic
veyron95 Mar 15, 2022
1b5eac2
Fix errors
veyron95 Mar 15, 2022
f4e42e2
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
veyron95 Mar 15, 2022
58a03b5
Polish code
veyron95 Mar 15, 2022
b04e9a9
Remove useless args for ToPyObject
veyron95 Mar 15, 2022
c7bd6fc
Remove useless TensorWrappersSet
veyron95 Mar 15, 2022
d3f0397
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
Mar 15, 2022
ac85d81
Fix code-format, re-install pre-commit
veyron95 Mar 16, 2022
8312d2d
Fix pre-process logic for potential_startup_ops
veyron95 Mar 16, 2022
441bc81
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
veyron95 Mar 16, 2022
f187384
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
Mar 16, 2022
e0a115b
use_store_for_gloo
Mar 16, 2022
326eee5
Update unit tests, use eager mode
veyron95 Mar 16, 2022
f8c6288
update
Mar 16, 2022
a0ec433
Merge branch 'develop' of /~https://github.com/PaddlePaddle/Paddle into…
veyron95 Mar 17, 2022
df99eea
Fix conflicts
veyron95 Mar 17, 2022
ace610d
update
Mar 17, 2022
1aada45
Merge branch 'pr40655' into use_store_for_gloo
Mar 17, 2022
8 changes: 4 additions & 4 deletions — paddle/fluid/distributed/collective/ProcessGroupGloo.cc

@@ -171,10 +171,10 @@ ProcessGroupGloo::GlooTask::GlooTask(int rank,
       "Only CPU place is supported for ProcessGroupGloo."));
 }
 
-ProcessGroupGloo::ProcessGroupGloo(const std::shared_ptr<GlooStore>& store,
-                                   int rank, int world_size,
-                                   const std::shared_ptr<GlooOptions> options)
-    : ProcessGroup(rank, world_size), _tag(0), _store(store) {
+ProcessGroupGloo::ProcessGroupGloo(
+    const std::shared_ptr<paddle::distributed::Store>& store, int rank,
+    int world_size, const std::shared_ptr<GlooOptions> options)
+    : ProcessGroup(rank, world_size), _tag(0), _store(new GlooStore(store)) {
   _context = std::make_shared<gloo::rendezvous::Context>(rank, world_size);
   auto prefix_store =
       ::gloo::rendezvous::PrefixStore(std::to_string(0), *_store);
13 changes: 6 additions & 7 deletions — paddle/fluid/distributed/collective/ProcessGroupGloo.h

@@ -52,8 +52,7 @@ class ProcessGroupGloo : public ProcessGroup {
 
   class GlooStore : public ::gloo::rendezvous::Store {
    public:
-    explicit GlooStore(
-        const std::shared_ptr<paddle::distributed::TCPStore>& store)
+    explicit GlooStore(const std::shared_ptr<paddle::distributed::Store>& store)
         : _store(store) {}
 
     ~GlooStore() = default;
@@ -87,7 +86,7 @@ class ProcessGroupGloo : public ProcessGroup {
     }
 
    protected:
-    std::shared_ptr<paddle::distributed::TCPStore> _store;
+    std::shared_ptr<paddle::distributed::Store> _store;
   };
 
   class GlooOptions {
@@ -100,9 +99,9 @@ class ProcessGroupGloo : public ProcessGroup {
     std::shared_ptr<::gloo::transport::Device> device;
   };
 
-  explicit ProcessGroupGloo(const std::shared_ptr<GlooStore>& store, int rank,
-                            int world_size,
-                            std::shared_ptr<GlooOptions> options);
+  explicit ProcessGroupGloo(
+      const std::shared_ptr<paddle::distributed::Store>& store, int rank,
+      int world_size, std::shared_ptr<GlooOptions> options);
 
   ~ProcessGroupGloo() = default;
 
@@ -145,7 +144,7 @@ class ProcessGroupGloo : public ProcessGroup {
  protected:
   uint32_t _tag;
   std::shared_ptr<gloo::rendezvous::Context> _context;
-  std::shared_ptr<GlooStore> _store;
+  std::shared_ptr<::gloo::rendezvous::Store> _store;
 };
 
 }  // namespace distributed
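Net effect of the two diffs above: `GlooStore` becomes an internal adapter. `ProcessGroupGloo` now accepts Paddle's abstract `paddle::distributed::Store` (rather than the concrete `TCPStore`), wraps it in a `GlooStore`, and holds the result through the gloo-facing base type `::gloo::rendezvous::Store`. The delegating methods of `GlooStore` are elided from the diff; the sketch below shows the presumed shape of that delegation. The header path and the `set`/`get`/`wait` signatures on Paddle's `Store` are assumptions, not taken from this PR.

```cpp
#include <chrono>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include <gloo/rendezvous/store.h>

#include "paddle/fluid/distributed/store/store.h"  // assumed header path

// Minimal sketch of the adapter pattern: forward gloo's rendezvous calls
// to Paddle's generic Store, so any Store implementation (TCPStore
// included) can back a gloo process group.
class GlooStoreSketch : public ::gloo::rendezvous::Store {
 public:
  explicit GlooStoreSketch(
      const std::shared_ptr<paddle::distributed::Store>& store)
      : _store(store) {}

  // gloo hands us char buffers; Paddle's Store is assumed to take bytes.
  void set(const std::string& key, const std::vector<char>& value) override {
    _store->set(key, std::vector<uint8_t>(value.begin(), value.end()));
  }

  std::vector<char> get(const std::string& key) override {
    auto value = _store->get(key);
    return std::vector<char>(value.begin(), value.end());
  }

  // gloo waits on a batch of keys; delegate key by key.
  void wait(const std::vector<std::string>& keys) override {
    for (const auto& key : keys) {
      _store->wait(key);
    }
  }

  void wait(const std::vector<std::string>& keys,
            const std::chrono::milliseconds& /*timeout*/) override {
    wait(keys);  // timeout propagation omitted in this sketch
  }

 private:
  std::shared_ptr<paddle::distributed::Store> _store;
};
```

The payoff of coding against the abstract `Store` is that any store implementation can back the rendezvous, which is what lets the Python-side `GlooStore` wrapper be deleted in the pybind diff further down.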
2 changes: 1 addition & 1 deletion — paddle/fluid/eager/backward.cc

@@ -370,7 +370,7 @@ std::vector<paddle::experimental::Tensor> RunBackward(
       if (grad_tensors[i].is_initialized()) {
         // Deep copy
         paddle::experimental::Tensor tmp_tensor;
-        tmp_tensor.copy_(grad_tensors[i], true);
+        tmp_tensor.copy_(grad_tensors[i], grad_tensors[i].inner_place(), true);
         node_input_buffers_dict[grad_node]->add(input_info.first,
                                                 input_info.second, tmp_tensor);
       } else {
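One note on the backward.cc change: it tracks an apparent extension of the eager `Tensor::copy_` API, which now takes the destination place explicitly. A minimal sketch of the call-site migration, with the argument meanings inferred from the diff rather than from the API headers:

```cpp
// Call-site migration as inferred from the diff (argument names assumed):
// passing grad_tensors[i].inner_place() pins the deep copy to the same
// device as the source grad tensor instead of relying on a default place.
paddle::experimental::Tensor tmp_tensor;

// Before: destination place chosen implicitly by copy_.
//   tmp_tensor.copy_(grad_tensors[i], /*blocking=*/true);

// After: destination place passed explicitly.
tmp_tensor.copy_(grad_tensors[i], grad_tensors[i].inner_place(),
                 /*blocking=*/true);
```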
20 changes: 4 additions & 16 deletions — paddle/fluid/pybind/distributed_py.cc

@@ -235,25 +235,13 @@ void BindDistributed(py::module *m) {
            py::call_guard<py::gil_scoped_release>());
 
 #if defined(PADDLE_WITH_GLOO)
-  py::class_<GlooOptions>(*m, "GlooOptions")
-      .def(py::init<>())
-      .def_readwrite("_device", &GlooOptions::device)
-      .def_static("create", &GlooOptions::create);
-
-  py::class_<GlooStore, std::shared_ptr<GlooStore>>(*m, "GlooStore")
-      .def(py::init(
-               [](const std::shared_ptr<paddle::distributed::TCPStore> &store) {
-                 return std::make_shared<GlooStore>(store);
-               }),
-           py::call_guard<py::gil_scoped_release>());
-
   py::class_<ProcessGroupGloo, std::shared_ptr<ProcessGroupGloo>>(
       *m, "ProcessGroupGloo", ProcessGroup)
-      .def(py::init<const std::shared_ptr<GlooStore> &, int, int,
-                    std::shared_ptr<GlooOptions> &>(),
+      .def(py::init<const std::shared_ptr<paddle::distributed::Store> &, int,
+                    int, std::shared_ptr<GlooOptions> &>(),
           py::call_guard<py::gil_scoped_release>())
-      .def(py::init([](const std::shared_ptr<GlooStore> &store, int rank,
-                       int world_size) {
+      .def(py::init([](const std::shared_ptr<paddle::distributed::Store> &store,
+                       int rank, int world_size) {
         auto opts = GlooOptions::create();
         char *ifname = getenv(GLOO_SOCKET_IFNAME_ENV.c_str());
         if (ifname && strlen(ifname) > 1) {
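The second `py::init` overload above is cut off by the diff view. For illustration only, here is a plausible completion: it builds default `GlooOptions`, honors `GLOO_SOCKET_IFNAME` when picking the transport device, and forwards everything to the main constructor. The `createDeviceForInterface` / `createDefaultDevice` helpers are assumed names, not confirmed by this diff.

```cpp
      // Illustration only: plausible remainder of the truncated lambda.
      // createDeviceForInterface / createDefaultDevice are assumed helpers.
      .def(py::init([](const std::shared_ptr<paddle::distributed::Store> &store,
                       int rank, int world_size) {
             auto opts = GlooOptions::create();
             char *ifname = getenv(GLOO_SOCKET_IFNAME_ENV.c_str());
             if (ifname && strlen(ifname) > 1) {
               // Bind gloo's transport to the interface named in the env var.
               opts->device = ProcessGroupGloo::createDeviceForInterface(
                   std::string(ifname));
             } else {
               // No interface requested; fall back to a default device.
               opts->device = ProcessGroupGloo::createDefaultDevice();
             }
             return std::make_shared<ProcessGroupGloo>(store, rank, world_size,
                                                       opts);
           }),
           py::call_guard<py::gil_scoped_release>());
```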
4 changes: 1 addition & 3 deletions — python/paddle/fluid/tests/unittests/process_group_gloo.py

@@ -47,9 +47,7 @@ def test_create_process_group_gloo(self):
         is_master = True if rank == 0 else False
         store = paddle.fluid.core.TCPStore("127.0.0.1", 6172, is_master,
                                            nranks, datetime.timedelta(0))
-        gloo_store = paddle.fluid.core.GlooStore(store)
-        opt = paddle.fluid.core.GlooOptions()
-        pg = paddle.fluid.core.ProcessGroupGloo(gloo_store, rank, nranks)
+        pg = paddle.fluid.core.ProcessGroupGloo(store, rank, nranks)
 
         # test allreduce sum
         # rank 0