async send/recv, serialize/deserialize using threadpool #7705
Conversation
auto rpc =
    s->stub_->AsyncSendVariable(s->context_.get(), *(req.get()), &cq_);
// Finish will block until failed or the response is received.
rpc->Finish(&s->reply_, &s->status_, (void*)s);
I read the docs and now I have a doubt. This call blocks here, yet from the examples I had assumed Finish only registers the reply, status, tag, etc., and that Next is the blocking call. The docs say:

This function will return when either:
- the server's response message and status have been received,
- the server has returned a non-OK status (no message expected in this case), or
- the call failed for some reason and the library generated a non-OK status.

So why do we still need completion_queue.Next()? Does Finish here mean a message was received but the response has not been fully returned yet? I find grpc's design of this interface rather counterintuitive. We still need to find a proper async interface. ^_^
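For reference, here is a minimal sketch of the canonical gRPC async unary-call flow, modeled on gRPC's greeter_async_client example (the EchoReply / AsyncEcho / stub names are hypothetical stand-ins, not from this PR). In the C++ async API, Finish() does not itself block; it registers the output slots and the tag, and CompletionQueue::Next() is the blocking point, which is why Next() is still needed:

```cpp
// Sketch only: EchoReply, AsyncEcho, stub, and request are hypothetical.
#include <grpcpp/grpcpp.h>

grpc::ClientContext context;
grpc::CompletionQueue cq;
EchoReply reply;      // hypothetical response message type
grpc::Status status;

// Start the call. Finish() only *registers* where the reply/status should
// land and which tag to post on the completion queue; it does not block.
std::unique_ptr<grpc::ClientAsyncResponseReader<EchoReply>> rpc =
    stub->AsyncEcho(&context, request, &cq);
rpc->Finish(&reply, &status, reinterpret_cast<void*>(1));

// Next() is the actual blocking point: it returns once the tag is posted,
// i.e. once reply and status have been filled in (or the call failed).
void* got_tag = nullptr;
bool ok = false;
cq.Next(&got_tag, &ok);
if (ok && got_tag == reinterpret_cast<void*>(1)) {
  // status.ok() tells whether the RPC succeeded; reply is now valid.
}
```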
Given that behavior, the name Finish does make sense.
I tested this. The server response here should not be a user-level message; it should be the grpc framework receiving a response once the caller's data has been fully sent, signaling that the send completed (or some other condition). The docs say:

This function will return when either:
- the server's response message and status have been received,
- the server has returned a non-OK status (no message expected in this case), or
- the call failed for some reason and the library generated a non-OK status.

How I tested: even if the server does not process the message the client sent, the client still receives the server response as long as the framework receives the data successfully. So the server response is not the user RPC reply. That said, the previous implementation was indeed not fully async: it needed a thread_pool to do the sending.
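To make the client fully async, one common pattern is to drain the completion queue on a single dedicated thread instead of blocking inside each call. A minimal sketch, assuming hypothetical RPCClient / ClientCall types (ClientCall being the per-call state that owns reply_, status_, and context_):

```cpp
// Runs on one dedicated thread; Next() blocks until any outstanding call
// completes or the queue is shut down, regardless of who issued the call.
void RPCClient::Proceed() {
  void* tag = nullptr;
  bool ok = false;
  while (cq_.Next(&tag, &ok)) {
    auto* call = static_cast<ClientCall*>(tag);  // hypothetical per-call state
    if (ok && call->status_.ok()) {
      call->OnCompleted();           // e.g. hand deserialization to a worker pool
    } else {
      call->OnError(call->status_);  // framework failure or non-OK status
    }
    delete call;  // the tag owns the call's reply_/status_/context_
  }
}
```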
Also, a guess: the Status object may be reused, being updated continuously over the course of the RPC.
Thanks! So it seems Finish does not receive the processing result.
new sendrecv::VariableMessage());
SerializeToMessage(var_name, var, ctx, req.get());

std::thread thread([req, var_h, ch, time_out, this] {
If there are many parameters, too many threads will be spawned, which could cause problems.
Agreed. Creating a thread on every call is not recommended; a threadpool could be used instead.
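For example, the per-call std::thread above could become a task handed to a shared pool. A sketch, where pool_ and its Run() method are illustrative placeholders (the actual Paddle ThreadPool API may differ):

```cpp
// Enqueue the work on a shared pool instead of spawning a thread per call.
// `pool_` and Run() are assumptions for illustration, not the exact API.
pool_->Run([req, var_h, ch, time_out, this] {
  // Same body as the old per-call thread: serialize the variable,
  // then issue the async gRPC send.
});
```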
Thanks! Done.
Sorry for the late review. The number of threads in the global ThreadPool is set by std::thread::hardware_concurrency; that is usually the number of CPU cores, but not always. I think that is too few for IO threads; maybe we need a separate threadpool to handle IO. But we can merge this PR first and improve it later.
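One way to decouple the two workloads would be a separate, larger pool for IO-bound work. A sketch, where the ThreadPool constructor and the 4x factor are illustrative assumptions, not from this PR:

```cpp
#include <algorithm>
#include <thread>

// hardware_concurrency() may return 0 if undetectable, so clamp to >= 1.
const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
ThreadPool cpu_pool(cores);      // compute-bound: serialize/deserialize
ThreadPool io_pool(cores * 4);   // IO-bound threads mostly block on the network
```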
@gongweibao @typhoonzero thanks for the feedback! I removed the retry code since the error can no longer be reproduced on the latest develop branch, and the retry logic needs more thought before we implement it. This PR now only does concurrent send/recv and serialization/deserialization.
LGTM! Just a reminder: parallel_for also uses the same thread pool, which is a singleton. When we need to do multi-thread + multi-node training, we must separate these two threadpools.
+1
LGTM++
Thanks! @typhoonzero @Yancey1989 I would not worry too much about the threadpool, because this PR for the most part only runs serialization and deserialization on the threadpool. The send and recv are scheduled internally by grpc, since @gongweibao implemented grpc async send/recv. I know there is a
Some explanations:
This PR adds a performance optimization: send/recv can use the threadpool to serialize/deserialize tensors. (Earlier, this PR also tried to add retry logic; now it only handles the threadpool performance optimization.)
Fix: #7801