async send/recv, serialize/deserialize using threadpool #7705
Conversation
auto rpc =
    s->stub_->AsyncSendVariable(s->context_.get(), *(req.get()), &cq_);
// Finish will block until failed or the response is received.
rpc->Finish(&s->reply_, &s->status_, (void*)s);
I read the docs and now I have a doubt. This call blocks here, yet from the examples I had assumed Finish only registers the reply, status, tag, etc., and that Next is the blocking call. The docs say:

This function will return when either:
- the server's response message and status have been received,
- the server has returned a non-OK status (no message expected in this case), or
- the call failed for some reason and the library generated a non-OK status.

So why do we still need completion_queue.Next()? Does Finish here mean a message was received but the response has not been fully returned yet? I find grpc's design of this interface rather counterintuitive. We still need to find a proper async interface. ^_^
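For reference, here is a minimal sketch of the canonical gRPC async unary-call flow, modeled on gRPC's greeter_async_client example (the EchoReply / AsyncEcho / stub names are hypothetical stand-ins, not from this PR). In the C++ async API, Finish() does not itself block; it registers the output slots and the tag, and CompletionQueue::Next() is the blocking point, which is why Next() is still needed:

```cpp
// Sketch only: EchoReply, AsyncEcho, stub, and request are hypothetical.
#include <grpcpp/grpcpp.h>

grpc::ClientContext context;
grpc::CompletionQueue cq;
EchoReply reply;      // hypothetical response message type
grpc::Status status;

// Start the call. Finish() only *registers* where the reply/status should
// land and which tag to post on the completion queue; it does not block.
std::unique_ptr<grpc::ClientAsyncResponseReader<EchoReply>> rpc =
    stub->AsyncEcho(&context, request, &cq);
rpc->Finish(&reply, &status, reinterpret_cast<void*>(1));

// Next() is the actual blocking point: it returns once the tag is posted,
// i.e. once reply and status have been filled in (or the call failed).
void* got_tag = nullptr;
bool ok = false;
cq.Next(&got_tag, &ok);
if (ok && got_tag == reinterpret_cast<void*>(1)) {
  // status.ok() tells whether the RPC succeeded; reply is now valid.
}
```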
Given that behavior, the name Finish does make sense.
I tested this. The server response here should not be a user-level message; it should be the grpc framework receiving a response once the caller's data has been fully sent, signaling that the send completed (or some other condition). The docs say:

This function will return when either:
- the server's response message and status have been received,
- the server has returned a non-OK status (no message expected in this case), or
- the call failed for some reason and the library generated a non-OK status.

How I tested: even if the server does not process the message the client sent, the client still receives the server response as long as the framework receives the data successfully. So the server response is not the user RPC reply. That said, the previous implementation was indeed not fully async: it needed a thread_pool to do the sending.
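To make the client fully async, one common pattern is to drain the completion queue on a single dedicated thread instead of blocking inside each call. A minimal sketch, assuming hypothetical RPCClient / ClientCall types (ClientCall being the per-call state that owns reply_, status_, and context_):

```cpp
// Runs on one dedicated thread; Next() blocks until any outstanding call
// completes or the queue is shut down, regardless of who issued the call.
void RPCClient::Proceed() {
  void* tag = nullptr;
  bool ok = false;
  while (cq_.Next(&tag, &ok)) {
    auto* call = static_cast<ClientCall*>(tag);  // hypothetical per-call state
    if (ok && call->status_.ok()) {
      call->OnCompleted();           // e.g. hand deserialization to a worker pool
    } else {
      call->OnError(call->status_);  // framework failure or non-OK status
    }
    delete call;  // the tag owns the call's reply_/status_/context_
  }
}
```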
Also, a guess: the Status object may be reused, being updated continuously over the course of the RPC.
Thanks! So it seems Finish does not receive the processing result.
new sendrecv::VariableMessage());
SerializeToMessage(var_name, var, ctx, req.get());

std::thread thread([req, var_h, ch, time_out, this] {
If there are many parameters, too many threads will be spawned, which could cause problems.
Agreed. Creating a thread on every call is not recommended; a threadpool could be used instead.
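For example, the per-call std::thread above could become a task handed to a shared pool. A sketch, where pool_ and its Run() method are illustrative placeholders (the actual Paddle ThreadPool API may differ):

```cpp
// Enqueue the work on a shared pool instead of spawning a thread per call.
// `pool_` and Run() are assumptions for illustration, not the exact API.
pool_->Run([req, var_h, ch, time_out, this] {
  // Same body as the old per-call thread: serialize the variable,
  // then issue the async gRPC send.
});
```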
Thanks! Done.
Sorry for the late review. The number of threads in the global ThreadPool is set by std::thread::hardware_concurrency; that is usually the number of CPU cores, but not always. I think that is too few for IO threads; maybe we need a separate threadpool to handle IO. But we can merge this PR first and improve it later.
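One way to decouple the two workloads would be a separate, larger pool for IO-bound work. A sketch, where the ThreadPool constructor and the 4x factor are illustrative assumptions, not from this PR:

```cpp
#include <algorithm>
#include <thread>

// hardware_concurrency() may return 0 if undetectable, so clamp to >= 1.
const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
ThreadPool cpu_pool(cores);      // compute-bound: serialize/deserialize
ThreadPool io_pool(cores * 4);   // IO-bound threads mostly block on the network
```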
@gongweibao @typhoonzero thanks for the feedback! I removed the retry code since the error can no longer be reproduced on the latest develop branch, and the retry logic needs more thought before we implement it. This PR now only does concurrent send/recv and serialization/deserialization.
LGTM! Just a reminder: parallel_for also uses the same thread pool, which is a singleton. When we need to do multi-thread + multi-node training, we must separate these two threadpools.
+1
LGTM++
Thanks! @typhoonzero @Yancey1989 I would not worry too much about the threadpool, because this PR for the most part only runs serialization and deserialization on the threadpool. The send and recv are scheduled internally by grpc, since @gongweibao implemented grpc async send/recv. I know there is a
Some explanations:
This PR adds a performance optimization: send/recv can use the threadpool to serialize/deserialize tensors. (Earlier, this PR also tried to add retry logic; now it only handles the threadpool performance optimization.)
Fix: #7801