Improve Fluid Distributed Training performance #8638
I'm checking.
From the TF benchmark's results: its code is here: the code only uses
I took a short look at /~https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/rpc/grpc_tensor_coding.cc#L136. To be clear, we don't need to call
ENV:
Use malloc:
Use cudaHostAlloc:
We have basically located the issue; the time is spent on memory copies:
The measurements for items 2, 3, and 4 fluctuate quite a bit.
We have to make sure that when we call gRPC the tensor data is not copied; the buffer should be sent directly. To achieve this, the current grpc_server/grpc_client need to be rewritten.
Testing; will try different serialization methods for variables.
Talked with @typhoonzero and fixed some bugs in the program.
Maybe we need a dedicated CUDA stream (or perhaps three: host-to-device, device-to-host, and device-to-device) just for copying tensors?
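A minimal sketch of that idea, assuming pinned host staging buffers; the stream and buffer names here are illustrative, not the actual Fluid implementation:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Dedicated streams for each copy direction, so tensor copies can overlap
  // with compute on the default stream.
  cudaStream_t h2d_stream, d2h_stream;
  cudaStreamCreateWithFlags(&h2d_stream, cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&d2h_stream, cudaStreamNonBlocking);

  const size_t bytes = 64 << 20;  // 64 MB tensor payload (arbitrary)
  void *send_host, *recv_host, *send_dev, *recv_dev;
  cudaHostAlloc(&send_host, bytes, cudaHostAllocDefault);  // pinned memory: required for truly async copies
  cudaHostAlloc(&recv_host, bytes, cudaHostAllocDefault);
  cudaMalloc(&send_dev, bytes);
  cudaMalloc(&recv_dev, bytes);

  // Before Send: pull the tensor to pinned host memory on the D2H stream.
  cudaMemcpyAsync(send_host, send_dev, bytes, cudaMemcpyDeviceToHost, d2h_stream);
  // After Recv: push the received tensor to the GPU on the H2D stream.
  cudaMemcpyAsync(recv_dev, recv_host, bytes, cudaMemcpyHostToDevice, h2d_stream);

  cudaStreamSynchronize(d2h_stream);
  cudaStreamSynchronize(h2d_stream);

  cudaFreeHost(send_host);
  cudaFreeHost(recv_host);
  cudaFree(send_dev);
  cudaFree(recv_dev);
  cudaStreamDestroy(h2d_stream);
  cudaStreamDestroy(d2h_stream);
  printf("copies issued on dedicated streams\n");
  return 0;
}
```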
Thanks, Weibao!
Serializing into protobuf takes a bit too long. It seems the biggest bottleneck now is neither the device-to-host memory copy nor gRPC, but protobuf serialization? @gongweibao @typhoonzero
Protobuf serialization is one part, and copying the user-level data into gRPC's buffers is the other. We have already found a solution that avoids copies 2 and 3. Judging from the measured time consumption, it should remove most of the performance overhead.
Nice! I didn't quite follow that code link. Could you explain how it is done? I don't quite understand how the copy can be avoided.
The data is split into two parts:
Nice, thanks! So gRPC natively supports byte slices.
There is no documentation; this TensorFlow example is the only one we found. Apart from the gRPC authors, probably very few people know about it.
A summary of the discussion above: the Send operator takes too long to encode the tensor content into a protobuf message and to copy that content into the gRPC buffer. We borrowed ideas from the TensorFlow implementation to accelerate these two steps.
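A rough sketch of the zero-copy encoding idea borrowed from TensorFlow's grpc_tensor_coding.cc, assuming the grpc::Slice / grpc::ByteBuffer C++ API; the function and argument names below are hypothetical, not the actual Fluid code:

```cpp
#include <grpcpp/support/byte_buffer.h>
#include <grpcpp/support/slice.h>

#include <string>
#include <vector>

// No-op release callback: we assume the caller keeps the tensor alive until
// the RPC completes (a real implementation would drop a reference here).
static void ReleaseTensorNoop(void* /*tensor_data*/) {}

// Build a ByteBuffer whose first slice holds the (small, copied) serialized
// protobuf header and whose second slice only *references* the tensor memory.
grpc::ByteBuffer EncodeTensor(const std::string& header_proto_bytes,
                              void* tensor_data, size_t tensor_bytes) {
  std::vector<grpc::Slice> slices;

  // Part 1: protobuf metadata (name, dtype, dims) -- copied, but tiny.
  slices.emplace_back(header_proto_bytes.data(), header_proto_bytes.size());

  // Part 2: the tensor payload, wrapped without a memcpy. gRPC calls the
  // destroy callback when it no longer needs the slice.
  slices.emplace_back(tensor_data, tensor_bytes, ReleaseTensorNoop,
                      /*user_data=*/tensor_data);

  return grpc::ByteBuffer(slices.data(), slices.size());
}
```

Sending a raw ByteBuffer like this, instead of letting a generated stub serialize a message, is presumably why the grpc_server/grpc_client rewrite mentioned above is needed: the RPC methods have to accept grpc::ByteBuffer request/response types so the payload slice can also be consumed without a user-level copy on the receiving side.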
The optimization can run in parallel. Running vgg16 with the flowers dataset (fc size 512), trainer per-batch time (seconds):
Here is the server-side calculation time (ms):
Some notes on testing distributed training with the pserver program running on GPU vs. CPU, in an environment without zero-copy gRPC:
Will add results once zero-copy gRPC is merged.
I did some experiments on the speed and throughput using a script.

ENV: TITAN X (Pascal), Driver Version: 390.25
Use malloc:
Use cudaHostAlloc:

ENV: P40 (Pascal), Driver Version: 384.66
Use malloc:
Use cudaHostAlloc:
Thanks for the data points!
ENV: K40m, CUDA 8.0, drivers: 390.12, 384.66
It seems that the hardware architecture, not the CUDA version or driver version, affects the copy speed.
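For reference, a minimal sketch of the kind of benchmark presumably behind the malloc vs. cudaHostAlloc numbers above; buffer size and iteration count are arbitrary:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time `iters` host-to-device copies from `host_ptr` and return GB/s.
static float BandwidthGBps(void* dev_ptr, void* host_ptr, size_t bytes, int iters) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    cudaMemcpy(dev_ptr, host_ptr, bytes, cudaMemcpyHostToDevice);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return (bytes * iters / 1e9f) / (ms / 1e3f);
}

int main() {
  const size_t bytes = 256 << 20;  // 256 MB per copy
  const int iters = 20;
  void* dev_ptr = nullptr;
  cudaMalloc(&dev_ptr, bytes);

  void* pageable = malloc(bytes);                       // ordinary pageable memory
  void* pinned = nullptr;
  cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);  // page-locked host memory

  printf("malloc       : %.2f GB/s\n", BandwidthGBps(dev_ptr, pageable, bytes, iters));
  printf("cudaHostAlloc: %.2f GB/s\n", BandwidthGBps(dev_ptr, pinned, bytes, iters));

  free(pageable);
  cudaFreeHost(pinned);
  cudaFree(dev_ptr);
  return 0;
}
```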
Latest updates: after finishing the above optimizations, running vgg16 with fc size 512, distributed training with GPU achieves 64% of the theoretical performance; when the fc size is increased to 4096, it drops to 33%. It seems the larger the tensor, the slower distributed training becomes. That means send_op and listen_and_serv_op take too much time transferring data over the network.
Progress:
It looks like using multiple connections, i.e. multiple channels, to connect to the server improves performance. I tested client/server throughput on a single machine: one process reaches about 1 GB/s; m clients paired one-to-one with m servers reach about 2 GB/s. It looks like another 50% of the single send/recv time can still be shaved off!
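A sketch of the multi-channel idea, not the actual Fluid client; the channel-argument key below is made up, and relying on a unique argument to stop gRPC from sharing one underlying TCP connection is an assumption, not a documented guarantee:

```cpp
#include <grpcpp/grpcpp.h>

#include <memory>
#include <string>
#include <vector>

// Create `n` channels to the same pserver endpoint. Each channel gets a
// unique, otherwise meaningless argument so gRPC does not collapse them
// onto a single shared connection.
std::vector<std::shared_ptr<grpc::Channel>> CreateChannels(
    const std::string& target, int n) {
  std::vector<std::shared_ptr<grpc::Channel>> channels;
  for (int i = 0; i < n; ++i) {
    grpc::ChannelArguments args;
    args.SetInt("fluid.channel_index", i);  // hypothetical key; only its uniqueness matters
    channels.push_back(grpc::CreateCustomChannel(
        target, grpc::InsecureChannelCredentials(), args));
  }
  return channels;
}

// Usage sketch: spread Send/Get requests round-robin over the channels,
// e.g. build each RPC stub from channels[request_id % channels.size()].
```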
Hello, this issue has had no updates in the past month and will be closed today. If you still need to follow up after it is closed, you can reopen it and we will reply within 24 hours. We apologize for any inconvenience caused by closing it. Thank you for supporting PaddlePaddle!
As shown in #8550, send_op takes too much of the time in GPU distributed training. Here are some things we need to do to improve the performance: