Threaded MKL for paddle #2379
Hi @wanglovesyang, I am curious about the Xeon Phi you use. Do you use it as the host CPU or as a co-processor? We have never run PaddlePaddle on Xeon Phi, and thank you for trying it. In PaddlePaddle, when you launch 10 trainers, 10 threads are created and each one is assigned a trainer. Thus we use a single-threaded GEMM implementation and link against the sequential MKL library (libmkl_sequential.so).
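For reference, a minimal sketch of the kind of link line this describes, assuming a typical MKL install layout; the variable names (MKL_ROOT, CBLAS_LIBRARIES) and paths here are illustrative assumptions, not Paddle's actual cblas.cmake:

```cmake
# Hypothetical sequential MKL link line (variable names and paths are assumptions).
set(MKL_ROOT $ENV{MKLROOT} CACHE PATH "Root of the MKL installation")

# With the sequential threading layer, every cblas_*gemm call runs entirely on
# the calling thread, so 10 trainer threads keep at most 10 cores busy no matter
# what MKL_NUM_THREADS is set to.
set(CBLAS_LIBRARIES
    ${MKL_ROOT}/lib/intel64/libmkl_intel_lp64.so
    ${MKL_ROOT}/lib/intel64/libmkl_sequential.so
    ${MKL_ROOT}/lib/intel64/libmkl_core.so)
```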
@Xreki I tried two thread settings, both with trainers=10:
However, both of these settings run slower than the single-threaded build, even though CPU usage is close to 100% most of the time.
No updates for a long time, so closing for now; if there are further updates, feel free to reopen.
I read cblas.cmake in Paddle and found that Paddle uses libmkl_sequential.so, which means that all matrix operations on the CPU are done by a single core (within one trainer). This is reasonable on common server nodes (128 GB RAM + 12 cores). However, I am currently using an Intel Xeon Phi CPU, which has 256 cores, and 128 GB of memory cannot hold 256 trainers if I want to make use of all the computing resources.
Hence, I switched to libmkl_intel_thread.so (by changing the cmake file) to parallelize Paddle's GEMM operations, so that I can reach 100% CPU usage while running 10 trainers. Unfortunately, training this way (about 1 h / pass at 100% CPU) is much slower than using libmkl_sequential.so with 10 trainers (0.5 h / pass at 5% CPU). This result makes no sense to me. Could anyone help me look into this problem?
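A sketch of the kind of substitution described above, for comparison with the sequential link line earlier in the thread. Again, the variable names and paths are assumptions rather than the actual contents of Paddle's cblas.cmake:

```cmake
# Hypothetical threaded MKL link line (variable names and paths are assumptions).
set(CBLAS_LIBRARIES
    ${MKL_ROOT}/lib/intel64/libmkl_intel_lp64.so
    ${MKL_ROOT}/lib/intel64/libmkl_intel_thread.so   # OpenMP-threaded GEMM
    ${MKL_ROOT}/lib/intel64/libmkl_core.so
    iomp5)   # Intel OpenMP runtime required by the threaded layer;
             # its directory must be on the linker search path.

# Note: with 10 trainers each spawning its own MKL/OpenMP thread team, a 256-core
# Phi can become heavily oversubscribed. Capping the team size per trainer
# (e.g. exporting MKL_NUM_THREADS to roughly cores / trainers before launching)
# is one way to check whether oversubscription explains the slowdown.
```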