use multi-thread eigen while run on mobile device #6751
paddle/capi/Main.cpp
Outdated
@@ -12,6 +12,12 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#ifdef _OPENMP
What is the difference between the _OPENMP and non-_OPENMP branches?
Sorry. Initially the check for whether OpenMP is in use lived here, but I later moved that branch into ThreadsNumManager. The _OPENMP branch here does need to be removed.
paddle/capi/Main.cpp
Outdated
@@ -44,6 +50,17 @@ paddle_error paddle_init(int argc, char** argv) {
return kPD_NO_ERROR;
}

paddle_error paddle_set_num_threads(int n) {
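These two interfaces may not be very necessary. In real scenarios, users generally do not know how many threads to set. How many threads to use for multi-threaded computation is something paddle should work out itself from the compute cost of each op.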
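Right, ideally the framework would adjust the thread count automatically based on the compute cost and the op type, but paddle cannot do that yet. I think this interface is still worth having; at the very least, performance testing would not require changing the code every time~~~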
paddle/function/EigenDevice.h
Outdated
namespace paddle {

int GetAndroidCpuCount();
No declaration is needed.
ok
paddle/function/EigenDevice.h
Outdated
int GetAndroidCpuCount();

int GetOSXCpuCount();
Same as above.
paddle/function/EigenGemm.cpp
Outdated
@@ -70,7 +72,11 @@ struct EigenBlasGemm {
dims[0].first = transA ? 0 : 1;
dims[0].second = transB ? 1 : 0;

Eigen::DefaultDevice device;
#if defined(__ANDROID__) || defined(__OSX__)
I see there is an EIGEN_USE_THREADS definition at compile time. Why not use that macro?
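My main consideration here was getting this working on mobile. On the server side, EIGEN_USE_THREADS could also be set to support multi-threaded computation, but how to set the thread count there should be considered together with the trainer count.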
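You can first modify EigenBlasGemm::compute to support multi-threading; the API interface part does not need to change for now.
@hedaoyuan I have made the changes and resubmitted. Please help review, thanks~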
paddle/capi/Main.cpp
Outdated
#include "capi_private.h"
#include "main.h"
#include "paddle/function/EigenDevice.h"
No change is needed here.
OK, will do.
Done
paddle/function/EigenDevice.h
Outdated
public:
static void Set(int n) {
#ifdef _OPENMP
omp_set_num_threads(n);
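What is the performance difference between these two ways of setting up multi-threading? I see that -fopenmp is not added to the compile options, so when would the _OPENMP path actually be used?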
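It is not used yet. Using OpenMP requires the g++ compiler, and code compiled with g++ is considerably slower than code compiled with clang. I previously tried using OpenMP multi-threading to optimize NeonDepthwiseConv, but NeonDepthwiseConv does not account for much of the run time, far less than the performance loss caused by switching to g++. So I dropped OpenMP but kept the OpenMP way of setting the thread count. In addition, Eigen also supports creating its thread pool directly from the configured OpenMP thread count. If we do not plan to use OpenMP in the future either, the _OPENMP branch here can indeed be removed.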
paddle/function/EigenDevice.cpp
Outdated
#include <sys/types.h>
#endif

// #include <android/log.h>
Please delete line 25.
OK
paddle/function/EigenDevice.cpp
Outdated
}
int rank0, rank1;
int num = fscanf(fp, "%d-%d", &rank0, &rank1);
// __android_log_print(ANDROID_LOG_DEBUG, "Paddle",
Please delete the unused code.
OK
Done
fix style and remove openmp support
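The diff context above reads two integers with the format "%d-%d" into rank0 and rank1. A minimal sketch of how a CPU count could be derived from such a range is given below. It assumes the input is a text range such as "0-3", or a single index such as "0". The helper name CpuCountFromRange is hypothetical, and the file that fp refers to in the PR is not shown here, so this is an illustration rather than the PR's actual implementation.

```cpp
#include <cstdio>

// Hypothetical helper (not from the PR): derive a CPU count from a range
// string such as "0-3". Returns 0 when the text cannot be parsed.
int CpuCountFromRange(const char* text) {
  int first = 0;
  int last = 0;
  // sscanf returns the number of fields it managed to convert.
  int matched = std::sscanf(text, "%d-%d", &first, &last);
  if (matched == 2) {
    // Both ends of the range were read; reject inverted ranges.
    return last >= first ? last - first + 1 : 0;
  }
  if (matched == 1) {
    // Only one index was present, e.g. "0" on a single-core system.
    return 1;
  }
  return 0;  // Nothing parsed (empty or malformed input).
}
```

Checking the return value of the scan, as above, keeps a malformed or empty input from leaving the two indices uninitialized.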
paddle/function/EigenDevice.cpp
Outdated
}
int rank0, rank1;
int num = fscanf(fp, "%d-%d", &rank0, &rank1);
// __android_log_print(ANDROID_LOG_DEBUG, "Paddle",
Done
paddle/capi/Main.cpp
Outdated
#include "capi_private.h"
#include "main.h"
#include "paddle/function/EigenDevice.h"
Done
cmake/configure.cmake
Outdated
@@ -38,7 +38,7 @@ if(NOT WITH_TIMER)
endif(NOT WITH_TIMER)

if(USE_EIGEN_FOR_BLAS)
add_definitions(-DPADDLE_USE_EIGEN_FOR_BLAS)
add_definitions(-DPADDLE_USE_EIGEN_FOR_BLAS -DEIGEN_USE_THREADS)
Remove -DEIGEN_USE_THREADS; the default should still be single-threaded computation.
Done
paddle/function/EigenGemm.cpp
Outdated
#ifdef EIGEN_USE_THREADS
const Eigen::ThreadPoolDevice& device = GetThreadPoolDevice();
#else
const Eigen::DefaultDevice device;
This branch fails to compile. Also, consider writing a separate multi-threaded Gemm interface here.
Done
paddle/function/EigenDevice.cpp
Outdated
#endif

const Eigen::ThreadPoolDevice& GetThreadPoolDevice() {
int num_threads = ThreadsNumManager::Get();
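There is no need to set the thread count equal to the number of CPU cores; on some 8-core or 10-core systems, performance actually gets worse. Consider simply setting num_threads to 2 or 4 here.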
Done, the maximum is now set to 2.
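As an illustration of the cap discussed in this thread, here is a minimal sketch that clamps a detected core count to at most 2. The names kMaxEigenThreads, ClampThreadCount, and DefaultThreadCount are hypothetical, and std::thread::hardware_concurrency() stands in for whatever platform-specific core detection the PR uses, so this is not the PR's actual code.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical constant mirroring the "maximum is 2" decision above.
constexpr int kMaxEigenThreads = 2;

// Clamp a detected core count into the range [1, kMaxEigenThreads].
int ClampThreadCount(int detected_cores) {
  return std::max(1, std::min(detected_cores, kMaxEigenThreads));
}

// Assumption: the core count comes from the C++ standard library.
// hardware_concurrency() may return 0 when the value is unknown,
// which the clamp turns into a single thread.
int DefaultThreadCount() {
  return ClampThreadCount(
      static_cast<int>(std::thread::hardware_concurrency()));
}
```

With this shape, an 8-core or 10-core device still gets a pool of 2 threads, and a failed detection falls back to 1.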
chenhoujiang seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
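Use multi-threaded Eigen on mobile devices to speed up inference. The figure below shows MobileNet test results with different thread counts (the test device is a standard-edition Xiaomi MI5, with two CPU cores locked at 1363MHz and the other two locked at 1401MHz):
For non-depthwise convolutions, Eigen gives a speedup of about 2x with two threads and about 3x with four threads. However, because nearly 140ms is consumed by batch normalization, the overall speedup is not very high.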
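The limit on the overall speedup can be written out as an Amdahl-style estimate. The symbols below are assumptions introduced only for illustration: T_conv is the single-thread time of the non-depthwise convolutions, s is their speedup (about 2 or 3 above), T_bn is the batch normalization time (nearly 140ms above), and T_other covers everything else. No totals are given in this thread, so no numeric result is claimed.

```latex
S_{\text{overall}}
  = \frac{T_{\text{conv}} + T_{\text{bn}} + T_{\text{other}}}
         {T_{\text{conv}}/s + T_{\text{bn}} + T_{\text{other}}},
\qquad
\lim_{s \to \infty} S_{\text{overall}}
  = \frac{T_{\text{conv}} + T_{\text{bn}} + T_{\text{other}}}
         {T_{\text{bn}} + T_{\text{other}}}
```

Since T_bn is not reduced by the Eigen thread pool, it stays in the denominator however large s becomes, which is why the end-to-end gain is smaller than the per-layer 2x or 3x.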