
use multi-thread eigen while run on mobile device #6751

Closed · wants to merge 16 commits

Conversation

@hjchen2 (Contributor) commented Dec 19, 2017

Use multi-threaded Eigen on mobile to speed up inference. The table below shows MobileNet results with different thread counts (test device: a standard Xiaomi MI5, with two CPU cores locked to 1363MHz and the other two locked to 1401MHz):

framework                  speed   cpu   memory   size
paddlepaddle               353ms   25%   210M     3M
paddlepaddle (2 threads)   290ms   42%   210M     3M
paddlepaddle (4 threads)   253ms   50%   210M     3M

For non-depthwise convolutions, Eigen with two threads gives roughly a 2x speedup and four threads roughly 3x, but since nearly 140ms is spent in batch normalization, the overall speedup is not that large.
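For background, the speedup comes from evaluating Eigen tensor contractions on a ThreadPoolDevice instead of the default single-threaded device. A minimal standalone sketch of that mechanism (not the PR's code; sizes and thread count are arbitrary):

#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // Two worker threads, matching the "2 threads" row of the table above.
  Eigen::ThreadPool pool(2);
  Eigen::ThreadPoolDevice device(&pool, /*num_cores=*/2);

  Eigen::Tensor<float, 2> a(256, 256), b(256, 256), c(256, 256);
  a.setRandom();
  b.setRandom();

  // Contract a's second dimension with b's first, i.e. a matrix multiply.
  Eigen::array<Eigen::IndexPair<int>, 1> dims = {{Eigen::IndexPair<int>(1, 0)}};
  // Assigning through .device(...) lets Eigen split the contraction across the pool.
  c.device(device) = a.contract(b, dims);
  return 0;
}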

@@ -12,6 +12,12 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#ifdef _OPENMP
Contributor:

What is the difference between the _OPENMP and non-_OPENMP branches?

Contributor Author:

Sorry, initially the check for whether OpenMP is used was done here, but that branch was later moved into ThreadsNumManager. The _OPENMP branch here does indeed need to be removed.

@@ -44,6 +50,17 @@ paddle_error paddle_init(int argc, char** argv) {
return kPD_NO_ERROR;
}

paddle_error paddle_set_num_threads(int n) {
Contributor:

These two interfaces may not be very necessary. In real scenarios, users generally don't know how many threads to set. How many threads to use for multi-threaded computation is something Paddle should work out itself based on each op's workload.

Contributor Author:

Right, ideally the framework would automatically adjust the number of threads based on the workload and the op's computation type, but Paddle can't do that yet. I still think this interface is worth having; at the very least it saves changing code every time when benchmarking performance~~~
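As an illustration only, a hedged sketch of how the proposed C-API call could be used from application code (paddle_init, paddle_set_num_threads and kPD_NO_ERROR come from the hunk above; the wrapper function and its error handling are assumptions):

#include "main.h"  // C-API header from this diff, assumed to declare both calls

int init_inference_engine(int argc, char** argv) {
  if (paddle_init(argc, argv) != kPD_NO_ERROR) {
    return -1;
  }
  // Hypothetical: request two worker threads for the Eigen-backed kernels,
  // e.g. to reproduce the rows of the benchmark table without recompiling.
  if (paddle_set_num_threads(2) != kPD_NO_ERROR) {
    return -1;
  }
  return 0;
}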


namespace paddle {

int GetAndroidCpuCount();
Contributor:

No need to declare this.

Contributor Author:

ok


int GetAndroidCpuCount();

int GetOSXCpuCount();
Contributor:

Same as above.

@@ -70,7 +72,11 @@ struct EigenBlasGemm {
dims[0].first = transA ? 0 : 1;
dims[0].second = transB ? 1 : 0;

Eigen::DefaultDevice device;
#if defined(__ANDROID__) || defined(__OSX__)
Contributor:

I see there is an EIGEN_USE_THREADS macro at compile time; why not use that macro?

Contributor Author:

My main concern here is getting it working on mobile. The server side could also define EIGEN_USE_THREADS to support multi-threaded computation, but that would need to be considered together with the trainer count to decide how to set the number of threads.

@hedaoyuan (Contributor) left a comment:

You can start by changing EigenBlasGemm::compute to support multi-threading; there is no need to change the API part for now.

@hjchen2 (Contributor Author) commented Dec 20, 2017

@hedaoyuan I've made the changes and resubmitted; please help review. Thanks~

#include "capi_private.h"
#include "main.h"
#include "paddle/function/EigenDevice.h"
Contributor:

No change is needed here.

Contributor Author:

OK, sure.

Contributor Author:

Done

public:
static void Set(int n) {
#ifdef _OPENMP
omp_set_num_threads(n);
Contributor:

What is the performance difference between these two ways of setting up multi-threading? I see -fopenmp isn't added to the compile options, so when would the _OPENMP path ever be used?

Contributor Author:

It isn't used yet, because using OpenMP requires compiling with g++, and g++-built binaries are considerably slower than clang. I previously tried using OpenMP multi-threading to optimize NeonDepthwiseConv, but NeonDepthwiseConv doesn't take much time, far less than the performance loss from switching to g++, so I dropped OpenMP while keeping the OpenMP way of setting the thread count. Also, Eigen supports creating its thread pool directly from the configured OpenMP thread count. If we don't plan to use OpenMP in the future, the _OPENMP branch here can indeed be removed.
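To make the two paths concrete, here is a rough sketch of a ThreadsNumManager along the lines discussed above (the _OPENMP branch mirrors the hunk; the non-OpenMP branch and the getter are assumptions, not the merged code):

#ifdef _OPENMP
#include <omp.h>
#endif

class ThreadsNumManager {
public:
  static void Set(int n) {
#ifdef _OPENMP
    omp_set_num_threads(n);  // only reachable when built with -fopenmp (g++)
#else
    num() = n;  // stored and read later when the Eigen thread pool is created
#endif
  }
  static int Get() { return num(); }

private:
  static int& num() {
    static int n = 1;  // default: single-threaded
    return n;
  }
};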

#include <sys/types.h>
#endif

// #include <android/log.h>
Contributor:

Please delete line 25.

Contributor Author:

OK.

}
int rank0, rank1;
int num = fscanf(fp, "%d-%d", &rank0, &rank1);
// __android_log_print(ANDROID_LOG_DEBUG, "Paddle",
Contributor:

Please delete the unused code.

Contributor Author:

OK.

Contributor Author:

Done
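For context, the hunk above parses a "lo-hi" range such as "0-3". A self-contained sketch of that pattern (the sysfs path and the fallback value are assumptions, not taken from the PR):

#include <cstdio>

int GetAndroidCpuCount() {
  FILE* fp = std::fopen("/sys/devices/system/cpu/possible", "r");
  if (fp == nullptr) return 1;  // fall back to a single core
  int rank0 = 0, rank1 = 0;
  int num = std::fscanf(fp, "%d-%d", &rank0, &rank1);
  std::fclose(fp);
  // A single number means one core; a "lo-hi" range means rank1 - rank0 + 1 cores.
  if (num < 1) return 1;
  return num == 1 ? 1 : rank1 - rank0 + 1;
}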

@hjchen2 (Contributor Author) left a comment:

fix style and remove openmp support

}
int rank0, rank1;
int num = fscanf(fp, "%d-%d", &rank0, &rank1);
// __android_log_print(ANDROID_LOG_DEBUG, "Paddle",
Contributor Author:

Done

#include "capi_private.h"
#include "main.h"
#include "paddle/function/EigenDevice.h"
Contributor Author:

Done

@@ -38,7 +38,7 @@ if(NOT WITH_TIMER)
endif(NOT WITH_TIMER)

if(USE_EIGEN_FOR_BLAS)
add_definitions(-DPADDLE_USE_EIGEN_FOR_BLAS)
add_definitions(-DPADDLE_USE_EIGEN_FOR_BLAS -DEIGEN_USE_THREADS)
Contributor:

Remove -DEIGEN_USE_THREADS; keep single-threaded computation as the default.

Contributor Author:

Done

#ifdef EIGEN_USE_THREADS
const Eigen::ThreadPoolDevice& device = GetThreadPoolDevice();
#else
const Eigen::DefaultDevice device;
Contributor:

This branch fails to compile. Also, consider writing a separate multi-threaded Gemm interface here.

Contributor Author:

Done
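To illustrate the "separate multi-threaded Gemm interface" suggestion, a hedged sketch built on Eigen's TensorMap (the real EigenBlasGemm::compute also handles transposes, strides, and alpha/beta; this simplified row-major signature is hypothetical). The caller would pass the device obtained from GetThreadPoolDevice():

#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>

// C = A(MxK) * B(KxN), all row-major, evaluated on the given thread-pool device.
template <class T>
void ThreadedGemm(const Eigen::ThreadPoolDevice& device,
                  int M, int N, int K,
                  const T* A, const T* B, T* C) {
  Eigen::TensorMap<Eigen::Tensor<const T, 2, Eigen::RowMajor>> a(A, M, K);
  Eigen::TensorMap<Eigen::Tensor<const T, 2, Eigen::RowMajor>> b(B, K, N);
  Eigen::TensorMap<Eigen::Tensor<T, 2, Eigen::RowMajor>> c(C, M, N);
  Eigen::array<Eigen::IndexPair<int>, 1> dims = {{Eigen::IndexPair<int>(1, 0)}};
  c.device(device) = a.contract(b, dims);
}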

#endif

const Eigen::ThreadPoolDevice& GetThreadPoolDevice() {
int num_threads = ThreadsNumManager::Get();
Contributor:

There's no need to set the thread count equal to the number of CPU cores; on some 8-core or 10-core systems, performance actually degrades. Consider just setting num_threads directly to 2 or 4 here.

Contributor Author:

Done, capped at a maximum of 2.
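A sketch of what "capped at 2" could look like (assumed details; ThreadsNumManager is the setter/getter from the earlier hunks, and the real merged code may differ): the pool is created once and never uses more than two threads, however many cores the phone reports.

#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>
#include <algorithm>

const Eigen::ThreadPoolDevice& GetThreadPoolDevice() {
  // Cap at 2: on 8- or 10-core phones more Eigen threads made things slower.
  static const int num_threads = std::min(ThreadsNumManager::Get(), 2);
  static Eigen::ThreadPool pool(num_threads);
  static Eigen::ThreadPoolDevice device(&pool, num_threads);
  return device;
}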

@CLAassistant commented May 24, 2018

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


chenhoujiang seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
