Add the first implementation of fusion_group op #19621
Conversation
Force-pushed from 93928ab to cfc91e1
Force-pushed from 2cbdb48 to 336179a
Force-pushed from 336179a to a1fd12e
platform::DeviceCodePool::Instance().Get(place, func_name);
VLOG(3) << "func_name: " << func_name;

if (type == 0) {
type==elementwise_relu
- Ops like elementwise_add/mul are regarded as binary elementwise operations, and ops like relu/sigmoid as unary elementwise operations, so elementwise_relu is not appropriate.
- The type attribute is mainly a placeholder for now; more computation patterns may be supported later. Currently type has no effect on the op's compute kernel; it is only used to decide how InferShape checks the dims of each input Tensor and infers the dims of the output Tensors.
- Let's use an int for now and unify it with the pass later, possibly as an enum type. Since only this one type is supported at the moment, the final design is still open.
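To make the shape rule in this reply concrete, here is a minimal standalone sketch (plain C++; the function name and DDim alias are illustrative, not the actual Paddle InferShape code): for type == 0 (elementwise), every input must share the same dims, and each output takes those dims.

#include <cassert>
#include <cstdint>
#include <vector>

using DDim = std::vector<int64_t>;  // stand-in for framework::DDim

// type == 0: elementwise pattern, the only one supported for now.
std::vector<DDim> InferElementwiseOutputDims(const std::vector<DDim>& in_dims,
                                             size_t num_outputs, int type) {
  assert(type == 0 && "only the elementwise pattern is supported for now");
  // All inputs of an elementwise fusion group must have identical dims.
  for (size_t i = 1; i < in_dims.size(); ++i) {
    assert(in_dims[i] == in_dims[0] && "elementwise inputs must share dims");
  }
  // Every output shares the dims of the (identical) inputs.
  return std::vector<DDim>(num_outputs, in_dims[0]);
}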
}

extern "C" __global__
void elementwise_cuda_kernel_0(size_t n, float *x, float* y, float* z) {
Renaming this to elementwise_relu_cuda_kernel would be easier to understand.
As above, I don't think elementwise_relu is appropriate.
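For reference, a plausible body for the generated kernel shown above. The fused add + relu computation is only a guess at what the unit test generates, and the grid-stride loop reflects the "Refine the CUDA kernel to support large dims" commit; the real generated source may differ.

// Hypothetical generated body: elementwise add followed by relu,
// using a grid-stride loop so that arbitrarily large n is handled
// with a fixed launch configuration.
extern "C" __global__
void elementwise_cuda_kernel_0(size_t n, float* x, float* y, float* z) {
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    float tmp = x[i] + y[i];
    z[i] = tmp > 0.0f ? tmp : 0.0f;
  }
}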
* Add the dynamic load of nvrtc, and support runtime compiling of CUDA kernel using nvrtc. test=develop
* Call CUDA driver api to launch the kernel compiled by nvrtc. test=develop
* Disable for mac and windows. test=develop
* Refine the codes to support manually specified num_threads and workload_per_thread. test=develop
* Refine the CUDA kernel to support large dims. test=develop
* Add DeviceCodePool to manage all device codes.
* Add the first implementation of fusion_group op.
* Add unit-test for fusion_group op.
* Add the check of result.
* Add the check of nvrtc in unit-test. test=develop
* Add comment to explain the inputs, outputs and features of fusion_group op. test=develop
* Disable fusion_group op for mac and windows. test=develop
* Make the compiling of device code return status instead of hanging up. test=develop
* Add the check of whether there is CUDA driver library, and do not core dump when failing to call the CUDA driver API.
* Unify fusion_group_op's input and output names. test=develop
* Add the check of CUDA driver library in unittest. test=develop
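The commits above describe the standard NVRTC + CUDA driver flow: compile a CUDA C string to PTX at runtime, then load and launch it with the driver API. Below is a self-contained sketch of that flow. It is not Paddle's DeviceCode/DeviceCodePool code (which also loads libnvrtc and libcuda dynamically); the kernel and variable names are illustrative.

// Build (assumption): nvcc -o runtime_compile runtime_compile.cc -lnvrtc -lcuda
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <string>

static const char* kSource = R"(
extern "C" __global__ void scale_kernel(size_t n, float* x, float* y) {
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    y[i] = 2.0f * x[i];
  }
})";

int main() {
  // 1. Compile the source string to PTX with NVRTC.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, nullptr, nullptr);
  if (nvrtcCompileProgram(prog, 0, nullptr) != NVRTC_SUCCESS) {
    size_t log_size;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, &log[0]);
    std::printf("compile failed:\n%s\n", log.c_str());
    return 1;  // report failure instead of aborting, as DeviceCode::Compile() does
  }
  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, &ptx[0]);
  nvrtcDestroyProgram(&prog);

  // 2. Load the PTX and launch the kernel with the CUDA driver API.
  cuInit(0);
  CUdevice dev;
  CUcontext ctx;
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);

  size_t n = 1024;
  CUdeviceptr x, y;
  cuMemAlloc(&x, n * sizeof(float));
  cuMemAlloc(&y, n * sizeof(float));

  CUmodule module;
  CUfunction func;
  cuModuleLoadData(&module, ptx.c_str());
  cuModuleGetFunction(&func, module, "scale_kernel");

  void* args[] = {&n, &x, &y};
  cuLaunchKernel(func, /*grid=*/4, 1, 1, /*block=*/256, 1, 1,
                 /*sharedMemBytes=*/0, /*stream=*/nullptr, args, nullptr);
  cuCtxSynchronize();

  cuMemFree(x);
  cuMemFree(y);
  cuModuleUnload(module);
  cuCtxDestroy(ctx);
  return 0;
}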
Force-pushed from 1d0c120 to f040c70
Force-pushed from f040c70 to cf97946
}

std::string func_name = ctx.Attr<std::string>("func_name");
platform::DeviceCode* dev_code =
Does this dev_code need to be looked up on every Compute()? It should not change once the network has been built; if there are many of them, won't the lookup be slow?
Each Compute() fetches it from the pool, which is just a map access. Compared with the other map accesses in the program, this one should be minor, so the overhead should be fine. If it turns out to be a performance problem later, we can improve it then, for example by caching the DeviceCode pointer directly in the op's information.
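A standalone sketch of the lookup pattern being discussed: a singleton pool that maps (device, function name) to compiled device code, so each Compute() call costs one map access. The class and key layout are illustrative only; the real DeviceCodePool keys on platform::Place and stores DeviceCode objects.

#include <memory>
#include <string>
#include <unordered_map>

struct CompiledKernel { /* module/function handles, etc. */ };

class KernelPool {
 public:
  static KernelPool& Instance() {
    static KernelPool pool;  // one pool per process
    return pool;
  }

  // Called from every Compute(): a single hash-map lookup.
  CompiledKernel* Get(int device_id, const std::string& func_name) {
    auto it = kernels_.find(Key(device_id, func_name));
    return it == kernels_.end() ? nullptr : it->second.get();
  }

  // Called once per generated kernel, after compilation.
  void Set(int device_id, const std::string& func_name,
           std::unique_ptr<CompiledKernel> kernel) {
    kernels_[Key(device_id, func_name)] = std::move(kernel);
  }

 private:
  static std::string Key(int device_id, const std::string& func_name) {
    return std::to_string(device_id) + "/" + func_name;
  }
  std::unordered_map<std::string, std::unique_ptr<CompiledKernel>> kernels_;
};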
LGTM
Work in this PR:
- DeviceCode's Compile(): when a call to a CUDA driver function or an NVRTC function fails, it no longer exits directly but returns false; Launch() still exits directly if a CUDA driver call fails.
- Implementation of DeviceCodePool, which manages all generated kernel functions.
- … the Launch() interface.
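A sketch of the error-handling split described in the first item: Compile() reports failure through its return value so callers can fall back to the unfused path, while Launch() treats a driver failure as fatal. The class name and placeholder helpers are illustrative, not the actual Paddle code.

#include <cstdio>
#include <cstdlib>

class CUDADeviceCodeSketch {
 public:
  bool Compile() {
    // Any failed NVRTC or CUDA driver call makes Compile() return false
    // instead of terminating the process.
    if (!CompileToPTXOrFail()) {
      std::fprintf(stderr,
                   "warning: NVRTC compilation failed, fusion_group will not be used\n");
      return false;
    }
    is_compiled_ = true;
    return true;
  }

  void Launch(size_t n, void** args) {
    // Launching an already-compiled kernel is expected to succeed;
    // a CUDA driver failure here is still treated as fatal.
    if (!is_compiled_ || !LaunchOrFail(n, args)) {
      std::fprintf(stderr, "fatal: failed to launch device code\n");
      std::abort();
    }
  }

 private:
  bool CompileToPTXOrFail() { return true; }           // placeholder
  bool LaunchOrFail(size_t, void**) { return true; }   // placeholder
  bool is_compiled_ = false;
};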