Refine concat_op #8669
Conversation
force-pushed from 0b7e00f to b3824ae
force-pushed from b3824ae to 1583e78
force-pushed from be8d8ae to 1b09ceb
force-pushed from 1b09ceb to b5320e4
force-pushed from b5320e4 to 00e596e
// get input's cols
std::vector<int64_t> input_cols(input.size());
for (int i = 0; i < num; ++i) {
/~https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/strided_memcpy.h#L53
This functor does the same thing. Should we reuse it, or delete the duplicate?
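For readers following the diff: the loop in question appears to record each input's flattened column width so the copy loop below can step through the output row by row. A minimal sketch under that assumption (the `rows` variable and the tensor API here are hypothetical, not the PR's exact code):

```cpp
// Fold every dimension after the concat axis into one column count per
// input, and accumulate the total output width.
int64_t total_cols = 0;
std::vector<int64_t> input_cols(input.size());
for (size_t i = 0; i < input.size(); ++i) {
  input_cols[i] = input[i].numel() / rows;  // rows assumed equal across inputs
  total_cols += input_cols[i];
}
```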
// assume the max size of input is less than 8 and see the performance
// save origin dim
int num = outputs.size();
std::vector<paddle::framework::DDim> origin_dim(num);
Please delete the `paddle::` namespace prefix.
origin_dim has been removed.
auto& cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());

// computation
for (int k = 0; k < input_rows; ++k) {
Same as above: it is redundant with `StridedMemcpyWithAxis`.
This functor doesn't process GPU data, so it is not redundant with `StridedMemcpyWithAxis`.
For GPU data, the functor is slower than `StridedMemcpyWithAxis` in some cases.
int grid_cols =
    std::min((out_cols + block_cols - 1) / block_cols, max_blocks);
int grid_rows =
    std::min(max_blocks / grid_cols, std::max(out_rows / block_rows, 1));
dim3 grid_size = dim3(grid_cols, grid_rows, 1);
First of all, this calculation is correct. My only concern is whether we truly need the multiprocessor information.
In fact, I tested occupancy on a Titan X (Pascal architecture) with different grid and block sizes, and found that if the block size is >= 256 and the grid size uses the upper bound of the input size, I get a near-100% occupancy rate in my test.
In short, as long as the block size is >= 256 (8 times the warp size), GPU occupancy and performance will be fine.
My test code is modified from /~https://github.com/zchee/cuda-sample/blob/master/0_Simple/simpleOccupancy/simpleOccupancy.cu.
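For reference, a minimal sketch of this kind of occupancy measurement, modeled on the simpleOccupancy sample linked above with a placeholder kernel (this is not the author's actual test code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void DummyConcatKernel() {}  // stand-in for the real kernel

// Report theoretical occupancy for a given block size.
void ReportOccupancy(int block_size) {
  int max_blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &max_blocks_per_sm, DummyConcatKernel, block_size,
      /*dynamicSMemSize=*/0);

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);

  int active_warps = max_blocks_per_sm * block_size / prop.warpSize;
  int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
  std::printf("block_size=%d occupancy=%.1f%%\n", block_size,
              100.0 * active_warps / max_warps);
}
```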
Furthermore, if we really need GPU device attributes, we can store this information in DeviceContext. It is device-specific, and caching it would save a lot of heavyweight device queries.
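A hypothetical sketch of that suggestion (not Paddle's actual DeviceContext API): query the device properties once at construction and hand out cached values instead of re-querying on every kernel launch.

```cpp
#include <cuda_runtime.h>

class CachedDeviceContext {
 public:
  explicit CachedDeviceContext(int device_id) {
    cudaGetDeviceProperties(&prop_, device_id);  // one heavyweight query
  }
  int MultiProcessorCount() const { return prop_.multiProcessorCount; }
  int MaxThreadsPerMultiProcessor() const {
    return prop_.maxThreadsPerMultiProcessor;
  }

 private:
  cudaDeviceProp prop_;  // cached; the device doesn't change afterwards
};
```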
> the grid size use upper bound of input size

Please remember that the grid size is not unlimited. Assuming the inputs are `x1` and `x2`, each with shape (100, 1000000), and the axis is 0, the kernel will fail; you can give it a try.

> get near 100% occupancy rate

My experience is that, in some cases, 100% occupancy doesn't mean the kernel is efficient. My test code is https://gist.github.com/chengduoZH/bc20fa8c12f8f045b74240dcdad41c84 . I also ran some experiments and pasted the results in the comment.

> Furthermore, if we really need the attribute of GPU Device, we can store this information in DeviceContext.

Agreed.
force-pushed from f8ada6b to 9ea6113
force-pushed from 9ea6113 to 131ec27
force-pushed from bb7abf6 to c3864ea
Some experiments about concat_op

There are three cases of concat.

Experimental data

Experimental method and result: the inputs are two tensors of the same shape (x1, x2). According to the axis, we reshape both tensors into 2 dimensions (so that every tensor has the same number of rows), repeat the concat operation 300 times, and take the average time.

Analysis of the experimental result: assuming the shape of the input data is N x M, when M is far greater than N, direct copying takes less time; in the other cases, the CUDA kernel takes less time. The current implementation of concat therefore uses both strategies, direct copying and a CUDA kernel: when the axis is zero and the number of inputs is less than 10, concat_op copies directly; otherwise it launches the CUDA kernel.
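A hypothetical sketch of that dispatch rule (the names and the float-only direct-copy helper are illustrative; the real operator code differs):

```cpp
#include <cstring>
#include <vector>

// When axis == 0 and there are few inputs, each input occupies one
// contiguous block of the output, so a single memcpy per input suffices.
void ConcatAxis0Direct(const std::vector<const float*>& ins,
                       const std::vector<size_t>& numels, float* out) {
  size_t offset = 0;
  for (size_t i = 0; i < ins.size(); ++i) {
    std::memcpy(out + offset, ins[i], numels[i] * sizeof(float));
    offset += numels[i];
  }
}

// Dispatch rule described above: direct copy for axis == 0 with fewer
// than 10 inputs; otherwise fall through to a CUDA kernel launch.
bool UseDirectCopy(int axis, int num_inputs) {
  return axis == 0 && num_inputs < 10;
}
```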
template class ConcatGradFunctor<platform::CPUDeviceContext, int64_t>;
template class ConcatGradFunctor<platform::CPUDeviceContext, float>;
template class ConcatGradFunctor<platform::CPUDeviceContext, double>;
Why do we need a functor? We could just write the code directly in the op's .cc/.cu files.
Currently, `ConcatFunctor` is only used by concat_op, but I think it can be used by SplitOp too; ConcatOp and SplitOp are opposite operations. That is why I defined concat as a functor.
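A sketch of the functor shape being described (hypothetical signatures; see paddle/fluid/operators/math for the real declarations):

```cpp
// Templating on DeviceContext lets concat_op and a future SplitOp share
// one implementation per device.
template <typename DeviceContext, typename T>
class ConcatFunctor {
 public:
  // Gather many input tensors into one output along `axis`.
  void operator()(const DeviceContext& context,
                  const std::vector<framework::Tensor>& inputs, int axis,
                  framework::Tensor* output);
};

template <typename DeviceContext, typename T>
class ConcatGradFunctor {
 public:
  // The inverse: scatter one tensor into many outputs along `axis`.
  void operator()(const DeviceContext& context,
                  const framework::Tensor& input, int axis,
                  std::vector<framework::Tensor>* outputs);
};
```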
Fixes #8567
Analysis of the concat operation

The input is a list of tensors plus an `axis` that indicates the concatenation axis. The input tensors may have any shape, but only the dimension along `axis` may differ. For example, suppose the input is two tensors, t_a and t_b (the shapes below are inferred from the reshapes in each case); a sketch that checks the resulting copy counts follows this list.

axis = 0, with t_a of shape [9, 2, 3, 4] and t_b of shape [3, 2, 3, 4]: obviously, the output's shape is [12, 2, 3, 4]. To solve this case simply, we can reshape t_a to [9, 24] and t_b to [3, 24], then concatenate the two tensors vertically; the reshaped output is [12, 24]. In this case we copy only 2 times.

axis = 1, with t_a of shape [9, 2, 3, 4] and t_b of shape [9, 3, 3, 4]: we can reshape t_a to [9, 2, 12] and t_b to [9, 3, 12], then concatenate them along the second axis; the reshaped output is [9, 5, 12]. In this case we must copy 18 times.

axis = 3, with t_a of shape [9, 2, 3, 4] and t_b of shape [9, 2, 3, 3]: we first reshape t_a to [54, 4] and t_b to [54, 3], then concatenate them horizontally; the reshaped output is [54, 7]. This is the worst case: we must copy 108 times.
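A small sketch to check those copy counts: after flattening to 2-D around `axis`, each input contributes one copy per flattened row, so copies = num_inputs x prod(dims before axis). This is just arithmetic from the cases above, not operator code:

```cpp
#include <cstdint>
#include <vector>

int64_t ConcatCopyCount(const std::vector<int64_t>& dims, int axis,
                        int num_inputs) {
  int64_t rows = 1;
  for (int i = 0; i < axis; ++i) rows *= dims[i];  // rows of the 2-D view
  return num_inputs * rows;                        // one copy per input row
}

// With dims {9, 2, 3, 4} and two inputs:
//   axis = 0 -> 2 copies, axis = 1 -> 18 copies, axis = 3 -> 108 copies.
```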
TODO