
optimize logsumexp in small data scale #52952

Merged: 27 commits into PaddlePaddle:develop on Jun 5, 2023

Conversation

@Asthestarsfalll (Contributor) commented Apr 16, 2023

PR types

Performance optimization

PR changes

OPs

Description

optimize logsumexp in small data scale

The idea: each thread processes ColsPerThread elements. When the data scale is so small that too few thread groups would be launched, each thread additionally processes multiple rows to improve instruction-level parallelism.
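A minimal sketch of the scheme just described, reconstructed from the description (the kernel name, bounds handling, and launch geometry are assumptions, not the PR's actual code): one warp handles RowsPerThread consecutive rows, each lane caches ColsPerThread elements per row (assuming num_cols <= 32 * ColsPerThread), and warp shuffles reduce the max and the sum using the numerically stable log(sum(exp(x - max))) + max form.

```cuda
#include <cfloat>

template <int ColsPerThread, int RowsPerThread>
__global__ void LogsumexpWarpKernel(const float* in, float* out,
                                    int num_rows, int num_cols) {
  constexpr int kWarpSize = 32;
  const int lane = threadIdx.x % kWarpSize;
  const int warp = (blockIdx.x * blockDim.x + threadIdx.x) / kWarpSize;
  const int cur_row = warp * RowsPerThread;

  float buf[RowsPerThread][ColsPerThread];
  float warp_max[RowsPerThread];
  float warp_sum[RowsPerThread];

#pragma unroll
  for (int r = 0; r < RowsPerThread; ++r) {
    const int row = cur_row + r;
    // Each lane caches its ColsPerThread elements and tracks a running max.
    float m = -FLT_MAX;
#pragma unroll
    for (int c = 0; c < ColsPerThread; ++c) {
      const int col = c * kWarpSize + lane;
      buf[r][c] = (row < num_rows && col < num_cols)
                      ? in[row * num_cols + col]
                      : -FLT_MAX;  // padding that vanishes under exp()
      m = fmaxf(m, buf[r][c]);
    }
    // Butterfly-reduce the row max across the warp.
    for (int off = kWarpSize / 2; off > 0; off /= 2)
      m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, off));
    warp_max[r] = m;
    // Accumulate exp(x - max), then reduce the sum the same way.
    float s = 0.f;
#pragma unroll
    for (int c = 0; c < ColsPerThread; ++c) s += expf(buf[r][c] - m);
    for (int off = kWarpSize / 2; off > 0; off /= 2)
      s += __shfl_xor_sync(0xffffffffu, s, off);
    warp_sum[r] = s;
    // log(sum) + max recovers logsumexp, matching the PR's write-out line.
    if (lane == 0 && row < num_rows)
      out[cur_row + r] = logf(warp_sum[r]) + warp_max[r];
  }
}
```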

Current forward performance (averaged over 1000 runs):

| Case No. | Device | input_shape | input_type | New Paddle Perf (ms) | vs. original Paddle | vs. PyTorch |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Tesla V100 | [64L, 64L] | float32 | 0.003245 | 1555.2% faster | 840.06% faster |
| 2 | Tesla V100 | [1024L, 512L] | float32 | 0.004887 | 14769.2% faster | 696.6% faster |
| 3 | Tesla V100 | [64L, 64L] | float16 | 0.0032332 | 1517.9% faster | 875.2% faster |
| 4 | Tesla V100 | [1024L, 512L] | float16 | 0.0045824 | 15773.3% faster | 715.7% faster |

Related PR: #52509
Speedup is computed as (old_time - new_time) / new_time.
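As a quick sanity check on the formula (this arithmetic is mine, not from the PR): Case 1's "1555.2% faster" implies old_time = new_time * (1 + 15.552) ≈ 0.003245 ms * 16.552 ≈ 0.0537 ms for the original Paddle kernel.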

@paddle-bot (bot) commented Apr 16, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot added labels contributor (External developers) and status: proposed on Apr 16, 2023
@Asthestarsfalll (Contributor, Author) commented:

@JamesLim-sy could you review this first?

@JamesLim-sy (Contributor) commented:

> @JamesLim-sy could you review this first?

I've been a bit busy these past couple of days; I'll post my review suggestions tonight.

```cuda
HANDLE_THREAD_GROUP(29)
HANDLE_THREAD_GROUP(30)
HANDLE_THREAD_GROUP(31)
HANDLE_THREAD_GROUP(32)
```
@JamesLim-sy (Contributor) commented Apr 25, 2023

This unrolling is a bit brute-force. Could RowsPerThread be passed as a runtime argument rather than a template parameter, with the __global__ kernel allocating its local array directly at [max value of RowsPerThread, max value of ColsPerThread]? One thing still worth checking: whether the double type would then consume an excessive amount of local memory.
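A rough sketch of this suggestion (illustrative only; the names, the maxima of 32, and the reduction details are assumptions, not the PR's code): the loop bound becomes a runtime value, so a single instantiation replaces the HANDLE_THREAD_GROUP(1..32) ladder, while the per-thread cache is statically sized to the worst case.

```cuda
#include <cfloat>

constexpr int kMaxRowsPerThread = 32;  // assumed upper bounds; the real
constexpr int kMaxColsPerThread = 32;  // limits would come from the PR

template <typename T>
__global__ void LogsumexpDynKernel(const T* in, T* out, int num_rows,
                                   int num_cols, int rows_per_thread) {
  constexpr int kWarpSize = 32;
  const int lane = threadIdx.x % kWarpSize;
  const int warp = (blockIdx.x * blockDim.x + threadIdx.x) / kWarpSize;

  // Worst-case-sized cache (assumes num_cols <= 32 * kMaxColsPerThread).
  // Per thread this is 32 * 32 * sizeof(T) bytes: 4 KiB for float but
  // 8 KiB for double, which is exactly the local-memory concern above.
  T buf[kMaxRowsPerThread][kMaxColsPerThread];

  for (int r = 0; r < rows_per_thread; ++r) {  // runtime bound, no template
    const int row = warp * rows_per_thread + r;
    if (row >= num_rows) return;
    T m = -FLT_MAX;
    for (int c = lane, i = 0; c < num_cols; c += kWarpSize, ++i) {
      buf[r][i] = in[row * num_cols + c];
      m = buf[r][i] > m ? buf[r][i] : m;
    }
    for (int off = kWarpSize / 2; off > 0; off /= 2) {
      T other = __shfl_xor_sync(0xffffffffu, m, off);
      m = other > m ? other : m;
    }
    T s = 0;
    for (int c = lane, i = 0; c < num_cols; c += kWarpSize, ++i)
      s += exp(buf[r][i] - m);
    for (int off = kWarpSize / 2; off > 0; off /= 2)
      s += __shfl_xor_sync(0xffffffffu, s, off);
    if (lane == 0) out[row] = log(s) + m;
  }
}
```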

@Asthestarsfalll (Author) replied:

If the kernel launch fails, fall back to running LogsumexpFallbackKernel?
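One common shape for such a fallback path, as a sketch only (the dispatch function is hypothetical, and LogsumexpFallbackKernel's real signature lives in the PR; a matching one is assumed here): launch the optimized kernel, and if the launch itself fails, for example from requesting too many resources, clear the error and run the fallback.

```cuda
template <typename T>
void DispatchLogsumexp(const T* in, T* out, int num_rows, int num_cols,
                       dim3 grid, dim3 block, int rows_per_thread) {
  LogsumexpDynKernel<T><<<grid, block>>>(in, out, num_rows, num_cols,
                                         rows_per_thread);
  if (cudaGetLastError() != cudaSuccess) {
    // The launch failed (e.g. too many resources requested). Reading the
    // error above also cleared it, so the fallback launches cleanly.
    LogsumexpFallbackKernel<T><<<grid, block>>>(in, out, num_rows, num_cols);
  }
}
```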

@paddle-ci-bot (bot) commented Apr 29, 2023

Sorry to inform you that d0b8f5d's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

```cuda
out[cur_row + row_id] =
    static_cast<SourceType>(log(warp_sum[row_id]) + warp_max[row_id]);
  }
}
```
@JamesLim-sy (Contributor) commented:

The writes here are non-contiguous. Could this be split so that the vectorizable portion is written out contiguously with vectorized stores, and only the non-vectorizable remainder uses non-contiguous writes? Alternatively, have threadIdx_0 write data_0, data_32, data_64 and threadIdx_1 write data_1, data_33, data_65, and so on, to avoid each thread writing with stride = RowsPerThread.
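A sketch of the second pattern described above (illustrative, not the PR's code; it assumes the per-row results have first been staged into a buffer visible to the whole warp, such as shared memory):

```cuda
constexpr int kWarpSize = 32;

// Lane-strided write-out: lane 0 stores elements 0, 32, 64, ...; lane 1
// stores 1, 33, 65, ... Consecutive lanes hit consecutive addresses, so
// each batch of 32 stores coalesces, instead of each thread writing its
// own rows at stride RowsPerThread from its neighbors.
__device__ void CoalescedWriteOut(float* out, const float* staged_results,
                                  int cur_row, int rows_this_group,
                                  int lane) {
  for (int i = lane; i < rows_this_group; i += kWarpSize) {
    out[cur_row + i] = staged_results[i];
  }
}
```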

@Asthestarsfalll (Author) replied:

Could you explain what you mean here? I don't quite follow.

@Asthestarsfalll (Author) commented:

@JamesLim-sy CI has passed; please take a look.

@luotao1 (Contributor) commented May 24, 2023

[CI screenshot] The ROCM pipeline failed to compile.

@luotao1 (Contributor) left a comment:

#51835 (comment)
That PR also fixed a ROCM compilation issue; check whether anything there can be reused.

@Asthestarsfalll (Author) commented:

@luotao1 @JamesLim-sy CI has passed.

@Asthestarsfalll (Author) commented:

emmmm

@paddle-ci-bot (bot) commented Jun 1, 2023

Sorry to inform you that e518f38's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@luotao1 (Contributor) commented Jun 1, 2023

@Asthestarsfalll Please merge develop again and re-run the CI.

@Asthestarsfalll (Author) commented:

@JamesLim-sy @luotao1 CI has passed.

@Asthestarsfalll (Author) commented:

@JamesLim-sy PTAL

@JamesLim-sy (Contributor) left a review:

LGTM, Great job!

@JamesLim-sy JamesLim-sy merged commit 93e1bb9 into PaddlePaddle:develop Jun 5, 2023
@Asthestarsfalll Asthestarsfalll deleted the optimize_logsumexp branch August 19, 2023 06:38