optimize logsumexp in small data scale #52952
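Your PR was submitted successfully. Thank you for contributing to this open-source project!
@JamesLim-sy Could you take a first look at this?
I have been a bit busy these past two days; I will post my review suggestions tonight.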
HANDLE_THREAD_GROUP(29)
HANDLE_THREAD_GROUP(30)
HANDLE_THREAD_GROUP(31)
HANDLE_THREAD_GROUP(32)
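This expansion is rather brute-force. Could RowsPerThread be passed in as a regular argument instead of a template parameter, with the local array in the __global__ kernel sized directly to the max value of RowsPerThread and the max value of ColsPerThread? There is one more thing to watch, though: whether the double type would cause excessive local memory usage.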
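The suggestion above can be sketched on the host side as follows. This is only an illustration, not the PR's code: `kMaxRowsPerThread`, `kMaxColsPerThread`, `LocalBufferBytes`, and `ThreadRowMax` are hypothetical names, and the bounds 4 and 32 are assumed values chosen for the example. The per-thread buffer is sized to the maxima, while the row and column counts are ordinary runtime arguments, so a single instantiation serves every configuration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical upper bounds (assumed for illustration; the PR's limits may differ).
constexpr int kMaxRowsPerThread = 4;
constexpr int kMaxColsPerThread = 32;

// Per-thread scratch size when the buffer is sized to the maxima.
// For double this is twice the float footprint, which is the reviewer's concern.
template <typename T>
constexpr std::size_t LocalBufferBytes() {
  return sizeof(T) * kMaxRowsPerThread * kMaxColsPerThread;
}

// Host-side stand-in for one thread's work: rows_per_thread and
// cols_per_thread are runtime values, not template parameters.
template <typename T>
std::vector<T> ThreadRowMax(const T* in, int stride, int rows_per_thread,
                            int cols_per_thread) {
  std::vector<T> result;
  if (rows_per_thread < 1 || rows_per_thread > kMaxRowsPerThread ||
      cols_per_thread < 1 || cols_per_thread > kMaxColsPerThread) {
    return result;  // configuration does not fit the fixed-size buffer
  }
  T buf[kMaxRowsPerThread][kMaxColsPerThread];  // sized to the maxima
  for (int r = 0; r < rows_per_thread; ++r) {
    for (int c = 0; c < cols_per_thread; ++c) {
      buf[r][c] = in[r * stride + c];
    }
  }
  for (int r = 0; r < rows_per_thread; ++r) {
    T m = buf[r][0];
    for (int c = 1; c < cols_per_thread; ++c) m = std::max(m, buf[r][c]);
    result.push_back(m);
  }
  return result;
}
```

With these assumed bounds, the buffer takes 512 bytes per thread for float and 1024 bytes for double on platforms with 4-byte float and 8-byte double. On a GPU, an array of that size indexed with runtime values may be placed in local memory rather than registers.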
If the kernel launch fails, should execution then fall back to LogsumexpFallbackKernel?
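A minimal sketch of that control flow, assuming the fast path can report whether it launched successfully. `RunWithFallback`, `try_fast`, and `fallback` are hypothetical names for this example; in real CUDA code the failure would typically be detected by checking the error status after the launch (for example with `cudaGetLastError`).

```cpp
#include <functional>

// Tries the specialized path first and runs the generic path only on failure.
// Returns true when the fast path ran, false when the fallback was used.
bool RunWithFallback(const std::function<bool()>& try_fast,
                     const std::function<void()>& fallback) {
  if (try_fast()) {
    return true;
  }
  fallback();
  return false;
}
```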
Sorry to inform you that d0b8f5d's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
    out[cur_row + row_id] =
        static_cast<SourceType>(log(warp_sum[row_id]) + warp_max[row_id]);
  }
}
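The final expression in the snippet above, `log(sum) + max`, is the usual numerically stable form of logsumexp: the row maximum is subtracted before exponentiating and added back after the logarithm. A minimal single-row reference in plain C++ (the function name `LogSumExpRow` is made up for this sketch and is not from the PR):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Stable logsumexp of one row: log(sum_i exp(x_i - max)) + max.
double LogSumExpRow(const std::vector<double>& row) {
  if (row.empty()) {
    return -std::numeric_limits<double>::infinity();  // log of an empty sum
  }
  const double row_max = *std::max_element(row.begin(), row.end());
  if (std::isinf(row_max)) {
    return row_max;  // avoids inf - inf producing NaN
  }
  double sum = 0.0;
  for (double x : row) {
    sum += std::exp(x - row_max);  // every term is in (0, 1]
  }
  return std::log(sum) + row_max;
}
```

Without the subtraction, an input such as 1000 would overflow in `exp` even though the true result, 1000 + log 2 for two such elements, is perfectly representable.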
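I noticed that the writes here are not contiguous. Could this be changed so that the vectorizable part is written out contiguously with vectorized stores, while the part that cannot be vectorized is written non-contiguously? Alternatively, threadIdx_0 could write data_0, data_32, data_64 and threadIdx_1 could write data_1, data_33, data_65, and so on, which avoids each thread writing with stride = RowsPerThread.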
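The two write patterns being contrasted can be sketched as index mappings. The helper names `BlockedIndex` and `InterleavedIndex` are hypothetical, and a group of 32 threads is assumed, matching the data_0, data_32, data_64 example in the comment.

```cpp
constexpr int kWarpSize = 32;  // assumed thread-group width

// Blocked layout: thread `lane` owns consecutive rows, so on write step k
// neighbouring threads touch addresses rows_per_thread apart.
int BlockedIndex(int lane, int k, int rows_per_thread) {
  return lane * rows_per_thread + k;
}

// Interleaved layout: thread `lane` writes lane, lane + 32, lane + 64, ...
// so on every step neighbouring threads touch adjacent addresses.
int InterleavedIndex(int lane, int k) {
  return lane + k * kWarpSize;
}
```

In the interleaved layout, thread 0 writes indices 0, 32, 64 and thread 1 writes 1, 33, 65, so each step of the group covers one contiguous range of outputs.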
What do you mean here? I didn't quite follow.
@JamesLim-sy CI has passed. Could you please review?
#51835 (comment)
That PR also fixed a ROCm compilation issue; see whether anything there can serve as a reference.
@luotao1 @JamesLim-sy CI has passed.
emmmm
Sorry to inform you that e518f38's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
@Asthestarsfalll Please merge develop again and re-run CI.
@JamesLim-sy @luotao1 CI has passed.
@JamesLim-sy PTAL
LGTM, Great job!
PR types
Performance optimization
PR changes
OPs
Description
optimize logsumexp in small data scale
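The approach: each thread processes ColsPerThread elements. When the data is so small that too few thread groups would be launched, each thread additionally processes multiple rows to improve instruction-level parallelism.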
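A host-side emulation of this decomposition, written as a sketch under stated assumptions rather than as the PR's kernel. `TiledLogSumExp`, `lanes`, and the contiguous split of columns across lanes are all hypothetical; on the GPU the per-lane partial results would be combined with a warp-level reduction instead of the sequential loops used here.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// x is row-major with shape rows x cols. `lanes` stands in for the threads of
// one group; each lane covers cols_per_thread columns of every row in the
// group's batch of rows_per_thread rows.
std::vector<double> TiledLogSumExp(const std::vector<double>& x, int rows,
                                   int cols, int lanes, int rows_per_thread) {
  std::vector<double> out(rows > 0 ? rows : 0);
  if (rows <= 0 || cols <= 0 || lanes <= 0 || rows_per_thread <= 0) return out;
  const int cols_per_thread = (cols + lanes - 1) / lanes;  // ceiling division
  for (int row0 = 0; row0 < rows; row0 += rows_per_thread) {
    const int row_end = std::min(rows, row0 + rows_per_thread);
    for (int r = row0; r < row_end; ++r) {
      // Pass 1: per-lane partial maxima, then reduced across lanes.
      double row_max = -std::numeric_limits<double>::infinity();
      for (int lane = 0; lane < lanes; ++lane) {
        for (int i = 0; i < cols_per_thread; ++i) {
          const int c = lane * cols_per_thread + i;
          if (c < cols) row_max = std::max(row_max, x[r * cols + c]);
        }
      }
      // Pass 2: per-lane partial sums of exp(x - max), then reduced.
      double row_sum = 0.0;
      for (int lane = 0; lane < lanes; ++lane) {
        double partial = 0.0;
        for (int i = 0; i < cols_per_thread; ++i) {
          const int c = lane * cols_per_thread + i;
          if (c < cols) partial += std::exp(x[r * cols + c] - row_max);
        }
        row_sum += partial;
      }
      out[r] = std::log(row_sum) + row_max;
    }
  }
  return out;
}
```

Handling several rows per thread gives each thread independent chains of arithmetic to interleave, which is the instruction-level parallelism the description refers to.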
Current forward performance is as follows (averaged over 1000 runs):
Related PR: #52509
Speedup is computed as: (old_time - new_time) / new_time
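As a worked instance of that formula (the helper name `RelativeSpeedup` and the timings are illustrative only, not measurements from this PR):

```cpp
// (old_time - new_time) / new_time: 0.0 means no change, 1.0 means the new
// kernel is twice as fast as the old one. new_time must be positive.
double RelativeSpeedup(double old_time, double new_time) {
  return (old_time - new_time) / new_time;
}
```

For example, an old time of 2.0 and a new time of 1.0 give a value of 1.0, that is, a 100% improvement relative to the new time.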