[BugFix] wrong match between depend and c_allreduce_sum #53089

Merged

@kangguangli kangguangli commented Apr 19, 2023

PR types

Bug fixes

PR changes

Others

Description

Background

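After #51989 was merged, QA reported that PaddleNLP_gpt3_bs16_fp16_DP2-MP4-PP1_N1C8 printed no logs; investigation confirmed that the job had hung. The cause is that the change assumed every c_allreduce_sum op is inserted by RawProgramOptimizer, which is not actually the case. Dependencies can therefore only be established, and depend ops inserted, for the c_allreduce_sum ops that RawProgramOptimizer itself inserts.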
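To make the failure mode concrete, here is a minimal sketch of how a depend op could end up paired with the wrong c_allreduce_sum. It uses a toy list-of-dicts program, not Paddle's real program representation. The helper names (`make_op`, `match_by_type`, `match_by_fused_var`), the `inserted_by` field, and the variable names are all hypothetical, and the positional pairing shown is only one assumed way such a mismatch could arise.

```python
# Toy model of a program: each op is a plain dict.
# This is an illustration only, NOT Paddle's actual data structures.
def make_op(op_type, var, inserted_by=None):
    return {"type": op_type, "var": var, "inserted_by": inserted_by}


program = [
    # An allreduce that some other pass inserted (hypothetical origin).
    make_op("c_allreduce_sum", "mp_tmp_0", inserted_by="other_pass"),
    # Allreduces inserted by RawProgramOptimizer for its fused gradients.
    make_op("c_allreduce_sum", "fused_var_0", inserted_by="RawProgramOptimizer"),
    make_op("c_allreduce_sum", "fused_var_1", inserted_by="RawProgramOptimizer"),
]
fused_vars = ["fused_var_0", "fused_var_1"]


def match_by_type(program, fused_vars):
    """Flawed: assumes every c_allreduce_sum belongs to the optimizer and
    pairs the i-th allreduce with the i-th fused var by position."""
    allreduces = [op for op in program if op["type"] == "c_allreduce_sum"]
    return [(op["var"], fv) for op, fv in zip(allreduces, fused_vars)]


def match_by_fused_var(program, fused_vars):
    """Safer: only allreduces whose variable is a known fused var get a depend."""
    wanted = set(fused_vars)
    return [
        (op["var"], op["var"])
        for op in program
        if op["type"] == "c_allreduce_sum" and op["var"] in wanted
    ]


# Each pair is (variable of the allreduce, variable the depend op is tied to).
wrong_pairs = match_by_type(program, fused_vars)
right_pairs = match_by_fused_var(program, fused_vars)
```

In this toy program the positional match ties the depend for `fused_var_0` to the unrelated `mp_tmp_0` allreduce, while matching on the fused variable skips the foreign op entirely.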

Detail Changes

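  1. Fixed the problem introduced by [Perf] remove sync_calc_stream and sync_comm_stream #51989: depend ops are now inserted according to fused_var.
  2. After the fix, PaddleNLP_gpt3_bs16_fp16_DP4-MP8-PP1 showed a 10% performance drop. Analysis showed that once c_sync_calc_stream is removed, the NcclAllReduce kernel takes longer to execute; the reason is not yet clear. Because a release is imminent, this PR applies a temporary fix: when inserting communication ops, RawProgramOptimizer chooses between two code paths based on FLAGS_sync_before_allreduce, and when the flag is set it follows the approach used before [Perf] remove sync_calc_stream and sync_comm_stream #51989. The flag will be removed once the cause has been identified.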
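The two code paths selected by the flag can be sketched roughly as follows. This is a simplified illustration, not the actual RawProgramOptimizer code: the function `insert_comm_ops` is hypothetical, ops are modelled as `(op_type, var)` tuples, the flag is passed in as a plain boolean rather than read from FLAGS_sync_before_allreduce, and the exact placement of the sync ops in the legacy path is an assumption.

```python
def insert_comm_ops(fused_vars, sync_before_allreduce):
    """Return the communication-related ops for the given fused gradients.

    Hypothetical helper: ops are (op_type, var) tuples, not real Paddle ops.
    """
    ops = []
    if sync_before_allreduce:
        # Legacy path (the behaviour before #51989, as assumed in this sketch):
        # synchronize the calculation stream, run the allreduces, then
        # synchronize the communication stream.
        ops.append(("c_sync_calc_stream", None))
        for fv in fused_vars:
            ops.append(("c_allreduce_sum", fv))
        ops.append(("c_sync_comm_stream", None))
    else:
        # New path: no explicit stream syncs; ordering is expressed by a
        # depend op keyed on the same fused var as each allreduce.
        for fv in fused_vars:
            ops.append(("c_allreduce_sum", fv))
            ops.append(("depend", fv))
    return ops


legacy_ops = insert_comm_ops(["fused_var_0", "fused_var_1"], sync_before_allreduce=True)
new_ops = insert_comm_ops(["fused_var_0", "fused_var_1"], sync_before_allreduce=False)
```

Keeping both branches behind one switch lets the slower configuration fall back to the old behaviour without reverting the depend-based path for everyone else.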

Others

Card-68266

paddle-bot bot commented Apr 19, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@zhiqiu zhiqiu left a comment

LGTM

@kangguangli kangguangli merged commit f0f5866 into PaddlePaddle:develop Apr 24, 2023
@kangguangli kangguangli deleted the fix_raw_program_optimizer branch May 19, 2023 12:04
3 participants