[BugFix] wrong match between depend and c_allreduce_sum #53089
Merged
PR types
Bug fixes
PR changes
Others
Description
Background
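After #51989 was merged, QA reported that PaddleNLP_gpt3_bs16_fp16_DP2-MP4-PP1_N1C8 no longer printed any logs; investigation confirmed that the job had hung. The cause is that the change assumed every c_allreduce_sum is inserted by RawProgramOptimizer, which is not actually the case. Dependencies can therefore only be established, by inserting depend operators, for the c_allreduce_sum operators that RawProgramOptimizer itself inserts.
Detail Changes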
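A depend operator is inserted for fused_var.
With depend operators used in place of c_sync_calc_stream, the NcclAllReduce kernel execution time increases; the cause is not yet clear. Because a release is imminent, this PR applies a temporary fix: when inserting communication operators, RawProgramOptimizer follows one of two code paths depending on FLAGS_sync_before_allreduce. If the flag is set, it follows the scheme used before #51989 ([Perf] remove sync_calc_stream and sync_comm_stream). The flag will be removed once the cause has been identified.
Others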
Card-68266
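
Illustration of the two code paths described under Detail Changes. This is a minimal sketch, not the actual RawProgramOptimizer code: the program is modelled as a plain list of dicts, and the helper names (insert_comm_ops, build_program), the "inserted_by" field, and the variable names are hypothetical. The boolean argument stands in for FLAGS_sync_before_allreduce. It only shows that the optimizer adds an ordering op in front of the c_allreduce_sum ops it inserts itself and leaves c_allreduce_sum ops from other sources untouched.

```python
def insert_comm_ops(ops, grad_var, sync_before_allreduce):
    """Append the ops a toy optimizer would insert for one gradient variable."""
    if sync_before_allreduce:
        # Flag set: scheme used before #51989, synchronize the calc stream first.
        ops.append({"type": "c_sync_calc_stream", "var": grad_var})
    else:
        # Flag unset: express the ordering with a depend op instead.
        ops.append({"type": "depend", "var": grad_var})
    # The allreduce inserted by the optimizer itself (tag is hypothetical).
    ops.append({"type": "c_allreduce_sum", "var": grad_var,
                "inserted_by": "RawProgramOptimizer"})
    return ops


def build_program(grad_vars, sync_before_allreduce):
    """Build a toy op list that already contains a foreign c_allreduce_sum."""
    # This op was not inserted by the optimizer, so no ordering op is
    # paired with it.
    ops = [{"type": "c_allreduce_sum", "var": "mp_out", "inserted_by": "other"}]
    for grad_var in grad_vars:
        insert_comm_ops(ops, grad_var, sync_before_allreduce)
    return ops


if __name__ == "__main__":
    for flag in (False, True):
        types = [op["type"] for op in build_program(["fused_var"], flag)]
        print(flag, types)
```

In this toy model, the pre-existing c_allreduce_sum is never matched with a depend op, which is the mismatch the title of this PR refers to.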