@waliwali777
Contributor

@waliwali777 waliwali777 commented Aug 26, 2025

PR Category

Auto Parallel

PR Types

Bug fixes

Description

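Background:
With the ernie model under pp2 + num_hidden_layers=4 + moe_layer_start_index=3:

  • Stage 0 contains no MoE layer, so every grad mesh is [0,1,2,3], and grad clip calls allreduce to sum over all gradients.

  • Stage 1 contains the MoE layer. Most grads are on mesh [4,5,6,7], but the MoE softmax grads are on mesh [current_rank]. In grad clip these are resharded to [4,5,6,7], yet allreduce is not called to sum over all gradients.

Solution:
In grad clip, flag_auto_hybrid_pp controls whether the allreduce communication is performed.
This PR adapts the stage 1 case to MoE by setting flag_auto_hybrid_pp=True, so that both stage 0 and stage 1 perform the allreduce communication.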
card-92763
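
To make the mismatch concrete, here is a minimal pure-Python sketch of global-norm clipping across two pipeline stages. It is an illustration under stated assumptions, not the code in python/paddle/nn/clip.py: `simulate_clip_norms` and the toy gradient values are hypothetical, only the name `flag_auto_hybrid_pp` comes from the description above, and the allreduce is modeled as a plain sum over the stages whose flag is set. The sketch does not model what a real collective does when peers disagree about joining it.

```python
import math


def simulate_clip_norms(stage_grads, flag_auto_hybrid_pp):
    """Return the gradient norm each stage would use for clipping.

    stage_grads: dict mapping stage id -> list of gradient values (toy data).
    flag_auto_hybrid_pp: dict mapping stage id -> bool; True means the stage
        joins the (simulated) cross-stage allreduce of squared norms.
    """
    # Each stage first reduces its own gradients to a local sum of squares.
    partial = {
        stage: sum(g * g for g in grads) for stage, grads in stage_grads.items()
    }
    # Simulated allreduce(sum): only stages with the flag set contribute.
    joined = [stage for stage in stage_grads if flag_auto_hybrid_pp[stage]]
    total = sum(partial[stage] for stage in joined)
    # A stage that skips the allreduce only ever sees its local sum.
    return {
        stage: math.sqrt(total if flag_auto_hybrid_pp[stage] else partial[stage])
        for stage in stage_grads
    }


stage_grads = {0: [3.0], 1: [4.0]}

# Stage 1 skips the allreduce, so the two stages clip with different norms.
before = simulate_clip_norms(stage_grads, {0: True, 1: False})

# Both stages join the allreduce and agree on the global norm.
after = simulate_clip_norms(stage_grads, {0: True, 1: True})

print(before)
print(after)
```

With the flag set on both stages, every stage scales its gradients by the same global norm, which is the behavior the description aims for.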

@paddle-bot

paddle-bot bot commented Aug 26, 2025

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the CI test results first. See the Paddle CI Manual for details.

@codecov-commenter

codecov-commenter commented Aug 26, 2025

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@bfbb12d). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/paddle/nn/clip.py | 0.00% | 1 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #74916   +/-   ##
==========================================
  Coverage           ?    0.00%           
==========================================
  Files              ?        1           
  Lines              ?        1           
  Branches           ?        0           
==========================================
  Hits               ?        0           
  Misses             ?        1           
  Partials           ?        0           

liym27
liym27 previously approved these changes Aug 28, 2025
Contributor

@liym27 liym27 left a comment

LGTM

@waliwali777
Contributor Author

/re-run all-failed

@liym27 liym27 merged commit e8e81ce into PaddlePaddle:develop Aug 28, 2025
96 of 103 checks passed
4 participants