[cherry-pick][Distributed] fix recreate nccl comm bug (#73625) #74168
|
Your PR was submitted successfully. Thank you for contributing to this open-source project! |
Codecov Report
❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #74168 +/- ##
==========================================
Coverage ? 0.00%
==========================================
Files ? 1
Lines ? 2
Branches ? 0
==========================================
Hits ? 0
Misses ? 2
Partials ? 0
|
|
/re-run cpu |
|
/re-run all-failed |
|
Sorry to inform you that 727bd39's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually. |
|
/re-run all-failed |
gongweibao
left a comment
LGTM
|
/re-run Approval |
|
/re-run all-failed |
SylarTiaNII
left a comment
LGTM
727bd39 to 79fbe18
PR Category
Distributed Strategy
PR Types
Bug fixes
Description
Pcard-90602
cherry-pick pr: #73625
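When pp (pipeline parallelism) is enabled, recreating the nccl comm can hang. The cause is that, within a single communication group, the unique_key values are fetched in no fixed order during TCP communication, so the ranks end up waiting on each other. This PR fixes the problem by replacing the unordered map with an ordered map.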
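
The idea behind the fix can be illustrated with a minimal sketch. This is not the code changed in this PR: `CommTable`, `BuildTable`, `ExchangeOrder`, and the key strings are hypothetical names chosen for illustration. It assumes each rank keeps a table of communicators keyed by unique_key and walks that table when recreating them. With `std::map`, every rank visits the keys in the same sorted order no matter how the entries were inserted.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-rank table: unique_key -> ring id.
// std::map iterates in sorted key order on every rank.
using CommTable = std::map<std::string, int>;

// Builds a table by inserting keys in a rank-specific order.
CommTable BuildTable(const std::vector<std::string>& insertion_order) {
  CommTable table;
  int ring_id = 0;
  for (const auto& key : insertion_order) {
    const bool inserted = table.emplace(key, ring_id++).second;
    assert(inserted && "each unique_key should appear once");
    (void)inserted;  // avoid an unused-variable warning under NDEBUG
  }
  return table;
}

// Returns the order in which a rank would walk its table and
// exchange unique_key values with its peers.
std::vector<std::string> ExchangeOrder(const CommTable& table) {
  std::vector<std::string> order;
  order.reserve(table.size());
  for (const auto& entry : table) {
    order.push_back(entry.first);
  }
  return order;
}
```

If `CommTable` were a `std::unordered_map` instead, the iteration order would be unspecified and could differ between processes. One rank might then block on one key while its peer blocks on another, which matches the mutual wait described above.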