lshpku (Contributor) commented Oct 28, 2025

PR Category

Communication Library

PR Types

Bug fixes

Description

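#74284 was found to cause precision diffs when multiple users share the same machines. This cannot be resolved for now, so the change is reverted.

Symptom: when two users run deep_ep at the same time on the same set of machines (that is, their machine sets overlap), internode dispatch produces subtle precision diffs and does not report an explicit hang. The bug reproduces 100% of the time by opening 2 windows on 2 machines and launching the unit test from each. When a single user has exclusive use of the machines, the problem does not occur.

Analysis: my guess is that RDMA traffic gets crossed between the instances when machines are shared. It is hard to say whether this is an accidentally introduced bug or whether upstream never intended to support this case: I checked the docs, the commit history, and the issue tracker, and none of them mention whether deep_ep can run multiple instances at once, so the behavior itself is somewhat undefined. Since upstream has not fixed it either, the only option for now is to revert.

The develop branch has been tested on 1/2/4/8 machines, and the bug is gone.

TODO: test the original deep_ep to see whether it has this problem. If it does, deep_ep really cannot be shared; if it does not, the problem is in my PR.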
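Because the diff is silent (no hang, no error), one way to catch it is to compare the dispatch output from a shared-machine run against a reference taken from an exclusive run. The following is only a minimal sketch, assuming both outputs have already been flattened into Python floats; `count_bitwise_mismatches` is a hypothetical helper for illustration and is not part of deep_ep or Paddle.

```python
import struct


def count_bitwise_mismatches(reference, candidate):
    """Count positions whose float32 bit patterns differ.

    Hypothetical helper: `reference` would come from an exclusive run and
    `candidate` from a run that shares machines with another user.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    mismatches = 0
    for r, c in zip(reference, candidate):
        # Compare raw float32 bytes so that even tiny numeric drift is counted.
        if struct.pack("<f", r) != struct.pack("<f", c):
            mismatches += 1
    return mismatches
```

A bitwise comparison is used here rather than a tolerance check because the two runs are expected to be identical; any nonzero count would indicate the kind of diff described above.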

Pcard-85711

paddle-bot bot commented Oct 28, 2025

Your PR has been submitted successfully. Thank you for contributing to this open-source project!
Please watch for the results of the automated CI tests. See the Paddle CI Manual for details.

@risemeup1 risemeup1 merged commit e2a8155 into PaddlePaddle:develop Oct 29, 2025
58 of 60 checks passed
zyfncg added a commit to zyfncg/Paddle that referenced this pull request Nov 12, 2025