Revert "Update deep_ep intranode & internode kernels (#74284)" #76090
PR Category
Communication Library
PR Types
Bug fixes
Description
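#74284 was found to produce precision diffs when multiple users share the same machines. The issue cannot be resolved for now, so this PR reverts it.
Symptom: when two users run deep_ep at the same time on the same set of machines (the machines the two users occupy overlap), internode dispatch produces a subtle precision diff and does not explicitly report a hang. The bug reproduces 100% of the time by opening two windows on 2 machines and launching the unit test in each. When a single user has exclusive use of the machines, the problem does not occur.
Analysis: my guess is that RDMA traffic gets crossed between the two jobs when the machines are shared. It is hard to say whether this is a bug introduced by mistake or a case upstream never intended to support: I checked the documentation, commit history, and issue tracker, and none of them mention whether multiple deep_ep instances can run at the same time, so the behavior itself is somewhat undefined. Since upstream has not shipped a fix since then either, the only option for now is to revert.
TODO: test the original deep_ep to see whether it has this problem. If it does, deep_ep indeed cannot be shared across users; if it does not, my PR is at fault.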
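Because the failure is silent (no hang, only wrong values), the comparison in the TODO is easiest to judge with a bit-exact check rather than a tolerance-based one. The following is a minimal sketch of such a check in plain Python. `float_bits` and `find_mismatches` are hypothetical helper names, not part of deep_ep or of the existing unit test, and the sketch assumes the dispatched buffer and a reference buffer have already been copied to the host as flat sequences of floats.

```python
import struct
from typing import List, Sequence, Tuple


def float_bits(x: float) -> int:
    """Return the IEEE-754 double-precision bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def find_mismatches(
    expected: Sequence[float], received: Sequence[float]
) -> List[Tuple[int, float, float]]:
    """List (index, expected, received) for every element whose bits differ.

    Hypothetical helper: the comparison is bitwise, so 0.0 vs -0.0 counts as
    a mismatch and two NaNs with the same bit pattern count as equal.
    """
    if len(expected) != len(received):
        raise ValueError(
            f"length mismatch: {len(expected)} expected vs {len(received)} received"
        )
    return [
        (i, e, r)
        for i, (e, r) in enumerate(zip(expected, received))
        if float_bits(e) != float_bits(r)
    ]


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    dispatched = [1.0, 2.5, 3.0]  # made-up data standing in for a corrupted buffer
    for index, want, got in find_mismatches(reference, dispatched):
        print(f"element {index}: expected {want!r}, received {got!r}")
```

Running the same check once with a single job and once with two concurrent jobs, on both the original deep_ep and the reverted PR, would separate the two cases the TODO describes.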
Pcard-85711