Update deep_ep intranode & internode kernels #74284
Your PR was submitted successfully. Thank you for contributing to the open-source project!
117bd81 to 5f846fc
5f846fc to a3d0d9e
gongweibao
left a comment
LGTM
XiaoguangHu01
left a comment
LGTM
…4284)" (PaddlePaddle#76090) This reverts commit e2a8155.
…Paddle#74284)" (PaddlePaddle#76090)" This reverts commit e5f8345.
PR Category
Communication Library
PR Types
Performance
Description
Updates the low-level intranode and internode kernels to the upstream commit deepseek-ai/DeepEP@079c5a4 (July 14).
That commit already includes the TMA optimization for internode performance.
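Changes in this PR
Copied intranode.cu, internode.cu, configs.cuh, and ibgda_device.cuh over directly.
Copied launch.cuh and utils.cuh over, but kept the deprecated functions that low_latency still depends on (low_latency is maintained by the inference team and is left unchanged).
Copied runtime.cu and layout.cu over and merged them into a single runtime.cu (they were merged the same way before).
Copied the intranode and internode parts of api.cuh over.
Made minor changes to the member variables of Buffer in deep_ep.hpp.
Modified the Buffer constructor and sync method in deep_ep.cpp, along with the intranode and internode call sites, so that the new member variables are set correctly and the code matches the new CUDA-layer interface.
Added a helper method in types.h.
Correctness testing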
Ran the unit tests test_intranode.py and test_internode.py (on 2, 4, and 8 nodes); all passed.
Ran end-to-end convergence tests on DeepseekV3 under multiple PP and EP configurations; all passed.
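Performance changes
The advantage of the new version is that it reaches the same communication bandwidth with fewer SMs, which leaves more SMs for computation.
For example, on DeepseekV3, deepep sm 20->14 and deepgemm sm 112->118, for an end-to-end improvement of 1-2%.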
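To make the SM trade-off concrete, here is a minimal sketch of the arithmetic behind the DeepseekV3 figures. It is not code from this PR. It assumes a fixed per-GPU budget of 132 SMs, which is inferred only from the numbers quoted here (20 + 112 = 14 + 118 = 132), and the helper name `compute_sms` is hypothetical.

```python
def compute_sms(total_sms: int, comm_sms: int) -> int:
    """Return the SMs left for compute after reserving comm_sms for communication.

    Hypothetical helper for illustration; assumes the two kernels share one
    fixed SM budget and do not overlap in their reservations.
    """
    if not 0 <= comm_sms <= total_sms:
        raise ValueError("comm_sms must be between 0 and total_sms")
    return total_sms - comm_sms


# Assumption: total budget inferred from the PR figures (20 + 112).
TOTAL_SMS = 20 + 112

old_compute = compute_sms(TOTAL_SMS, comm_sms=20)  # deepep sm = 20
new_compute = compute_sms(TOTAL_SMS, comm_sms=14)  # deepep sm = 14

# Relative increase in SMs available to deepgemm.
relative_gain = (new_compute - old_compute) / old_compute

print(f"compute SMs: {old_compute} -> {new_compute} (+{relative_gain:.1%})")
```

Under that assumption the compute kernels get about 5% more SMs, so the reported 1-2% end-to-end gain is smaller than the raw SM increase.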
Pcard-85711