I am using StepMesh in vLLM AFD and have run into a performance problem. The timeline below shows that the GPU spends a lot of time waiting for tensors from the other node (I am running 1A1F, both on H20 GPUs).
To check StepMesh on its own, I benchmarked it with the provided script bmk_comm_latency_multiserver. The results (bs=1, num_tokens=128, hidden_size=7168) look good:
comm bmk gpu=6: mean=0.373ms, p50=0.379ms, p99=0.405ms, max=0.431ms
gpu=6 push 1:
python_req 0.011 0.011 0.019 0.035 0.024 0.023
req_send 0.000 0.000 0.000 0.001 0.001 0.001
req_recv 0.001 0.001 0.006 0.009 0.008 0.007
process 0.001 0.001 0.001 0.001 0.001 0.001
rsp_send 0.000 0.000 0.000 0.001 0.000 0.000
rsp_recv 0.001 0.001 0.002 0.008 0.007 0.003
python_rsp 0.188 0.191 0.209 0.237 0.213 0.213
net_cost 0.170 0.174 0.191 0.195 0.195 0.194
gpu=6 push 2:
python_req 0.013 0.013 0.021 0.037 0.028 0.025
req_send 0.000 0.000 0.000 0.007 0.000 0.000
req_recv 0.002 0.001 0.007 0.010 0.010 0.009
process 0.000 0.000 0.003 0.004 0.004 0.003
rsp_send 0.000 0.000 0.000 0.000 0.000 0.000
rsp_recv 0.001 0.001 0.002 0.009 0.006 0.006
python_rsp 0.179 0.183 0.200 0.226 0.211 0.204
net_cost 0.177 0.179 0.198 0.201 0.201 0.200
gpu=6 pull 1:
python_req 0.015 0.014 0.023 0.038 0.029 0.027
req_send 0.000 0.000 0.000 0.000 0.000 0.000
req_recv 0.003 0.002 0.010 0.016 0.016 0.011
process 0.015 0.015 0.028 0.057 0.029 0.029
rsp_send 0.000 0.000 0.001 0.005 0.001 0.001
rsp_recv 0.001 0.001 0.004 0.017 0.009 0.007
python_rsp 0.000 0.000 0.001 0.007 0.005 0.001
net_cost 0.337 0.343 0.364 0.369 0.368 0.365
But when I integrate StepMesh into vLLM, the numbers become:
comm bmk gpu=1: mean=8.455ms, p50=9.079ms, p99=13.990ms, max=14.514ms
gpu=1 push 1:
python_req 0.427 0.317 1.263 1.395 1.126 1.067
req_send 0.001 0.000 0.007 0.009 0.006 0.005
req_recv 0.002 0.001 0.018 0.018 0.017 0.013
process 0.001 0.001 0.014 0.014 0.013 0.007
rsp_send 0.001 0.000 0.009 0.011 0.006 0.006
rsp_recv 4.963 3.974 12.229 12.533 11.912 10.704
python_rsp 2.871 1.556 10.981 11.901 10.025 10.022
net_cost 0.189 0.187 0.221 0.221 0.221 0.215
gpu=1 pull 1:
python_req 0.430 0.320 1.266 1.397 1.131 1.070
req_send 0.001 0.000 0.003 0.003 0.002 0.002
req_recv 0.009 0.004 0.079 0.080 0.077 0.073
process 7.631 8.282 12.954 13.344 12.549 11.677
rsp_send 0.001 0.001 0.008 0.009 0.006 0.005
rsp_recv 0.006 0.002 0.094 0.174 0.011 0.010
python_rsp 0.002 0.001 0.030 0.051 0.008 0.002
net_cost 0.376 0.376 0.390 0.392 0.387 0.383
or
comm bmk gpu=6: mean=14.235ms, p50=9.928ms, p99=103.375ms, max=104.748ms
gpu=6 push 1:
python_req 4.858 0.319 93.255 95.012 91.426 32.661
req_send 0.001 0.000 0.005 0.006 0.005 0.004
req_recv 0.002 0.001 0.017 0.018 0.016 0.016
process 0.001 0.001 0.015 0.015 0.014 0.006
rsp_send 0.001 0.000 0.005 0.007 0.004 0.003
rsp_recv 5.099 3.963 10.586 10.595 10.578 10.576
python_rsp 4.080 1.676 34.992 59.008 9.997 9.977
net_cost 0.194 0.193 0.216 0.216 0.215 0.202
gpu=6 pull 1:
python_req 4.861 0.321 93.257 95.014 91.428 32.663
req_send 0.000 0.000 0.002 0.002 0.001 0.001
req_recv 0.008 0.004 0.092 0.101 0.084 0.062
process 8.967 9.218 36.514 60.609 11.436 11.417
rsp_send 0.001 0.000 0.004 0.005 0.004 0.003
rsp_recv 0.002 0.002 0.012 0.013 0.011 0.011
python_rsp 0.001 0.001 0.010 0.016 0.002 0.002
net_cost 0.394 0.393 0.421 0.427 0.415 0.414
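To make the comparison easier to read, here is a rough sketch that parses two of the breakdowns above and ranks the stages by how much their first column grew. The helper names (parse_breakdown, slowest_stages) are my own and not part of StepMesh. The benchmark output does not label its columns, so the sketch only compares column 0 and makes no claim about what that column means. The sample rows are copied from the standalone gpu=6 run and the vLLM gpu=1 run.

```python
import re

# Rows copied from the standalone benchmark (gpu=6) above.
BASELINE = """\
gpu=6 push 1:
python_req 0.011 0.011 0.019 0.035 0.024 0.023
rsp_recv 0.001 0.001 0.002 0.008 0.007 0.003
python_rsp 0.188 0.191 0.209 0.237 0.213 0.213
net_cost 0.170 0.174 0.191 0.195 0.195 0.194
gpu=6 pull 1:
process 0.015 0.015 0.028 0.057 0.029 0.029
net_cost 0.337 0.343 0.364 0.369 0.368 0.365
"""

# Rows copied from the first vLLM-integrated run (gpu=1) above.
INTEGRATED = """\
gpu=1 push 1:
python_req 0.427 0.317 1.263 1.395 1.126 1.067
rsp_recv 4.963 3.974 12.229 12.533 11.912 10.704
python_rsp 2.871 1.556 10.981 11.901 10.025 10.022
net_cost 0.189 0.187 0.221 0.221 0.221 0.215
gpu=1 pull 1:
process 7.631 8.282 12.954 13.344 12.549 11.677
net_cost 0.376 0.376 0.390 0.392 0.387 0.383
"""


def parse_breakdown(text):
    """Parse a dump into {"push 1": {"stage": [floats]}, ...}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"gpu=\d+ (push|pull) (\d+):$", line)
        if header:
            current = f"{header.group(1)} {header.group(2)}"
            sections[current] = {}
            continue
        parts = line.split()
        if current is not None and len(parts) > 1:
            try:
                sections[current][parts[0]] = [float(p) for p in parts[1:]]
            except ValueError:
                pass  # skip anything that is not a stage row
    return sections


def slowest_stages(baseline, integrated, column=0, top=3):
    """Rank stages by absolute increase in the chosen column."""
    deltas = []
    for section, stages in integrated.items():
        for stage, values in stages.items():
            base = baseline.get(section, {}).get(stage)
            if base is not None:
                deltas.append((values[column] - base[column], section, stage))
    deltas.sort(reverse=True)
    return deltas[:top]


if __name__ == "__main__":
    ranked = slowest_stages(parse_breakdown(BASELINE), parse_breakdown(INTEGRATED))
    for delta, section, stage in ranked:
        print(f"{section:8s} {stage:12s} +{delta:.3f}")
```

On these rows the largest increases are process in pull 1, then rsp_recv and python_rsp in push 1, while net_cost changes by only a few hundredths. Assuming net_cost reflects time on the wire, the extra latency seems to come from somewhere other than the network transfer itself.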
Could you offer any insight into what might cause this and how to resolve it? Thank you!