
StepMesh performance issue #42

@JF-D


I used StepMesh in vLLM AFD but ran into a performance issue. The following timeline shows that the GPU spends a lot of time waiting for tensors from the other node (I am using 1A1F; both are H20 GPUs).

[Image: GPU timeline]

I then benchmarked StepMesh on its own with the provided script bmk_comm_latency_multiserver. The results (bs=1, num_tokens=128, hidden_size=7168) look good:

comm bmk gpu=6: mean=0.373ms, p50=0.379ms, p99=0.405ms, max=0.431ms
         gpu=6 push 1:
                python_req      0.011   0.011   0.019   0.035 0.024 0.023
                req_send        0.000   0.000   0.000   0.001 0.001 0.001
                req_recv        0.001   0.001   0.006   0.009 0.008 0.007
                process         0.001   0.001   0.001   0.001 0.001 0.001
                rsp_send        0.000   0.000   0.000   0.001 0.000 0.000
                rsp_recv        0.001   0.001   0.002   0.008 0.007 0.003
                python_rsp      0.188   0.191   0.209   0.237 0.213 0.213
                net_cost        0.170   0.174   0.191   0.195 0.195 0.194
         gpu=6 push 2:
                python_req      0.013   0.013   0.021   0.037 0.028 0.025
                req_send        0.000   0.000   0.000   0.007 0.000 0.000
                req_recv        0.002   0.001   0.007   0.010 0.010 0.009
                process         0.000   0.000   0.003   0.004 0.004 0.003
                rsp_send        0.000   0.000   0.000   0.000 0.000 0.000
                rsp_recv        0.001   0.001   0.002   0.009 0.006 0.006
                python_rsp      0.179   0.183   0.200   0.226 0.211 0.204
                net_cost        0.177   0.179   0.198   0.201 0.201 0.200
         gpu=6 pull 1:
                python_req      0.015   0.014   0.023   0.038 0.029 0.027
                req_send        0.000   0.000   0.000   0.000 0.000 0.000
                req_recv        0.003   0.002   0.010   0.016 0.016 0.011
                process         0.015   0.015   0.028   0.057 0.029 0.029
                rsp_send        0.000   0.000   0.001   0.005 0.001 0.001
                rsp_recv        0.001   0.001   0.004   0.017 0.009 0.007
                python_rsp      0.000   0.000   0.001   0.007 0.005 0.001
                net_cost        0.337   0.343   0.364   0.369 0.368 0.365

But after I integrated StepMesh into vLLM, the results became:

comm bmk gpu=1: mean=8.455ms, p50=9.079ms, p99=13.990ms, max=14.514ms
	 gpu=1 push 1:
		python_req	0.427	0.317	1.263	1.395 1.126 1.067
		req_send	0.001	0.000	0.007	0.009 0.006 0.005
		req_recv	0.002	0.001	0.018	0.018 0.017 0.013
		process  	0.001	0.001	0.014	0.014 0.013 0.007
		rsp_send	0.001	0.000	0.009	0.011 0.006 0.006
		rsp_recv	4.963	3.974	12.229	12.533 11.912 10.704
		python_rsp	2.871	1.556	10.981	11.901 10.025 10.022
		net_cost	0.189	0.187	0.221	0.221 0.221 0.215
	 gpu=1 pull 1:
		python_req	0.430	0.320	1.266	1.397 1.131 1.070
		req_send	0.001	0.000	0.003	0.003 0.002 0.002
		req_recv	0.009	0.004	0.079	0.080 0.077 0.073
		process  	7.631	8.282	12.954	13.344 12.549 11.677
		rsp_send	0.001	0.001	0.008	0.009 0.006 0.005
		rsp_recv	0.006	0.002	0.094	0.174 0.011 0.010
		python_rsp	0.002	0.001	0.030	0.051 0.008 0.002
		net_cost	0.376	0.376	0.390	0.392 0.387 0.383

or

comm bmk gpu=6: mean=14.235ms, p50=9.928ms, p99=103.375ms, max=104.748ms
	 gpu=6 push 1:
		python_req	4.858	0.319	93.255	95.012 91.426 32.661
		req_send	0.001	0.000	0.005	0.006 0.005 0.004
		req_recv	0.002	0.001	0.017	0.018 0.016 0.016
		process  	0.001	0.001	0.015	0.015 0.014 0.006
		rsp_send	0.001	0.000	0.005	0.007 0.004 0.003
		rsp_recv	5.099	3.963	10.586	10.595 10.578 10.576
		python_rsp	4.080	1.676	34.992	59.008 9.997 9.977
		net_cost	0.194	0.193	0.216	0.216 0.215 0.202
	 gpu=6 pull 1:
		python_req	4.861	0.321	93.257	95.014 91.428 32.663
		req_send	0.000	0.000	0.002	0.002 0.001 0.001
		req_recv	0.008	0.004	0.092	0.101 0.084 0.062
		process  	8.967	9.218	36.514	60.609 11.436 11.417
		rsp_send	0.001	0.000	0.004	0.005 0.004 0.003
		rsp_recv	0.002	0.002	0.012	0.013 0.011 0.011
		python_rsp	0.001	0.001	0.010	0.016 0.002 0.002
		net_cost	0.394	0.393	0.421	0.427 0.415 0.414

Could you offer some insights on how to resolve this issue? Thank you!
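
To make the comparison easier to read, here is a rough Python sketch that ranks the per-stage regressions between the standalone run and the vLLM run. It assumes the first numeric column of each stage row is the mean in milliseconds (the column headers are not printed, but the first-column values of the vLLM push 1 block sum to the reported mean of 8.455 ms). The helper names `stage_means` and `rank_regressions` are my own and are not part of StepMesh.

```python
import re

# A stage row is a stage name followed by numeric columns; capture the first one.
# Header lines such as "gpu=1 push 1:" do not match this pattern and are skipped.
ROW = re.compile(r"^\s*(\w+)\s+([\d.]+)")


def stage_means(block: str) -> dict[str, float]:
    """Map stage name -> first numeric column (assumed to be the mean, in ms)."""
    means = {}
    for line in block.splitlines():
        match = ROW.match(line)
        if match:
            means[match.group(1)] = float(match.group(2))
    return means


def rank_regressions(before: dict[str, float], after: dict[str, float]):
    """Return (stage, after - before) pairs, largest increase first."""
    deltas = {name: after[name] - before[name] for name in after if name in before}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# "push 1" rows copied from the standalone benchmark (gpu=6).
STANDALONE_PUSH1 = """
python_req      0.011   0.011   0.019   0.035 0.024 0.023
req_send        0.000   0.000   0.000   0.001 0.001 0.001
req_recv        0.001   0.001   0.006   0.009 0.008 0.007
process         0.001   0.001   0.001   0.001 0.001 0.001
rsp_send        0.000   0.000   0.000   0.001 0.000 0.000
rsp_recv        0.001   0.001   0.002   0.008 0.007 0.003
python_rsp      0.188   0.191   0.209   0.237 0.213 0.213
net_cost        0.170   0.174   0.191   0.195 0.195 0.194
"""

# "push 1" rows copied from the first vLLM run (gpu=1).
VLLM_PUSH1 = """
python_req      0.427   0.317   1.263   1.395 1.126 1.067
req_send        0.001   0.000   0.007   0.009 0.006 0.005
req_recv        0.002   0.001   0.018   0.018 0.017 0.013
process         0.001   0.001   0.014   0.014 0.013 0.007
rsp_send        0.001   0.000   0.009   0.011 0.006 0.006
rsp_recv        4.963   3.974   12.229  12.533 11.912 10.704
python_rsp      2.871   1.556   10.981  11.901 10.025 10.022
net_cost        0.189   0.187   0.221   0.221 0.221 0.215
"""

ranked = rank_regressions(stage_means(STANDALONE_PUSH1), stage_means(VLLM_PUSH1))
for name, delta in ranked[:3]:
    print(f"{name:<12} +{delta:.3f} ms")
```

For the push 1 numbers above, this puts rsp_recv and python_rsp at the top, while net_cost is almost unchanged (0.170 vs 0.189), so under that assumption the extra latency does not appear to come from the network transfer itself.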
