StepMesh performance issue

I am using StepMesh in vLLM AFD and have run into a performance issue. The timeline below shows that the GPU spends a lot of time waiting for the tensor from the other node (I am using a 1A1F setup, both on H20 GPUs).

<img width="1041" height="546" alt="Image" src="https://github.com/user-attachments/assets/64eb7fff-ab39-48db-bfea-4c10d31fb361" />

I benchmarked StepMesh on its own with the provided script `bmk_comm_latency_multiserver`. The results (bs=1, num_tokens=128, hidden_size=7168) look good:
```
comm bmk gpu=6: mean=0.373ms, p50=0.379ms, p99=0.405ms, max=0.431ms
         gpu=6 push 1:
                python_req      0.011   0.011   0.019   0.035 0.024 0.023
                req_send        0.000   0.000   0.000   0.001 0.001 0.001
                req_recv        0.001   0.001   0.006   0.009 0.008 0.007
                process         0.001   0.001   0.001   0.001 0.001 0.001
                rsp_send        0.000   0.000   0.000   0.001 0.000 0.000
                rsp_recv        0.001   0.001   0.002   0.008 0.007 0.003
                python_rsp      0.188   0.191   0.209   0.237 0.213 0.213
                net_cost        0.170   0.174   0.191   0.195 0.195 0.194
         gpu=6 push 2:
                python_req      0.013   0.013   0.021   0.037 0.028 0.025
                req_send        0.000   0.000   0.000   0.007 0.000 0.000
                req_recv        0.002   0.001   0.007   0.010 0.010 0.009
                process         0.000   0.000   0.003   0.004 0.004 0.003
                rsp_send        0.000   0.000   0.000   0.000 0.000 0.000
                rsp_recv        0.001   0.001   0.002   0.009 0.006 0.006
                python_rsp      0.179   0.183   0.200   0.226 0.211 0.204
                net_cost        0.177   0.179   0.198   0.201 0.201 0.200
         gpu=6 pull 1:
                python_req      0.015   0.014   0.023   0.038 0.029 0.027
                req_send        0.000   0.000   0.000   0.000 0.000 0.000
                req_recv        0.003   0.002   0.010   0.016 0.016 0.011
                process         0.015   0.015   0.028   0.057 0.029 0.029
                rsp_send        0.000   0.000   0.001   0.005 0.001 0.001
                rsp_recv        0.001   0.001   0.004   0.017 0.009 0.007
                python_rsp      0.000   0.000   0.001   0.007 0.005 0.001
                net_cost        0.337   0.343   0.364   0.369 0.368 0.365
```
But once I integrate StepMesh into vLLM, the numbers become:
```
comm bmk gpu=1: mean=8.455ms, p50=9.079ms, p99=13.990ms, max=14.514ms
	 gpu=1 push 1:
		python_req	0.427	0.317	1.263	1.395 1.126 1.067
		req_send	0.001	0.000	0.007	0.009 0.006 0.005
		req_recv	0.002	0.001	0.018	0.018 0.017 0.013
		process  	0.001	0.001	0.014	0.014 0.013 0.007
		rsp_send	0.001	0.000	0.009	0.011 0.006 0.006
		rsp_recv	4.963	3.974	12.229	12.533 11.912 10.704
		python_rsp	2.871	1.556	10.981	11.901 10.025 10.022
		net_cost	0.189	0.187	0.221	0.221 0.221 0.215
	 gpu=1 pull 1:
		python_req	0.430	0.320	1.266	1.397 1.131 1.070
		req_send	0.001	0.000	0.003	0.003 0.002 0.002
		req_recv	0.009	0.004	0.079	0.080 0.077 0.073
		process  	7.631	8.282	12.954	13.344 12.549 11.677
		rsp_send	0.001	0.001	0.008	0.009 0.006 0.005
		rsp_recv	0.006	0.002	0.094	0.174 0.011 0.010
		python_rsp	0.002	0.001	0.030	0.051 0.008 0.002
		net_cost	0.376	0.376	0.390	0.392 0.387 0.383
```
or:
```
comm bmk gpu=6: mean=14.235ms, p50=9.928ms, p99=103.375ms, max=104.748ms
	 gpu=6 push 1:
		python_req	4.858	0.319	93.255	95.012 91.426 32.661
		req_send	0.001	0.000	0.005	0.006 0.005 0.004
		req_recv	0.002	0.001	0.017	0.018 0.016 0.016
		process  	0.001	0.001	0.015	0.015 0.014 0.006
		rsp_send	0.001	0.000	0.005	0.007 0.004 0.003
		rsp_recv	5.099	3.963	10.586	10.595 10.578 10.576
		python_rsp	4.080	1.676	34.992	59.008 9.997 9.977
		net_cost	0.194	0.193	0.216	0.216 0.215 0.202
	 gpu=6 pull 1:
		python_req	4.861	0.321	93.257	95.014 91.428 32.663
		req_send	0.000	0.000	0.002	0.002 0.001 0.001
		req_recv	0.008	0.004	0.092	0.101 0.084 0.062
		process  	8.967	9.218	36.514	60.609 11.436 11.417
		rsp_send	0.001	0.000	0.004	0.005 0.004 0.003
		rsp_recv	0.002	0.002	0.012	0.013 0.011 0.011
		python_rsp	0.001	0.001	0.010	0.016 0.002 0.002
		net_cost	0.394	0.393	0.421	0.427 0.415 0.414
```
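To make the comparison easier to read, here is a rough sketch that ranks the stages by how much they grew between the standalone benchmark and the first vLLM run. The benchmark output has no column header, so the sketch assumes the first numeric column is the mean in ms; `parse_stages` and `rank_regressions` are hypothetical helper names of mine, not part of StepMesh. The input text is copied from the two `push 1` blocks above.

```python
import re

# A data row is a stage name followed by whitespace-separated numbers.
ROW = re.compile(r"^\s*(\w+)\s+(\d[\d.]*(?:\s+\d[\d.]*)*)\s*$")

def parse_stages(text):
    """Map stage name -> first numeric column (ASSUMED to be the mean, in ms)."""
    stages = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header lines such as "gpu=6 push 1:" do not match and are skipped
            stages[m.group(1)] = float(m.group(2).split()[0])
    return stages

def rank_regressions(baseline, observed):
    """Return (stage, baseline_ms, observed_ms, delta_ms), largest increase first."""
    rows = [
        (name, baseline[name], observed[name], observed[name] - baseline[name])
        for name in baseline
        if name in observed
    ]
    return sorted(rows, key=lambda r: r[3], reverse=True)

# Copied from the standalone benchmark, "gpu=6 push 1".
STANDALONE_PUSH1 = """
gpu=6 push 1:
    python_req      0.011   0.011   0.019   0.035 0.024 0.023
    req_send        0.000   0.000   0.000   0.001 0.001 0.001
    req_recv        0.001   0.001   0.006   0.009 0.008 0.007
    process         0.001   0.001   0.001   0.001 0.001 0.001
    rsp_send        0.000   0.000   0.000   0.001 0.000 0.000
    rsp_recv        0.001   0.001   0.002   0.008 0.007 0.003
    python_rsp      0.188   0.191   0.209   0.237 0.213 0.213
    net_cost        0.170   0.174   0.191   0.195 0.195 0.194
"""

# Copied from the first vLLM run, "gpu=1 push 1".
VLLM_PUSH1 = """
gpu=1 push 1:
    python_req  0.427   0.317   1.263   1.395 1.126 1.067
    req_send    0.001   0.000   0.007   0.009 0.006 0.005
    req_recv    0.002   0.001   0.018   0.018 0.017 0.013
    process     0.001   0.001   0.014   0.014 0.013 0.007
    rsp_send    0.001   0.000   0.009   0.011 0.006 0.006
    rsp_recv    4.963   3.974   12.229  12.533 11.912 10.704
    python_rsp  2.871   1.556   10.981  11.901 10.025 10.022
    net_cost    0.189   0.187   0.221   0.221 0.221 0.215
"""

baseline = parse_stages(STANDALONE_PUSH1)
observed = parse_stages(VLLM_PUSH1)
ranked = rank_regressions(baseline, observed)

for stage, base, obs, delta in ranked:
    print(f"{stage:<11} {base:7.3f} -> {obs:7.3f} ms  (+{delta:.3f})")
```

Under that assumption, `net_cost` barely moves (0.170 to 0.189) while `rsp_recv` and `python_rsp` grow from well under a millisecond to several milliseconds, so the extra time does not seem to come from the network transfer itself.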
Could you offer any insight into how to resolve this issue? Thank you!

StepMesh performance issue #42

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

StepMesh performance issue #42

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions