[RFC]: EPLB Execution Optimization From PR #18343 #20805

### Motivation.

Follow-up items from #18343:

- EPLB Execution
  - [ ] Parallelize the rearrangement algorithm (calculating the new expert mapping, not the communication)
  - [x] Shuffle one layer at a time across multiple steps, to lower the impact on inter-token latency
  - [x] Investigate whether to pre-allocate the expert weight buffer used for transfers
  - [ ] Take locality into consideration in expert weight transmission, e.g. prioritize transferring to GPUs on the same node
  - [x] Use `cuda.Stream()` to move the weights to the buffer asynchronously
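
To make the locality item concrete, here is a minimal sketch of one possible reading: order pending expert-weight transfers so that same-node ones are issued first. The `Transfer` record, the `node_of` helper, and the assumption that ranks are laid out contiguously per node (`gpus_per_node` ranks each) are all hypothetical and not taken from #18343.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Transfer:
    """Hypothetical record of one expert-weight move between ranks."""
    layer: int
    expert: int
    src_rank: int
    dst_rank: int


def node_of(rank: int, gpus_per_node: int) -> int:
    # Assumption: ranks are assigned contiguously, node by node.
    return rank // gpus_per_node


def order_by_locality(transfers: List[Transfer], gpus_per_node: int) -> List[Transfer]:
    """Return transfers with same-node moves first.

    sorted() is stable, so the original order is kept within each group.
    """
    def is_cross_node(t: Transfer) -> bool:
        return node_of(t.src_rank, gpus_per_node) != node_of(t.dst_rank, gpus_per_node)

    return sorted(transfers, key=is_cross_node)  # False (same node) sorts first


transfers = [
    Transfer(layer=0, expert=3, src_rank=0, dst_rank=9),    # cross-node
    Transfer(layer=0, expert=5, src_rank=1, dst_rank=2),    # same node
    Transfer(layer=0, expert=7, src_rank=8, dst_rank=4),    # cross-node
    Transfer(layer=0, expert=1, src_rank=10, dst_rank=11),  # same node
]
ordered = order_by_locality(transfers, gpus_per_node=8)
print([t.expert for t in ordered])
```

Another reading of the same item is to pick a same-node source replica when several ranks hold the expert; the sketch above only covers the ordering of already-chosen transfers.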

### Proposed Change.

<img width="4357" height="11319" alt="Image" src="https://github.com/user-attachments/assets/6e10789f-beca-4e8a-a4be-9e13bf1f9f39" />

Advantages over the synchronous solution: TTFT/TPOT fluctuations stay small. The synchronous solution blocks inference while experts are being rearranged, which increases TTFT/TPOT by hundreds of milliseconds.
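
As a rough illustration of why spreading the rearrangement over several steps smooths latency, the toy cost model below compares shuffling all layers in one step with shuffling one layer per step. The function names and all millisecond figures are made up for illustration and are not measurements; the model also charges the full per-layer cost to the step, so it does not capture the extra overlap an asynchronous copy would give.

```python
from typing import Iterator, List


def layerwise_schedule(num_layers: int, layers_per_step: int = 1) -> Iterator[List[int]]:
    """Yield the layer indices to rearrange on each successive forward step."""
    for start in range(0, num_layers, layers_per_step):
        yield list(range(start, min(start + layers_per_step, num_layers)))


def step_latencies(num_steps: int, base_ms: float, per_layer_ms: float,
                   num_layers: int, layers_per_step: int) -> List[float]:
    """Toy model: each step costs base_ms plus per_layer_ms per layer shuffled in it."""
    schedule = layerwise_schedule(num_layers, layers_per_step)
    latencies = []
    for _ in range(num_steps):
        layers = next(schedule, [])  # empty once the rearrangement has finished
        latencies.append(base_ms + per_layer_ms * len(layers))
    return latencies


# Illustrative numbers only: 4 MoE layers, 30 ms per step, 50 ms per layer shuffle.
all_at_once = step_latencies(6, 30.0, 50.0, num_layers=4, layers_per_step=4)
one_per_step = step_latencies(6, 30.0, 50.0, num_layers=4, layers_per_step=1)

print("all at once :", all_at_once)
print("one per step:", one_per_step)
print("worst step  :", max(all_at_once), "vs", max(one_per_step))
```

In this model the total added work is identical in both cases; only the worst single-step latency changes, which is the fluctuation the RFC is concerned with.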

### Feedback Period.

July 11th ~ July 30th

### CC List.

@abmfy

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
