Description
Motivation.
- EPLB Execution
- Parallelize the rearrangement algorithm (calculating the new expert mapping, not the communication)
- Shuffle one layer at a time, spread over multiple steps, to lower the impact on inter-token latency
- Investigate whether we should pre-allocate the expert weight buffer used for transfers
- Take locality into consideration in expert weight transmission, e.g. prioritize transferring to GPUs on the same node
- Use cuda.Stream() to asynchronously move the weights to the buffer
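
To make the "one layer per step" and locality items above more concrete, here is a minimal pure-Python sketch of how a transfer plan could be split and ordered. It is only an illustration under stated assumptions: `Transfer`, `node_of`, `order_by_locality`, `per_layer_steps`, and the `gpus_per_node` parameter are hypothetical names, not existing vLLM APIs, and GPUs are assumed to be numbered contiguously within each node.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Transfer:
    """One expert weight move (hypothetical structure, for illustration only)."""
    layer: int
    expert: int
    src_gpu: int
    dst_gpu: int


def node_of(gpu: int, gpus_per_node: int) -> int:
    # Assumption: global GPU ranks are laid out contiguously per node.
    return gpu // gpus_per_node


def order_by_locality(transfers: List[Transfer], gpus_per_node: int) -> List[Transfer]:
    # sorted() is stable and False < True, so intra-node transfers come first
    # while the original relative order is kept within each group.
    return sorted(
        transfers,
        key=lambda t: node_of(t.src_gpu, gpus_per_node)
        != node_of(t.dst_gpu, gpus_per_node),
    )


def per_layer_steps(
    transfers: List[Transfer], gpus_per_node: int
) -> Iterator[Tuple[int, List[Transfer]]]:
    # Yield one layer's transfers per step so that a scheduler could
    # interleave a single small step between decoding iterations.
    for layer in sorted({t.layer for t in transfers}):
        batch = [t for t in transfers if t.layer == layer]
        yield layer, order_by_locality(batch, gpus_per_node)


# Example plan: two moves for layer 0, one move for layer 1.
plan = [
    Transfer(layer=1, expert=3, src_gpu=0, dst_gpu=9),
    Transfer(layer=0, expert=5, src_gpu=2, dst_gpu=12),  # crosses nodes
    Transfer(layer=0, expert=7, src_gpu=1, dst_gpu=3),   # stays on node 0
]
steps = list(per_layer_steps(plan, gpus_per_node=8))
```

With this plan, the first step contains only layer 0, with the intra-node move of expert 7 ordered ahead of the cross-node move of expert 5. The second step contains the single layer 1 move. The actual weight copy (for example on a separate CUDA stream into a pre-allocated buffer) is deliberately left out of the sketch.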
Proposed Change.
Advantages over the synchronous solution: TTFT/TPOT fluctuations are small. The synchronous solution blocks inference while experts are being rearranged, which increases TTFT/TPOT by hundreds of milliseconds.
Feedback Period.
July 11th ~ July 30th
CC List.
Any Other Things.
No response