[RFC]: EPLB Execution Optimization From pr 18343 #20805

@david6666666

Motivation.

#18343

  • EPLB Execution
  • Parallelize the rearrangement algorithm (calculating new expert mapping, not the communication)
  • Shuffle one layer at a time over multiple steps, to lower the impact on inter-token latency
  • Investigate whether to pre-allocate the expert weight buffer used for transfers
  • Take locality into account in expert weight transmission, e.g. prioritize transfers to GPUs on the same node
  • Use cuda.Stream() to move the weights to the buffer asynchronously
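
The layer-at-a-time and locality items above could be combined roughly as follows. This is only a minimal sketch, not code from #18343: the `Transfer` record, `is_same_node`, `plan_steps`, and the assumption that ranks are laid out contiguously per node (`gpus_per_node`) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Transfer:
    """One expert-weight move (hypothetical record, not a vLLM type)."""
    layer: int
    expert: int
    src_rank: int
    dst_rank: int


def is_same_node(t: Transfer, gpus_per_node: int) -> bool:
    # Assumption: ranks are numbered contiguously within each node.
    return t.src_rank // gpus_per_node == t.dst_rank // gpus_per_node


def plan_steps(transfers: List[Transfer], gpus_per_node: int) -> Iterator[List[Transfer]]:
    """Yield one layer's transfers per step, with intra-node moves first."""
    for layer in sorted({t.layer for t in transfers}):
        batch = [t for t in transfers if t.layer == layer]
        # False sorts before True, so same-node transfers come first;
        # the sort is stable, so the original order is kept within each group.
        batch.sort(key=lambda t: not is_same_node(t, gpus_per_node))
        yield batch


if __name__ == "__main__":
    pending = [
        Transfer(layer=1, expert=3, src_rank=0, dst_rank=9),
        Transfer(layer=0, expert=5, src_rank=0, dst_rank=8),
        Transfer(layer=0, expert=2, src_rank=1, dst_rank=2),
    ]
    for step, batch in enumerate(plan_steps(pending, gpus_per_node=8)):
        print(step, [(t.expert, t.src_rank, t.dst_rank) for t in batch])
```

Running one such batch between forward steps, rather than all layers at once, is what would spread the rearrangement cost across several tokens.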

Proposed Change.

Image

Advantages over the synchronous solution: TTFT/TPOT fluctuations are small. The synchronous solution blocks inference while experts are being rearranged, which increases TTFT/TPOT by hundreds of milliseconds.
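
To illustrate the non-blocking idea, here is a rough sketch in which the new expert mapping is computed on a worker thread and the serving loop only polls for the result once per step. Everything in it is an assumption for illustration: `compute_new_mapping` is a stand-in that simply orders experts by load, and `AsyncRebalancer` is a hypothetical helper, not the implementation proposed in this RFC.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional


def compute_new_mapping(load: List[int]) -> List[int]:
    """Stand-in for the rearrangement algorithm: expert ids, heaviest load first."""
    return sorted(range(len(load)), key=lambda e: -load[e])


class AsyncRebalancer:
    """Computes the mapping off the serving loop (hypothetical helper)."""

    def __init__(self, initial_mapping: List[int]) -> None:
        self.mapping = initial_mapping
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None

    def request(self, load: List[int]) -> None:
        # Start a computation only if none is already in flight.
        if self._pending is None:
            self._pending = self._pool.submit(compute_new_mapping, list(load))

    def poll(self) -> bool:
        """Call once per forward step; never blocks. True if the mapping was swapped."""
        if self._pending is not None and self._pending.done():
            self.mapping = self._pending.result()
            self._pending = None
            return True
        return False

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

In a serving loop, `request()` would be called when rebalancing is due and `poll()` after each forward step, so a step never waits on the mapping computation.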

Feedback Period.

July 11th ~ July 30th

CC List.

@abmfy

Any Other Things.

No response


Labels

RFC, unstale
