Conversation

TheBasy commented Sep 22, 2025

Add support for asynchronous rebalancing in the Expert Parallel Load Balancer (EPLB) to avoid blocking the decoding loop during load analysis. This enables continuous token generation while rebalance computation runs in the background.

New CLI argument:

  • --enable-eplb-rebalance-async: enables asynchronous rebalancing mode
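
For illustration only, the flag would be passed at server launch. The command below is a hypothetical example: only --enable-eplb-rebalance-async comes from this PR, while the sglang.launch_server entry point, the model path, and the --tp-size / --enable-eplb flags are assumptions about the surrounding setup.

```bash
# Hypothetical launch command; only --enable-eplb-rebalance-async is introduced by this PR.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 \
  --enable-eplb \
  --enable-eplb-rebalance-async
```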

Implementation details:

  • Launch a background thread to:
    • Broadcast logical_count
    • Compute ExpertLocationMetadata
    • Store the result in _rebalance_result
  • Use a TP barrier via the Gloo cpu_group (send_single_signal / recv_single_signal) so that all ranks enter the counter-swap phase atomically
  • Introduce a yield-based generator to keep the decoding loop non-blocking
  • Start the model state transfer only after TP-wide agreement via _begin_transfer
  • Keep sync mode unchanged: it still uses the blocking, single-threaded rebalance

This change improves latency stability under dynamic load conditions in MoE models.

Motivation

In Mixture-of-Experts (MoE) models, the Expert Parallel Load Balancer (EPLB) periodically performs load analysis to ensure balanced expert utilization across devices. However, in the current synchronous implementation, this rebalancing process blocks the decoding loop, introducing unpredictable latency spikes—especially under dynamic workloads where frequent rebalance decisions are required. This blocking behavior degrades end-to-end inference performance and undermines the predictability of token generation, which is critical for real-time or interactive applications.

To address this issue, we propose an asynchronous rebalancing mechanism that decouples the computationally intensive load analysis from the token generation pipeline. By offloading rebalance computation to a background thread, the main decoding loop remains non-blocking, enabling continuous token production while maintaining accurate load balancing. This enhancement improves latency stability and system responsiveness under dynamic load conditions, without compromising the correctness or convergence of the rebalancing logic.

Modifications

  • Introduced a new CLI argument --enable-eplb-rebalance-async to enable asynchronous rebalancing mode. When disabled (the default), the original blocking behavior is preserved for backward compatibility.
  • Implemented an asynchronous rebalancing workflow using a dedicated background thread that:
    • broadcasts the local logical_count across tensor parallel (TP) ranks,
    • computes ExpertLocationMetadata from the global load information, and
    • stores the result in a shared _rebalance_result field for later application.
  • Added synchronization via a TP-wide CPU barrier on Gloo's cpu_group, using the send_single_signal and recv_single_signal primitives so that all ranks enter the counter-swap phase atomically, avoiding race conditions.
  • Refactored the decoding loop into a yield-based generator to keep execution non-blocking; model state transfer is deferred until all ranks reach agreement through the _begin_transfer flag. A sketch of this flow follows the list.
  • Preserved the original synchronous mode (async=False): its single-threaded, blocking rebalance logic is untouched, which keeps behavior consistent and makes before/after comparison straightforward.

Together, these changes integrate async rebalancing into the existing EPLB framework while maintaining correctness, scalability, and ease of debugging.
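
Below is a minimal sketch of this flow, assuming a PyTorch setup with Gloo-backed CPU process groups. Every name in it (AsyncEPLBRebalancer, compute_expert_location_metadata, apply_rebalance, the two group arguments) is a hypothetical stand-in rather than this PR's actual code, and in place of the send_single_signal / recv_single_signal helpers it reaches TP-wide agreement with a MIN all-reduce of a readiness flag, which gives the same guarantee that all ranks enter the counter-swap phase in the same decode step.

```python
import threading

import torch
import torch.distributed as dist


def compute_expert_location_metadata(gathered_counts):
    """Placeholder for the real placement computation (not part of this sketch)."""
    raise NotImplementedError


def apply_rebalance(model, metadata):
    """Placeholder for the counter swap and expert weight transfer."""
    raise NotImplementedError


class AsyncEPLBRebalancer:
    """Illustrative async rebalancer: heavy work runs in a background thread,
    the decode loop polls cheaply, and the swap happens only once every TP
    rank agrees to enter the counter-swap phase in the same step."""

    def __init__(self, comm_group, signal_group):
        # Two Gloo-backed CPU groups: the worker thread communicates on
        # comm_group and the decode loop polls on signal_group, so collectives
        # issued from different threads never interleave on one group.
        self._comm_group = comm_group
        self._signal_group = signal_group
        self._rebalance_result = None          # written by the worker thread
        self._local_ready = threading.Event()  # set once this rank's result exists
        self._in_flight = False

    def start(self, logical_count: torch.Tensor):
        """Launch load analysis in the background; returns immediately."""
        self._in_flight = True
        self._local_ready.clear()

        def _worker():
            # Share each rank's CPU-side expert usage counts with all TP ranks,
            # then run the heavy placement computation off the decode thread.
            world = dist.get_world_size(group=self._comm_group)
            gathered = [torch.zeros_like(logical_count) for _ in range(world)]
            dist.all_gather(gathered, logical_count, group=self._comm_group)
            self._rebalance_result = compute_expert_location_metadata(gathered)
            self._local_ready.set()

        threading.Thread(target=_worker, daemon=True).start()

    def poll_and_maybe_apply(self, model) -> bool:
        """Cheap per-step check; applies the new layout only when all ranks agree."""
        if not self._in_flight:
            return False
        # MIN-reduce a readiness flag: the result is 1 only when *every* rank
        # has finished its background computation, so all ranks enter the
        # counter-swap phase during the same decode step.
        flag = torch.tensor([1 if self._local_ready.is_set() else 0])
        dist.all_reduce(flag, op=dist.ReduceOp.MIN, group=self._signal_group)
        if flag.item() == 0:
            return False  # not everyone is ready yet; keep decoding
        apply_rebalance(model, self._rebalance_result)
        self._rebalance_result = None
        self._in_flight = False
        return True
```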
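
A correspondingly hedged view of the yield-based decode loop (decode_step, has_work, next_batch, expert_logical_count, and the fixed rebalance interval are all illustrative, not this PR's interfaces):

```python
def decode_loop(model, scheduler, rebalancer, rebalance_interval: int = 1000):
    """Generator that yields one decode step at a time; the rebalance work runs
    off-thread and is applied only when every rank is ready."""
    step = 0
    while scheduler.has_work():
        yield model.decode_step(scheduler.next_batch())  # token generation
        step += 1
        # Every rank hits the same step count, so all ranks start the rebalance
        # together; neither call below blocks on the heavy computation.
        if step % rebalance_interval == 0:
            rebalancer.start(model.expert_logical_count())
        rebalancer.poll_and_maybe_apply(model)
```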

Accuracy Tests

Benchmarking and Profiling

Checklist
