feat: introduce async rebalance mode for dynamic EPLB #3
Add support for asynchronous rebalancing in the Expert Parallel Load Balancer (EPLB) to avoid blocking the decoding loop during load analysis. This enables continuous token generation while rebalance computation runs in the background.
New CLI argument:
Implementation details:
This change improves latency stability under dynamic load conditions in MoE models.
Motivation
In Mixture-of-Experts (MoE) models, the Expert Parallel Load Balancer (EPLB) periodically performs load analysis to ensure balanced expert utilization across devices. However, in the current synchronous implementation, this rebalancing process blocks the decoding loop, introducing unpredictable latency spikes—especially under dynamic workloads where frequent rebalance decisions are required. This blocking behavior degrades end-to-end inference performance and undermines the predictability of token generation, which is critical for real-time or interactive applications.
To address this issue, we propose an asynchronous rebalancing mechanism that decouples the computationally intensive load analysis from the token generation pipeline. By offloading rebalance computation to a background thread, the main decoding loop remains non-blocking, enabling continuous token production while maintaining accurate load balancing. This enhancement improves latency stability and system responsiveness under dynamic load conditions, without compromising the correctness or convergence of the rebalancing logic.
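Below is a minimal sketch of the decoupling described above, independent of the actual EPLB code in this PR. All names here (`AsyncRebalancer`, `compute_expert_mapping`, `maybe_submit`, `poll`) are hypothetical and only illustrate the threading structure: the decode loop submits load statistics, continues generating tokens, and applies a new expert mapping only when a background computation has finished.

```python
# Hypothetical sketch of async rebalancing; not the PR's actual implementation.
import threading
import time


def compute_expert_mapping(load_stats):
    """Placeholder for the expensive load-analysis / rebalance computation."""
    time.sleep(0.05)  # stands in for the heavy work
    return sorted(range(len(load_stats)), key=lambda i: load_stats[i])


class AsyncRebalancer:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = None   # result produced by the worker, not yet applied
        self._worker = None

    def maybe_submit(self, load_stats):
        """Kick off a rebalance in the background if none is in flight."""
        if self._worker is not None and self._worker.is_alive():
            return  # previous rebalance still running; skip this round
        self._worker = threading.Thread(
            target=self._run, args=(list(load_stats),), daemon=True
        )
        self._worker.start()

    def _run(self, load_stats):
        mapping = compute_expert_mapping(load_stats)
        with self._lock:
            self._pending = mapping

    def poll(self):
        """Non-blocking check from the decode loop; returns a mapping or None."""
        with self._lock:
            mapping, self._pending = self._pending, None
        return mapping


# The decode loop never blocks on rebalancing: it only polls for a result.
rebalancer = AsyncRebalancer()
for step in range(10):
    # ... generate one token ...
    if step % 4 == 0:
        rebalancer.maybe_submit(load_stats=[3, 1, 7, 2])
    new_mapping = rebalancer.poll()
    if new_mapping is not None:
        pass  # apply the new expert placement at a safe point
```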
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist