Conversation

TheBasy commented Sep 22, 2025

Add support for asynchronous rebalancing in the Expert Parallel Load Balancer (EPLB) to avoid blocking the decoding loop during load analysis. This enables continuous token generation while rebalance computation runs in the background.

New CLI argument:

  • --enable-eplb-rebalance-async: enables asynchronous rebalancing mode
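
For illustration only, the flag would be passed at server launch. The command below is a hypothetical example: only --enable-eplb-rebalance-async comes from this PR, while the sglang.launch_server entry point, the model path, and the --tp-size / --enable-eplb flags are assumptions about the surrounding setup.

```bash
# Hypothetical launch command; only --enable-eplb-rebalance-async is introduced by this PR.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 \
  --enable-eplb \
  --enable-eplb-rebalance-async
```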

Implementation details:

  • Launch a background thread to:
    • Broadcast logical_count
    • Compute ExpertLocationMetadata
    • Store the result in _rebalance_result
  • Use a TP barrier via the Gloo cpu_group (send_single_signal / recv_single_signal) so that all ranks enter the counter-swap phase atomically
  • Introduce a yield-based generator to keep the decoding loop non-blocking
  • Start the model state transfer only after TP-wide agreement via _begin_transfer
  • Keep sync mode unchanged: it still uses the blocking, single-threaded rebalance

This change improves latency stability under dynamic load conditions in MoE models.

Motivation

In Mixture-of-Experts (MoE) models, the Expert Parallel Load Balancer (EPLB) periodically performs load analysis to ensure balanced expert utilization across devices. However, in the current synchronous implementation, this rebalancing process blocks the decoding loop, introducing unpredictable latency spikes—especially under dynamic workloads where frequent rebalance decisions are required. This blocking behavior degrades end-to-end inference performance and undermines the predictability of token generation, which is critical for real-time or interactive applications.

To address this issue, we propose an asynchronous rebalancing mechanism that decouples the computationally intensive load analysis from the token generation pipeline. By offloading rebalance computation to a background thread, the main decoding loop remains non-blocking, enabling continuous token production while maintaining accurate load balancing. This enhancement improves latency stability and system responsiveness under dynamic load conditions, without compromising the correctness or convergence of the rebalancing logic.

Modifications

  • Introduced a new CLI argument --enable-eplb-rebalance-async to enable asynchronous rebalancing mode. When disabled (the default), the original blocking behavior is preserved for backward compatibility.
  • Implemented an asynchronous rebalancing workflow using a dedicated background thread that:
    • broadcasts the local logical_count across tensor parallel (TP) ranks,
    • computes ExpertLocationMetadata from the global load information, and
    • stores the result in a shared _rebalance_result field for later application.
  • Added synchronization via a TP-wide CPU barrier on Gloo's cpu_group, using the send_single_signal and recv_single_signal primitives so that all ranks enter the counter-swap phase atomically, avoiding race conditions.
  • Refactored the decoding loop into a yield-based generator to keep execution non-blocking; model state transfer is deferred until all ranks reach agreement through the _begin_transfer flag. A sketch of this flow follows the list.
  • Preserved the original synchronous mode (async=False): its single-threaded, blocking rebalance logic is untouched, which keeps behavior consistent and makes before/after comparison straightforward.

Together, these changes integrate async rebalancing into the existing EPLB framework while maintaining correctness, scalability, and ease of debugging.
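
Below is a minimal sketch of this flow, assuming a PyTorch setup with Gloo-backed CPU process groups. Every name in it (AsyncEPLBRebalancer, compute_expert_location_metadata, apply_rebalance, the two group arguments) is a hypothetical stand-in rather than this PR's actual code, and in place of the send_single_signal / recv_single_signal helpers it reaches TP-wide agreement with a MIN all-reduce of a readiness flag, which gives the same guarantee that all ranks enter the counter-swap phase in the same decode step.

```python
import threading

import torch
import torch.distributed as dist


def compute_expert_location_metadata(gathered_counts):
    """Placeholder for the real placement computation (not part of this sketch)."""
    raise NotImplementedError


def apply_rebalance(model, metadata):
    """Placeholder for the counter swap and expert weight transfer."""
    raise NotImplementedError


class AsyncEPLBRebalancer:
    """Illustrative async rebalancer: heavy work runs in a background thread,
    the decode loop polls cheaply, and the swap happens only once every TP
    rank agrees to enter the counter-swap phase in the same step."""

    def __init__(self, comm_group, signal_group):
        # Two Gloo-backed CPU groups: the worker thread communicates on
        # comm_group and the decode loop polls on signal_group, so collectives
        # issued from different threads never interleave on one group.
        self._comm_group = comm_group
        self._signal_group = signal_group
        self._rebalance_result = None          # written by the worker thread
        self._local_ready = threading.Event()  # set once this rank's result exists
        self._in_flight = False

    def start(self, logical_count: torch.Tensor):
        """Launch load analysis in the background; returns immediately."""
        self._in_flight = True
        self._local_ready.clear()

        def _worker():
            # Share each rank's CPU-side expert usage counts with all TP ranks,
            # then run the heavy placement computation off the decode thread.
            world = dist.get_world_size(group=self._comm_group)
            gathered = [torch.zeros_like(logical_count) for _ in range(world)]
            dist.all_gather(gathered, logical_count, group=self._comm_group)
            self._rebalance_result = compute_expert_location_metadata(gathered)
            self._local_ready.set()

        threading.Thread(target=_worker, daemon=True).start()

    def poll_and_maybe_apply(self, model) -> bool:
        """Cheap per-step check; applies the new layout only when all ranks agree."""
        if not self._in_flight:
            return False
        # MIN-reduce a readiness flag: the result is 1 only when *every* rank
        # has finished its background computation, so all ranks enter the
        # counter-swap phase during the same decode step.
        flag = torch.tensor([1 if self._local_ready.is_set() else 0])
        dist.all_reduce(flag, op=dist.ReduceOp.MIN, group=self._signal_group)
        if flag.item() == 0:
            return False  # not everyone is ready yet; keep decoding
        apply_rebalance(model, self._rebalance_result)
        self._rebalance_result = None
        self._in_flight = False
        return True
```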
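
A correspondingly hedged view of the yield-based decode loop (decode_step, has_work, next_batch, expert_logical_count, and the fixed rebalance interval are all illustrative, not this PR's interfaces):

```python
def decode_loop(model, scheduler, rebalancer, rebalance_interval: int = 1000):
    """Generator that yields one decode step at a time; the rebalance work runs
    off-thread and is applied only when every rank is ready."""
    step = 0
    while scheduler.has_work():
        yield model.decode_step(scheduler.next_batch())  # token generation
        step += 1
        # Every rank hits the same step count, so all ranks start the rebalance
        # together; neither call below blocks on the heavy computation.
        if step % rebalance_interval == 0:
            rebalancer.start(model.expert_logical_count())
        rebalancer.poll_and_maybe_apply(model)
```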

Accuracy Tests

Benchmarking and Profiling

Checklist
