@PeaBrane PeaBrane commented Jun 25, 2025

Overview:

Fairly high-priority KV router perf update. It makes the router stable and performant across a wider range of hit rates. It is still not quite on par with round-robin in the general setting, but it is getting close.

  1. Do not normalize waiting requests.
    The hand-wavy intuition: if waiting_requests is a proxy for the token load against the max_num_batched_tokens budget, then only the absolute deltas among workers should matter, not the relative deltas.

  2. Perform predictive updates of waiting_requests as done in Rust land.
    A small side note unrelated to this PR: the predictive update of active_kv_blocks does not seem to be hooked up directly to the Rust routing scheduler, which uses gpu_cache_usage_perc.
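A minimal sketch of the two points above, not the router's actual code. It assumes that a "predictive update" means bumping a local waiting_requests count as soon as a request is routed, and overwriting it when fresh metrics arrive. The names `normalized`, `PredictiveWaitingTracker`, `update_from_metrics`, `record_routed`, and `least_loaded` are hypothetical.

```python
def normalized(waiting):
    """Old-style view (assumed): divide each count by the max across workers."""
    peak = max(waiting.values())
    return {w: (n / peak if peak else 0.0) for w, n in waiting.items()}


class PredictiveWaitingTracker:
    """Tracks raw (unnormalized) waiting_requests per worker."""

    def __init__(self, worker_ids):
        self.waiting = {w: 0 for w in worker_ids}

    def update_from_metrics(self, worker_id, waiting_requests):
        # Reported metrics are authoritative and replace any prediction.
        self.waiting[worker_id] = waiting_requests

    def record_routed(self, worker_id):
        # Predictive update: count the request before metrics catch up.
        self.waiting[worker_id] += 1

    def least_loaded(self):
        return min(self.waiting, key=self.waiting.get)


# Normalization hides the absolute gap: both cases look identical.
light = {"w0": 1, "w1": 2}
heavy = {"w0": 50, "w1": 100}
same_after_normalizing = normalized(light) == normalized(heavy)

# Without record_routed, all three requests below would land on "w0",
# because the counts would stay at zero until the next metrics report.
tracker = PredictiveWaitingTracker(["w0", "w1"])
picks = []
for _ in range(3):
    chosen = tracker.least_loaded()
    tracker.record_routed(chosen)
    picks.append(chosen)
```

In this toy setup the normalized view cannot tell a gap of 1 request from a gap of 50, while the raw counts can. The predictive bump spreads a burst of requests across workers between metric refreshes.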

[Plot: predictive_unnormalized_waiting]
(not shown in plot, but normalized + predictive waiting load updates also gave unstable results)

A few more updates:

  1. Normalized the overlap score by kv_total_blocks, not ISL, to make the units consistent with gpu_usage_perc.
  2. Introduced softmax sampling over the worker logits to reduce thrashing; empirically, this improved stability and performance.
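A rough sketch of how these two updates could fit together, under stated assumptions. The logit formula in `worker_logit` (overlap fraction minus cache usage minus raw waiting count) and the `temperature` parameter are illustrative guesses, not the formula used in this PR. All function names are hypothetical.

```python
import math
import random


def worker_logit(overlap_blocks, kv_total_blocks, gpu_cache_usage_perc, waiting_requests):
    # Overlap as a fraction of the worker's total KV blocks, so it is on the
    # same 0..1 scale as the cache usage fraction. (Assumed cost formula.)
    overlap = overlap_blocks / kv_total_blocks
    return overlap - gpu_cache_usage_perc - waiting_requests


def softmax(logits, temperature=1.0):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_worker(worker_ids, logits, temperature=1.0, rng=random):
    # Sample instead of taking the argmax, so near-tied workers share traffic
    # rather than flipping wholesale on small metric changes.
    probs = softmax(logits, temperature)
    return rng.choices(worker_ids, weights=probs, k=1)[0]


workers = ["w0", "w1", "w2"]
logits = [
    worker_logit(overlap_blocks=40, kv_total_blocks=100, gpu_cache_usage_perc=0.5, waiting_requests=0),
    worker_logit(overlap_blocks=10, kv_total_blocks=100, gpu_cache_usage_perc=0.2, waiting_requests=0),
    worker_logit(overlap_blocks=0, kv_total_blocks=100, gpu_cache_usage_perc=0.1, waiting_requests=2),
]
probs = softmax(logits)
choice = sample_worker(workers, logits, rng=random.Random(0))
```

A lower temperature pushes the sampling toward a plain argmax, and a higher one pushes it toward uniform (round-robin-like) routing.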

Summary by CodeRabbit

  • New Features

    • Improved worker selection by introducing predictive tracking of waiting requests, enhancing request routing responsiveness.
  • Refactor

    • Updated the handling of waiting requests metrics to use raw values instead of normalized counts, simplifying the routing logic.


copy-pr-bot bot commented Jun 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@PeaBrane PeaBrane enabled auto-merge (squash) June 27, 2025 07:59
@PeaBrane PeaBrane changed the title from "feat: Unnormalize waiting requests + predictive load updates for Python router (mirroring Rust)" to "feat: Unnormalize waiting requests + predictive load updates for Python router (mirroring Rust) + softmax sampling to reduce thrashing" Jun 27, 2025
@PeaBrane PeaBrane disabled auto-merge June 27, 2025 08:31
@PeaBrane PeaBrane enabled auto-merge (squash) June 27, 2025 08:59
@PeaBrane PeaBrane merged commit 8392e7a into main Jun 27, 2025
9 checks passed
@PeaBrane PeaBrane deleted the rupei/python-router-enhance branch June 27, 2025 09:02
Labels

enhancement (New feature or request), feat, priority::high (High priority issue), size/XL
