feat: Unnormalize waiting requests + predictive load updates for Python router (mirroring Rust) + softmax sampling to reduce thrashing #1638

PeaBrane · 2025-06-25T08:09:55Z

Overview:

A fairly high-priority performance update for the KV router. It makes the router stable and performant across a wider range of hit rates. It is still not quite on par with round-robin in the general setting, but it is getting close.

- Do not normalize waiting requests. The hand-wavy intuition: if waiting_requests is a proxy for the token load against the max_num_batched_tokens budget, then only the absolute differences among workers should matter, not the relative ones.
- Perform predictive updates of waiting_requests, as is already done on the Rust side. A small side note unrelated to this PR: the predictive update of active_kv_blocks does not seem to be hooked up directly to the Rust routing scheduler, which uses gpu_cache_usage_perc.

(Not shown in the plot: combining normalized waiting requests with predictive load updates also gave unstable results.)
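The two points above can be illustrated with a minimal sketch. This is not the code from this PR: the class name WaitingLoadTracker and its methods are hypothetical, and the tie-breaking rule is an assumption made only to keep the example deterministic. It keeps raw waiting-request counts per worker and bumps the count locally as soon as a request is routed, instead of waiting for the next metrics report.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WaitingLoadTracker:
    """Hypothetical helper: raw (unnormalized) waiting requests per worker."""

    waiting: dict[str, int] = field(default_factory=dict)

    def on_metrics(self, worker_id: str, num_requests_waiting: int) -> None:
        # A fresh metrics report overwrites any local prediction.
        self.waiting[worker_id] = num_requests_waiting

    def on_route(self, worker_id: str) -> None:
        # Predictive update: assume the request just routed is now waiting on
        # that worker, without waiting for the next metrics report.
        self.waiting[worker_id] = self.waiting.get(worker_id, 0) + 1

    def least_loaded(self) -> str:
        # Compare raw counts, so only absolute differences between workers
        # matter. Ties are broken by worker id (an assumption of this sketch).
        return min(self.waiting, key=lambda w: (self.waiting[w], w))


tracker = WaitingLoadTracker()
tracker.on_metrics("worker-a", 2)
tracker.on_metrics("worker-b", 3)

picks = []
for _ in range(3):
    choice = tracker.least_loaded()
    tracker.on_route(choice)
    picks.append(choice)
```

Without the predictive update, all three requests in this burst would land on worker-a, because the reported counts would not change until the next metrics report arrives.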

A few more updates:

- Normalized the overlap score by kv_total_blocks rather than ISL (input sequence length), so that its units are consistent with gpu_usage_perc.
- Introduced softmax sampling over the worker logits to reduce thrashing; empirically, this improved stability and performance.
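A rough sketch of these two updates follows. It is an illustration under assumptions, not the router's actual implementation: the function names, the temperature parameter, and the example logits are all hypothetical, and how the real router combines the overlap score with load terms into a logit is not shown here.

```python
import math
import random


def normalized_overlap(overlap_blocks: int, kv_total_blocks: int) -> float:
    # Matched blocks as a fraction of the worker's total KV blocks, so the
    # score is a ratio comparable to a cache-usage fraction.
    return overlap_blocks / kv_total_blocks


def softmax(logits, temperature=1.0):
    # Numerically stable softmax; temperature must be positive.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_worker(worker_logits, temperature=1.0, rng=random):
    # Sample a worker with probability proportional to softmax(logit), rather
    # than always taking the argmax. Near-tied workers then share traffic
    # instead of all requests flipping to whichever is marginally ahead.
    workers = list(worker_logits)
    probs = softmax([worker_logits[w] for w in workers], temperature)
    return rng.choices(workers, weights=probs, k=1)[0]


# Hypothetical, nearly tied logits for two workers.
logits = {"worker-a": -0.46, "worker-b": -0.45}
probs = softmax(list(logits.values()))
choice = sample_worker(logits, rng=random.Random(0))
```

With logits this close, the two workers get nearly equal probability. A lower temperature moves the behavior toward a plain argmax, and a higher one moves it toward uniform random selection.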

Summary by CodeRabbit

New Features
- Improved worker selection by introducing predictive tracking of waiting requests, enhancing request routing responsiveness.
Refactor
- Updated the handling of waiting requests metrics to use raw values instead of normalized counts, simplifying the routing logic.

copy-pr-bot · 2025-06-27T06:55:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


PeaBrane added 2 commits June 25, 2025 00:28

predictive load updates (waiting requests)

0b76d9b

do not normalize waiting requests

62015ef

PeaBrane requested review from GuanLuo, alec-flowers, biswapanda, grahamking, hhzhang16, ishandhanani, julienmancuso, kkranen, mohammedabdulwahhab, nnshah1, piotrm-nvidia, ptarasiewiczNV, rmccorm4, tanmayv25 and tedzhouhk as code owners June 25, 2025 08:09

pull-request-size bot added the size/M label Jun 25, 2025

copy-pr-bot bot temporarily deployed to GITLAB June 25, 2025 08:10 Inactive

github-actions bot added the feat label Jun 25, 2025

copy-pr-bot bot temporarily deployed to GITLAB June 25, 2025 08:10 Inactive

PeaBrane added the enhancement, priority::high, perf, and python labels Jun 25, 2025

PeaBrane requested review from jthomson04, oandreeva-nv, paulhendricks, ryanolson and tmonty12 as code owners June 25, 2025 08:11

copy-pr-bot bot temporarily deployed to GITLAB June 26, 2025 20:37 Inactive

github-actions bot added the feat label Jun 26, 2025

copy-pr-bot bot temporarily deployed to GITLAB June 26, 2025 20:42 Inactive

use only active blocks

62c867c

pull-request-size bot added size/L and removed size/M labels Jun 27, 2025

copy-pr-bot bot temporarily deployed to GITLAB June 27, 2025 01:20 Inactive

softmax sampling

6fba0d8

copy-pr-bot bot temporarily deployed to GITLAB June 27, 2025 06:09 Inactive

recover deleted router args

fc8c6de

update rust router

2a3122f

pull-request-size bot added size/XL and removed size/L labels Jun 27, 2025

PeaBrane added 4 commits June 27, 2025 00:23

cleanups

ab5ce4e

isort

c15efe4

Merge branch 'main' into rupei/python-router-enhance

2a6884e

fmt

94fe195

PeaBrane enabled auto-merge (squash) June 27, 2025 07:59

PeaBrane changed the title ~~feat: Unnormalize waiting requests + predictive load updates for Python router (mirroring Rust)~~ feat: Unnormalize waiting requests + predictive load updates for Python router (mirroring Rust) + softmax sampling to reduce thrashing Jun 27, 2025

mypy is the worst

92da014

PeaBrane disabled auto-merge June 27, 2025 08:31

PeaBrane enabled auto-merge (squash) June 27, 2025 08:59

PeaBrane merged commit 8392e7a into main Jun 27, 2025
9 checks passed

PeaBrane deleted the rupei/python-router-enhance branch June 27, 2025 09:02

jthomson04 mentioned this pull request Jun 27, 2025

fix: Little kv routing fix #1677

Merged

coderabbitai bot mentioned this pull request Jul 5, 2025

feat: predictive active blocks for routing without load metrics #1731

Merged

coderabbitai bot mentioned this pull request Aug 4, 2025

feat: Router replicas with state-sharing #2264

Merged
