feat: vllm mock workers, Rusty skeleton #1033

PeaBrane · 2025-05-11T18:07:34Z

Overview:

Implements a mock worker in Rust that simulates vllm-like behavior. The core components for now are:

A synchronous KV manager implementing a simple LRU eviction strategy, largely following this implementation
An asynchronous scheduler and simulator, combined in one component, that uses this KV manager and respects other budget constraints such as max_num_batched_tokens
Prefill compute is assumed to scale roughly with new_tokens * (new_tokens + cached_tokens), multiplied by a placeholder constant
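As an illustration of the prefill cost model described in the last item, here is a minimal sketch. The function name `prefill_cost` and the constant `PREFILL_COST_SCALE` (including its value) are hypothetical and not taken from the PR; only the `new_tokens * (new_tokens + cached_tokens)` shape comes from the description.

```rust
/// Hypothetical stand-in for the "magic number"; the value is an assumption.
const PREFILL_COST_SCALE: f64 = 1e-6;

/// Estimated prefill compute for one request, in arbitrary time units.
/// Follows the shape new_tokens * (new_tokens + cached_tokens) from the description.
fn prefill_cost(new_tokens: u64, cached_tokens: u64) -> f64 {
    let n = new_tokens as f64;
    let c = cached_tokens as f64;
    PREFILL_COST_SCALE * n * (n + c)
}
```

Under this model a fully cached request (zero new tokens) costs nothing, and for a fixed number of new tokens the cost grows linearly with the number of cached tokens.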

To limit the scope of this PR, it is not yet hooked up to a mock AsyncEngine or dynamo endpoint, and no Python bindings are written. Currently, though, a mock worker can be launched and can generate meaningful ForwardPassMetrics (like those generated by actual vllm workers for KV routing).

Where to start reviewing

The core logic is in:

KvManager.process() in mocker/kv_manager.rs, containing the logic for handling the 4 MoveBlock variants (see below)
The background event loop in Scheduler.new() in mocker/scheduler.rs, handling receiving a request, scheduling a request, and simulating the generation process

Motivation

is two-fold:

To perform predictive routing, based on more accurate predictions of the KvIndexer, beyond the current heuristics.
To implement mock vllm workers for at-scale testing. The most likely use case will be to launch multiple mock vllm workers at once.

Implementation Details

Move Blocks

There is a MoveBlock enum with four variants that can be sent around as events, all handled synchronously by the KV manager:

Use: First checks if the block is in the active pool; if so, increments its reference count. Next checks if it is in the inactive pool; if so, moves it to the active pool. Otherwise, tries evicting from the inactive pool to make room. If the inactive pool is empty, pre-empts the oldest running request.
Destroy: Simply removes the block from the active pool. This is used because vllm does not cache partial blocks.
Deref: Decrements the reference count of a block in the active pool by one. If the reference count reaches zero, moves the block into the inactive pool.
Promote: Promotes a partial block (identified by a uuid) into a full block (identified by a global block hash).

Note: Use adds blocks from root to leaf, while Destroy and Deref remove blocks from leaf to root.
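The handling of these variants can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in mocker/kv_manager.rs: the `BlockId` type, the field names, and the plain VecDeque used for the inactive pool are all hypothetical, and pre-emption is reduced to returning `false` so that a caller could decide which request to pre-empt.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical block identifier: partial blocks carry a uuid-like id,
/// full blocks carry a global block hash.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum BlockId {
    Partial(u128),
    Full(u64),
}

#[derive(Debug)]
enum MoveBlock {
    Use(Vec<BlockId>),
    Destroy(Vec<BlockId>),
    Deref(Vec<BlockId>),
    Promote(u128, u64),
}

struct KvManager {
    max_capacity: usize,
    active: HashMap<BlockId, usize>, // block -> reference count
    inactive: VecDeque<BlockId>,     // oldest first; simplified LRU stand-in
}

impl KvManager {
    fn new(max_capacity: usize) -> Self {
        Self { max_capacity, active: HashMap::new(), inactive: VecDeque::new() }
    }

    /// Returns false when the event cannot be satisfied
    /// (for `Use`, the caller would then pre-empt a request).
    fn process(&mut self, event: MoveBlock) -> bool {
        match event {
            MoveBlock::Use(blocks) => {
                for id in blocks {
                    // 1. Already active: bump the reference count.
                    if let Some(rc) = self.active.get_mut(&id) {
                        *rc += 1;
                        continue;
                    }
                    // 2. Cached but inactive: move it back to the active pool.
                    if let Some(pos) = self.inactive.iter().position(|b| *b == id) {
                        self.inactive.remove(pos);
                        self.active.insert(id, 1);
                        continue;
                    }
                    // 3. New block: if full, evict the oldest inactive block.
                    if self.active.len() + self.inactive.len() >= self.max_capacity
                        && self.inactive.pop_front().is_none()
                    {
                        return false; // nothing left to evict
                    }
                    self.active.insert(id, 1);
                }
                true
            }
            MoveBlock::Destroy(blocks) => {
                for id in blocks {
                    self.active.remove(&id);
                }
                true
            }
            MoveBlock::Deref(blocks) => {
                for id in blocks {
                    let now_zero = match self.active.get_mut(&id) {
                        Some(rc) => {
                            *rc -= 1;
                            *rc == 0
                        }
                        None => false,
                    };
                    if now_zero {
                        self.active.remove(&id);
                        self.inactive.push_back(id);
                    }
                }
                true
            }
            MoveBlock::Promote(uuid, hash) => match self.active.remove(&BlockId::Partial(uuid)) {
                Some(rc) => {
                    self.active.insert(BlockId::Full(hash), rc);
                    true
                }
                None => false,
            },
        }
    }
}
```

With a capacity of two blocks, for example, a third `Use` fails while both blocks are still referenced and succeeds once one of them has been dereferenced into the inactive pool, at which point that block is evicted to make room.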

Evictor

is a modification of the lazy heap introduced in this vllm PR. The gist is as follows:

Use a VecDeque / queue to maintain the blocks, with the order guaranteed by the user when pushing (old blocks first, and leaves first)
Do not update the queue on a timestamp update (e.g. when a block is referenced / touched), since that is O(n); just leave the stale entry in place. Instead, a separate hash map keeps the up-to-date entries.
During removal / eviction, the stale entries will be naturally evicted, as they are "older" than the current ones
If the queue grows beyond a certain threshold (defaults to 50), rebuild the VecDeque from the hash map
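The scheme above can be sketched as follows. This is a minimal illustration, not the code in mocker/evictor.rs: the type name `LazyLruEvictor`, its fields, and the reading of the threshold as a ratio between queue length and live entries are assumptions made for the example.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

/// Hypothetical lazy LRU evictor: the queue may hold stale entries,
/// the hash map is the source of truth for the latest timestamp.
struct LazyLruEvictor<T: Clone + Eq + Hash> {
    latest: HashMap<T, u64>,
    queue: VecDeque<(T, u64)>,
    /// Assumption: rebuild once the queue is this many times larger than the map.
    cleanup_threshold: usize,
}

impl<T: Clone + Eq + Hash> LazyLruEvictor<T> {
    fn new(cleanup_threshold: usize) -> Self {
        Self { latest: HashMap::new(), queue: VecDeque::new(), cleanup_threshold }
    }

    /// Inserts or touches `key`. The caller is assumed to push in eviction
    /// order (non-decreasing timestamps); older queue entries become stale.
    fn insert(&mut self, key: T, timestamp: u64) {
        self.latest.insert(key.clone(), timestamp);
        self.queue.push_back((key, timestamp));
        if self.queue.len() > self.cleanup_threshold * self.latest.len() {
            self.rebuild();
        }
    }

    /// Evicts the oldest live entry, silently dropping stale ones on the way.
    fn evict(&mut self) -> Option<T> {
        while let Some((key, ts)) = self.queue.pop_front() {
            if self.latest.get(&key) == Some(&ts) {
                self.latest.remove(&key);
                return Some(key);
            }
        }
        None
    }

    /// Rebuilds the queue from the map, ordered by timestamp.
    /// Ties are ordered arbitrarily in this sketch.
    fn rebuild(&mut self) {
        let mut live: Vec<(T, u64)> =
            self.latest.iter().map(|(k, ts)| (k.clone(), *ts)).collect();
        live.sort_by_key(|(_, ts)| *ts);
        self.queue = live.into();
    }
}
```

Touching a key only appends to the queue, so repeated touches accumulate stale entries until either `evict` pops past them or the rebuild is triggered.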

This may be memory-intensive if evict is rarely called (a BTreeSet would probably be a better fit).
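For comparison, a BTreeSet-based variant might look like the sketch below. It is only an illustration of the alternative mentioned here, with hypothetical names (`OrderedEvictor`, `by_age`, `latest`); it trades the O(1) lazy touch for an O(log n) update that leaves no stale entries behind.

```rust
use std::collections::{BTreeSet, HashMap};
use std::hash::Hash;

/// Hypothetical evictor that keeps exactly one ordered entry per key.
struct OrderedEvictor<T: Clone + Ord + Hash> {
    /// (timestamp, key) pairs; the smallest pair is the next eviction candidate.
    by_age: BTreeSet<(u64, T)>,
    /// Current timestamp of each key, needed to find its entry in `by_age`.
    latest: HashMap<T, u64>,
}

impl<T: Clone + Ord + Hash> OrderedEvictor<T> {
    fn new() -> Self {
        Self { by_age: BTreeSet::new(), latest: HashMap::new() }
    }

    /// Inserts or touches `key`, replacing its previous entry if any.
    fn insert(&mut self, key: T, timestamp: u64) {
        if let Some(old) = self.latest.insert(key.clone(), timestamp) {
            self.by_age.remove(&(old, key.clone()));
        }
        self.by_age.insert((timestamp, key));
    }

    /// Evicts the entry with the smallest (timestamp, key) pair.
    fn evict(&mut self) -> Option<T> {
        let (_, key) = self.by_age.pop_first()?;
        self.latest.remove(&key);
        Some(key)
    }
}
```

In this variant, ties between equal timestamps are broken by the key's `Ord` implementation, so the key type would need an ordering that matches the desired eviction order.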

Limitation

Does not support chunked prefill, though it should not be difficult to add.
Only considers the KV budget and the batched token budget during scheduling (it does not consider, say, max_requests, which is rarely the bottleneck)
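A rough sketch of admission under just these two budgets is shown below. The `Request` and `Budgets` types, the `admit` function, and the first-come-first-served policy are assumptions made for illustration, not the scheduler in mocker/scheduler.rs. Because there is no chunked prefill, a request whose prompt does not fit in the remaining token budget waits in full.

```rust
use std::collections::VecDeque;

/// Hypothetical waiting request: only the number of uncached prompt tokens matters here.
struct Request {
    id: u64,
    new_tokens: usize,
}

/// Hypothetical per-iteration budgets; `block_size` is assumed to be non-zero.
struct Budgets {
    max_num_batched_tokens: usize,
    free_blocks: usize,
    block_size: usize,
}

/// Admits requests from the front of the queue while both budgets hold.
/// Returns the ids of the admitted requests.
fn admit(waiting: &mut VecDeque<Request>, budgets: &Budgets) -> Vec<u64> {
    let mut tokens_left = budgets.max_num_batched_tokens;
    let mut blocks_left = budgets.free_blocks;
    let mut admitted = Vec::new();
    while let Some(req) = waiting.front() {
        // Round up: a partially filled block still occupies a whole block.
        let blocks_needed = (req.new_tokens + budgets.block_size - 1) / budgets.block_size;
        if req.new_tokens > tokens_left || blocks_needed > blocks_left {
            break; // no chunking: the whole prompt must fit
        }
        tokens_left -= req.new_tokens;
        blocks_left -= blocks_needed;
        admitted.push(req.id);
        waiting.pop_front();
    }
    admitted
}
```

Note that nothing in this sketch caps the number of concurrently running requests, mirroring the missing max_requests limit described above.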

Integration

I will make an effort in the near future to integrate this with the existing tokens.rs and the recent block_manager. As a stand-alone mock API, though, it may also make sense not to depend on too many existing components. Open to discussion.

copy-pr-bot · 2025-05-11T18:07:37Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>

lib/llm/src/mocker/kv_manager.rs

lib/llm/src/mocker/evictor.rs

lib/llm/src/mocker/kv_manager.rs

lib/llm/src/mocker/scheduler.rs

lib/llm/src/mocker/sequence.rs

alec-flowers

Can you add headers to each file describing the purpose and goal? You did a great job in the PR description. It would be good to translate that into the code.

Also, I think it would be useful, once it's set up, to look at what sort of numbers can be generated by running these workers.

We may want the Block Manager to actually emit Events that the KVRouter can receive and act on. This would mean adding functionality from the KVPublisher to the MockWorker.

lib/llm/src/mocker/kv_manager.rs

alec-flowers · 2025-05-21T00:34:09Z

We need to see how we can use this both for collecting numbers / building heuristics, and for testing and mocking things.

PeaBrane · 2025-05-21T00:43:02Z

Can you add headers to each file describing the purpose and goal? You did a great job in the PR description. It would be good to translate that into the code.

Also, I think it would be useful, once it's set up, to look at what sort of numbers can be generated by running these workers.

We may want the Block Manager to actually emit Events that the KVRouter can receive and act on. This would mean adding functionality from the KVPublisher to the MockWorker.

Awesome. Yeah, those would be targets for near-future PRs. I think I first need to hook this up to an AsyncEngine, and would then try to test our current router impls with this mocker.

generic LRU evictor

ee0ce47

PeaBrane requested review from a team, GuanLuo, alec-flowers, biswapanda, grahamking, jthomson04, kkranen, oandreeva-nv, paulhendricks, rmccorm4, ryanolson and tmonty12 as code owners May 11, 2025 18:07

pull-request-size bot added the size/L label May 11, 2025

github-actions bot added the feat label May 11, 2025

PeaBrane marked this pull request as draft May 11, 2025 18:07

PeaBrane mentioned this pull request May 10, 2025

[FEATURE]: a mock worker API #995

Closed

5 tasks

PeaBrane added 2 commits May 12, 2025 00:38

sequence hash with depth (needed for eviction)

6cb8430

small note about derived traits (Ord and PartialOrd)

997404e

PeaBrane changed the title ~~feat: vllm evictor in rust~~ feat: vllm mock workers in rust May 12, 2025

PeaBrane added 9 commits May 12, 2025 12:25

skeleton for mock workers

7c52927

rename to mocker.rs

2864578

rm useless comments

b340b8c

compute seq hashes from block hashes

bf98163

test for seq hash compute

1670658

multi mock workers

dbf0b53

using indexmap

ea59a13

license

6fcee91

active sequence refactor

a064672

PeaBrane added 7 commits May 18, 2025 21:13

no need to expose request sender

3764afa

move the prefill cost logic into scheduler state

05ad9c0

decoding counts for 1 batched token

89f774f

more comments

869c261

Merge branch 'main' into rupei/vllm-evictor

81e1999

Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>

small comment on test

728e1fd

unnecessary cast

ec842d9

PeaBrane marked this pull request as ready for review May 19, 2025 06:12

PeaBrane added 3 commits May 19, 2025 02:16

chore: i love let then

2ed8a06

a test with caching, and more stringent asserts in kv manager

5608d84

chore: more denesting

e52030a

jthomson04 reviewed May 19, 2025

PeaBrane added 5 commits May 19, 2025 14:31

derive getters

85340e3

remove dummy MoveBlockResponse protocol, just use bool

7f621ac

no need for default evictor

0c89c86

reorganized defaults

0efcd6d

debloat, no task handle and derive clone

bd98cbf

alec-flowers approved these changes May 21, 2025

lib/llm/src/mocker/kv_manager.rs (resolved)

lib/llm/src/mocker/kv_manager.rs (resolved)

PeaBrane added 4 commits May 20, 2025 18:11

improved docs + kv cached based decoding time estimation

968f989

integrate with tokens

f38e51a

better but still bugged after interfacing with tokens (cannot promote…

8faa5fc

… error)

fix: reset creation signal as well

b1749ce

PeaBrane changed the title ~~feat: vllm mock workers in Rust~~ feat: vllm mock workers, Rusty skeleton May 21, 2025

PeaBrane enabled auto-merge (squash) May 21, 2025 09:05

don't flood the logs in unit tests

fa280e1

PeaBrane merged commit 03c160a into main May 21, 2025
8 checks passed

PeaBrane deleted the rupei/vllm-evictor branch May 21, 2025 09:46

coderabbitai bot mentioned this pull request May 30, 2025

feat: vllm mocker enhancement #1236

Merged
