@PeaBrane
Contributor

@PeaBrane PeaBrane commented May 11, 2025

Overview:

Implements a mock worker in Rust that simulates vllm-like behavior. The core components for now are:

  1. A synchronous KV manager implementing a simple LRU eviction strategy, largely following this implementation
  2. An asynchronous scheduler + simulator all in one that uses the said KV manager, along with other budget considerations like max_num_batched_tokens
  3. The prefill compute is assumed to scale roughly with new_tokens * (new_tokens + cached_tokens) scaled by a dummy magic number
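
As a rough illustration of item 3, the prefill cost model could be sketched as below. The function name and the constant `PREFILL_COST_SCALE` (including its value) are hypothetical stand-ins for the "dummy magic number"; only the `new_tokens * (new_tokens + cached_tokens)` shape comes from the description above.

```rust
/// Hypothetical scaling constant standing in for the "dummy magic number".
/// The value is an assumption for illustration, not the one used in the PR.
const PREFILL_COST_SCALE: f64 = 1e-6;

/// Estimated prefill compute for one request, in arbitrary units.
/// Each new token attends to all new tokens plus all cached tokens,
/// so the cost grows as new_tokens * (new_tokens + cached_tokens).
fn prefill_cost(new_tokens: u64, cached_tokens: u64) -> f64 {
    let new = new_tokens as f64;
    let cached = cached_tokens as f64;
    PREFILL_COST_SCALE * new * (new + cached)
}
```

Under this model, a request whose prefix is fully cached (no new tokens) costs nothing to prefill, and the cost of a fixed number of new tokens grows linearly with the cached prefix length.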

To limit the scope of this PR, it is not hooked up to a mock AsyncEngine or dynamo endpoint yet, and no Python bindings are written either. But currently, a mock worker can be launched and some meaningful ForwardPassMetrics can be generated (as generated by actual vllm workers for KV routing)

Where to start reviewing

The core logic is in:

  1. KvManager.process() in mocker/kv_manager.rs, containing the logic for handling the 4 MoveBlock variants (see below)
  2. The background event loop in Scheduler.new() in mocker/scheduler.rs, handling receiving a request, scheduling a request, and simulating the generation process

Motivation

is two-fold:

  1. To perform predictive routing, based on more accurate predictions of the KvIndexer, beyond the current heuristics.
  2. To implement mock vllm workers for at-scale testing. The most likely use case will be to launch multiple mock vllm workers at once.

Implementation Details

Move Blocks

There is a MoveBlock enum with four variants that can be sent around as events, all handled synchronously by the KV manager

  1. Use: First checks whether the block is in the active pool; if so, increments its reference count. Otherwise checks whether it is in the inactive pool; if so, moves it to the active pool. Otherwise, tries evicting a block from the inactive pool to make room. If the inactive pool is empty, preempts the oldest running request.
  2. Destroy: Simply removes the block from the active pool. This is used as vllm does not cache partial blocks.
  3. Deref: Decrements the reference count of a block in active pool by one. If reference count is zero, move block into inactive pool.
  4. Promote: Promotes a partial block (identified with uuid) into a full block (identified with a global block hash)

Note: Use adds blocks from root to leaf, and Destroy and Deref remove blocks from leaf to root.
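
A minimal sketch of how the four variants might be handled is given below. It is not the code in mocker/kv_manager.rs: the `BlockId` type, the field names, the `u64` stand-in for the uuid, and the boolean return value (where `false` means the caller would have to preempt a request) are all assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical block identifier: partial blocks carry a uuid (a u64 stand-in
/// here), full blocks carry a global block hash.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum BlockId {
    Partial(u64),
    Full(u64),
}

enum MoveBlock {
    Use(Vec<BlockId>),
    Destroy(Vec<BlockId>),
    Deref(Vec<BlockId>),
    Promote { partial: u64, full: u64 },
}

struct KvManager {
    capacity: usize,                 // total number of blocks
    active: HashMap<BlockId, usize>, // block -> reference count
    inactive: VecDeque<BlockId>,     // cached but unreferenced, oldest first
}

impl KvManager {
    fn new(capacity: usize) -> Self {
        Self { capacity, active: HashMap::new(), inactive: VecDeque::new() }
    }

    /// Returns false if a Use could not find room (assumed preemption signal).
    fn process(&mut self, event: MoveBlock) -> bool {
        match event {
            MoveBlock::Use(blocks) => {
                for b in blocks {
                    // Already active: bump the reference count.
                    if let Some(rc) = self.active.get_mut(&b) {
                        *rc += 1;
                        continue;
                    }
                    // Cached in the inactive pool: move it back to active.
                    if let Some(pos) = self.inactive.iter().position(|x| *x == b) {
                        self.inactive.remove(pos);
                        self.active.insert(b, 1);
                        continue;
                    }
                    // New block: evict the oldest inactive block if full.
                    if self.active.len() + self.inactive.len() >= self.capacity
                        && self.inactive.pop_front().is_none()
                    {
                        return false;
                    }
                    self.active.insert(b, 1);
                }
                true
            }
            MoveBlock::Destroy(blocks) => {
                for b in blocks {
                    self.active.remove(&b);
                }
                true
            }
            MoveBlock::Deref(blocks) => {
                for b in blocks {
                    let now_zero = match self.active.get_mut(&b) {
                        Some(rc) => {
                            *rc -= 1;
                            *rc == 0
                        }
                        None => false,
                    };
                    if now_zero {
                        self.active.remove(&b);
                        self.inactive.push_back(b);
                    }
                }
                true
            }
            MoveBlock::Promote { partial, full } => {
                // Re-key the partial block under its global block hash.
                if let Some(rc) = self.active.remove(&BlockId::Partial(partial)) {
                    self.active.insert(BlockId::Full(full), rc);
                }
                true
            }
        }
    }
}
```

The linear scan over the inactive pool keeps the sketch short; the evictor described in the next section is what avoids that cost.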

Evictor

is a modification of the lazy heap introduced in this vllm PR . The gist is as follows:

  1. Use a VecDeque / queue to maintain the blocks, with the order guaranteed by the user when pushing (old blocks first, and leaves first)
  2. Do not update the queue on timestamp update (e.g. when a block is referenced / touched), since it is O(n), just leave the stale entry there. Instead, a separate hash map is used to keep the updated entries.
  3. During removal / eviction, the stale entries will be naturally evicted, as they are "older" than the current ones
  4. If the queue grows beyond a certain threshold (defaults to 50), rebuild the VecDeque from the hash map

May be memory intensive if evict is rarely called (should probably use a BTreeSet)
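
The four steps above can be sketched as follows. The type and method names are hypothetical, a monotonically increasing counter stands in for the timestamp, and the threshold is interpreted as a ratio of queue length to live entries, which is an assumption since the description does not say how the threshold is measured.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical lazy evictor: the queue may hold stale entries, while the
/// hash map always holds the latest stamp for each evictable block.
struct LazyEvictor {
    queue: VecDeque<(u64, u64)>, // (block hash, stamp), oldest first
    live: HashMap<u64, u64>,     // block hash -> latest stamp
    next_stamp: u64,
    cleanup_threshold: usize,    // assumed: max ratio of queue length to live entries
}

impl LazyEvictor {
    fn new(cleanup_threshold: usize) -> Self {
        Self {
            queue: VecDeque::new(),
            live: HashMap::new(),
            next_stamp: 0,
            cleanup_threshold,
        }
    }

    /// Insert or touch a block. Any older queue entry is left in place as stale.
    fn insert(&mut self, block: u64) {
        let stamp = self.next_stamp;
        self.next_stamp += 1;
        self.live.insert(block, stamp);
        self.queue.push_back((block, stamp));
        if self.queue.len() > self.cleanup_threshold * self.live.len() {
            self.rebuild();
        }
    }

    /// Take a block out of the evictable set (e.g. it became active again).
    fn remove(&mut self, block: u64) -> bool {
        self.live.remove(&block).is_some()
    }

    /// Pop entries until one matches the latest stamp in the map.
    fn evict(&mut self) -> Option<u64> {
        while let Some((block, stamp)) = self.queue.pop_front() {
            if self.live.get(&block) == Some(&stamp) {
                self.live.remove(&block);
                return Some(block);
            }
            // Otherwise the entry is stale and is simply dropped.
        }
        None
    }

    /// Drop all stale entries by rebuilding the queue from the map.
    fn rebuild(&mut self) {
        let mut entries: Vec<(u64, u64)> =
            self.live.iter().map(|(&b, &s)| (b, s)).collect();
        entries.sort_by_key(|&(_, stamp)| stamp);
        self.queue = VecDeque::from(entries);
    }
}
```

Touching a block is O(1) amortized because the old queue entry is never searched for; the price is that stale entries occupy memory until they are popped during an eviction or cleared by a rebuild.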

Limitation

  1. Does not support chunked prefill, but should not be difficult to add.
  2. Only considers the KV budget and the batched token budget during scheduling (does not consider, say, max_requests, which is rarely the bottleneck)
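
To make the second limitation concrete, a two-budget admission check might look like the sketch below. The `Request` struct, the `admit` function, and the strict first-in-first-out policy (stop at the first request that does not fit) are assumptions for illustration; only the two budgets themselves come from the description.

```rust
/// Hypothetical view of a waiting request.
struct Request {
    id: u64,
    new_tokens: usize,    // tokens that still need prefill
    blocks_needed: usize, // KV blocks not already cached
}

/// Admit waiting requests in arrival order until either budget would be
/// exceeded. No cap on the number of requests is applied.
fn admit(waiting: &[Request], max_num_batched_tokens: usize, free_blocks: usize) -> Vec<u64> {
    let mut tokens = 0;
    let mut blocks = 0;
    let mut admitted = Vec::new();
    for r in waiting {
        let over_tokens = tokens + r.new_tokens > max_num_batched_tokens;
        let over_blocks = blocks + r.blocks_needed > free_blocks;
        if over_tokens || over_blocks {
            break; // assumed policy: do not skip ahead of a blocked request
        }
        tokens += r.new_tokens;
        blocks += r.blocks_needed;
        admitted.push(r.id);
    }
    admitted
}
```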

Integration

Will make a near-future effort to integrate with existing components tokens.rs and the recent block_manager. May also make sense to not use too many existing components as a stand-alone mock API. Open to discussion.

@copy-pr-bot

copy-pr-bot bot commented May 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label May 11, 2025
@PeaBrane PeaBrane marked this pull request as draft May 11, 2025 18:07
@PeaBrane PeaBrane mentioned this pull request May 10, 2025
5 tasks
@PeaBrane PeaBrane changed the title from "feat: vllm evictor in rust" to "feat: vllm mock workers in rust" May 12, 2025
@PeaBrane PeaBrane marked this pull request as ready for review May 19, 2025 06:12
Contributor

@alec-flowers alec-flowers left a comment

Can you add headers to each file describing the purpose and goal? You did a great job in the PR description. It would be good to translate that into the code.

Also, I think it would be useful, once it's set up, to look at what sort of numbers can be generated by running these workers.

We may want the Block Manager to actually emit Events that the KVRouter can receive and act on. This would mean adding functionality from the KVPublisher to the MockWorker.

@alec-flowers
Contributor

We need to see how we can utilize this both for collecting numbers / building heuristics, and also in testing and Mock-ing things.

@PeaBrane
Contributor Author

Awesome. Yea, those would be targets / scopes for near-future PRs. I think I need to first hook this up to an AsyncEngine, then would try to test our current router impls with this mocker

@PeaBrane PeaBrane changed the title from "feat: vllm mock workers in Rust" to "feat: vllm mock workers, Rusty skeleton" May 21, 2025
@PeaBrane PeaBrane enabled auto-merge (squash) May 21, 2025 09:05
@PeaBrane PeaBrane merged commit 03c160a into main May 21, 2025
8 checks passed
@PeaBrane PeaBrane deleted the rupei/vllm-evictor branch May 21, 2025 09:46