feat: vllm mock workers, Rusty skeleton #1033
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
alec-flowers left a comment
Can you add headers to each file describing its purpose and goal? You did a great job in the PR description; it would be good to translate that into the code.
Also, I think it would be useful, once it's set up, to look at what sort of numbers can be generated by running these workers.
We may want the Block Manager to actually emit events, so that the KVRouter can receive the signal and act on it. This would mean adding functionality from the KVPublisher to the MockWorker.
We need to see how we can use this both for collecting numbers / building heuristics and for testing and mocking things.
Awesome. Yeah, those would be targets / scopes for near-future PRs. I think I need to first hook this up to an
Overview:
Implements a mock worker in Rust that simulates vllm-like behavior. The core components for now are:
- `max_num_batched_tokens`
- `new_tokens * (new_tokens + cached_tokens)`, scaled by a dummy magic number
To limit the scope of this PR, it is not yet hooked up to a mock `AsyncEngine` or dynamo endpoint, and no Python bindings are written. However, a mock worker can already be launched and can generate meaningful `ForwardPassMetrics` (like those generated by actual vllm workers for KV routing).
Where to start reviewing
The core logic is in:
- `KvManager.process()` in `mocker/kv_manager.rs`, containing the logic for handling the 4 `MoveBlock` variants (see below)
- `Scheduler.new()` in `mocker/scheduler.rs`, which handles receiving a request, scheduling it, and simulating the generation process
Motivation
The motivation is two-fold:
- `KvIndexer`, beyond the current heuristics.
Implementation Details
Move Blocks
There is a `MoveBlock` enum with four variants that can be sent around as events, all handled synchronously by the KV manager:
- `Use`: First checks whether the block is in the active pool; if so, increments its reference count. Next checks whether it is in the inactive pool; if so, moves it to the active pool. Lastly, tries evicting from the inactive pool to make room. If the inactive pool is empty, preempts the oldest running request.
- `Destroy`: Simply removes the block from the active pool. This is used because `vllm` does not cache partial blocks.
- `Deref`: Decrements the reference count of a block in the active pool by one. If the reference count reaches zero, moves the block into the inactive pool.
- `Promote`: Promotes a partial block (identified by a UUID) into a full block (identified by a global block hash).
Note
`Use` adds blocks from root to leaf, while `Destroy` and `Deref` remove blocks from leaf to root.
Evictor
The evictor is a modification of the lazy heap introduced in this vllm PR. The gist is as follows:
- `VecDeque` / queue to maintain the blocks, with the order guaranteed by the user when pushing (old blocks first, and leaves first)
- `VecDeque` from the hashmap
May be memory intensive if `evict` is rarely called (should probably use a `BTreeSet`).
Limitation
- `max_requests`, which rarely is the bottleneck)
Integration
I will make a near-future effort to integrate with existing components: `tokens.rs` and the recent `block_manager`. It may also make sense not to rely on too many existing components, so that this stays a stand-alone mock API. Open to discussion.
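
For reviewers who want a mental model before opening `mocker/kv_manager.rs`, here is a minimal sketch of the event handling described under Move Blocks. It is not the code in this PR. The `BlockId` type, the payload shapes of each variant, the `capacity` field, and the `bool` return value (`false` meaning "the caller should preempt the oldest running request") are all assumptions made for illustration, as is the choice to evict only when the pools are full.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical identifier: partial blocks carry a UUID-like id,
/// full blocks carry a global block hash.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum BlockId {
    Partial(u128),
    Full(u64),
}

/// The four events; payload shapes are assumptions, not the PR's definitions.
#[derive(Debug)]
enum MoveBlock {
    Use(Vec<BlockId>),
    Destroy(Vec<BlockId>),
    Deref(Vec<BlockId>),
    Promote(u128, u64),
}

struct KvManager {
    capacity: usize,                 // total blocks across both pools
    active: HashMap<BlockId, usize>, // block -> reference count
    inactive: VecDeque<BlockId>,     // eviction order, oldest first
}

impl KvManager {
    fn new(capacity: usize) -> Self {
        Self { capacity, active: HashMap::new(), inactive: VecDeque::new() }
    }

    /// Returns `false` when a `Use` finds no room and nothing to evict.
    /// Blocks acquired earlier in the same event are not rolled back here.
    fn process(&mut self, event: MoveBlock) -> bool {
        match event {
            MoveBlock::Use(blocks) => {
                for b in blocks {
                    // 1. Already active: bump the reference count.
                    if let Some(rc) = self.active.get_mut(&b) {
                        *rc += 1;
                        continue;
                    }
                    // 2. Inactive: move it back to the active pool.
                    if let Some(pos) = self.inactive.iter().position(|x| *x == b) {
                        self.inactive.remove(pos);
                        self.active.insert(b, 1);
                        continue;
                    }
                    // 3. New block: if full, evict the oldest inactive block.
                    if self.active.len() + self.inactive.len() >= self.capacity
                        && self.inactive.pop_front().is_none()
                    {
                        return false;
                    }
                    self.active.insert(b, 1);
                }
                true
            }
            MoveBlock::Destroy(blocks) => {
                for b in blocks {
                    self.active.remove(&b);
                }
                true
            }
            MoveBlock::Deref(blocks) => {
                for b in blocks {
                    let unreferenced = match self.active.get_mut(&b) {
                        Some(rc) => {
                            *rc -= 1;
                            *rc == 0
                        }
                        None => false,
                    };
                    if unreferenced {
                        self.active.remove(&b);
                        self.inactive.push_back(b);
                    }
                }
                true
            }
            MoveBlock::Promote(uuid, hash) => {
                // Re-key the partial block under its global hash, keeping the count.
                if let Some(rc) = self.active.remove(&BlockId::Partial(uuid)) {
                    self.active.insert(BlockId::Full(hash), rc);
                }
                true
            }
        }
    }
}
```

A caller would build a manager with `KvManager::new(n)` and feed it events in request order. Note that the linear scan of the inactive pool in step 2 is only for brevity.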
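
The Evictor section mentions a `VecDeque` paired with a hashmap and a memory cost when `evict` is rarely called. One way to read that is as lazy deletion: removals only touch the hashmap, and stale queue entries are skipped when `evict` finally pops them. The sketch below illustrates that reading. The `LazyEvictor` name, its methods, and the per-push stamp used to detect stale entries are hypothetical and may differ from what the PR actually does.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical lazy-deletion evictor keyed by block hash.
struct LazyEvictor {
    queue: VecDeque<(u64, u64)>, // (block hash, stamp), in push order
    live: HashMap<u64, u64>,     // block hash -> stamp of its valid entry
    next_stamp: u64,
}

impl LazyEvictor {
    fn new() -> Self {
        Self { queue: VecDeque::new(), live: HashMap::new(), next_stamp: 0 }
    }

    /// The caller is responsible for pushing in eviction order
    /// (old blocks first, and leaves first).
    fn push(&mut self, block: u64) {
        let stamp = self.next_stamp;
        self.next_stamp += 1;
        self.live.insert(block, stamp);
        self.queue.push_back((block, stamp));
    }

    /// O(1) removal: only the hashmap is updated, and the queue entry
    /// stays behind as a stale record until `evict` reaches it.
    fn remove(&mut self, block: u64) -> bool {
        self.live.remove(&block).is_some()
    }

    /// Pops entries until one is still live, skipping stale ones.
    fn evict(&mut self) -> Option<u64> {
        while let Some((block, stamp)) = self.queue.pop_front() {
            if self.live.get(&block) == Some(&stamp) {
                self.live.remove(&block);
                return Some(block);
            }
        }
        None
    }

    fn len(&self) -> usize {
        self.live.len()
    }
}
```

Because stale entries are only dropped inside `evict`, the queue can grow well past `len()` when blocks are removed and re-pushed often without any eviction, which matches the memory concern raised above.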