[Metrics] Log multi-modal cache stats and fix reset #26285
Conversation
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Code Review
This pull request introduces metrics for the multi-modal cache, including query and hit counts. The implementation involves a nice refactoring of the existing prefix caching metrics into a more generic CachingMetrics class, which is then used for both prefix and multi-modal cache statistics. The changes are well-structured and the tests have been updated to cover both text and multimodal models. I found one issue in the test suite where an assertion relies on a coincidence in test data, which I've commented on.
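For context, a minimal sketch of what such a generic aggregator could look like; the field and method names here are illustrative assumptions, not the exact vLLM API:

```python
from dataclasses import dataclass


@dataclass
class CachingMetrics:
    """Generic query/hit aggregator (illustrative sketch only; the
    actual vLLM class may differ)."""

    requests: int = 0  # number of cache lookups observed
    queries: int = 0   # total items queried (tokens, mm items, ...)
    hits: int = 0      # items that were served from the cache

    def observe(self, queries: int, hits: int) -> None:
        """Record one lookup of `queries` items, `hits` of which hit."""
        self.requests += 1
        self.queries += queries
        self.hits += hits

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0
```

Both the prefix cache and the multi-modal cache can then update the same counters, which is what lets the PR expose hit rates for both without duplicating logic.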
💡 Codex Review
vllm/vllm/v1/core/kv_cache_manager.py (lines 188 to 194 in db606d9):

```python
if self.log_stats:
    assert self.prefix_cache_stats is not None
    if request.num_preemptions > 0:
        # Previously preempted request
        self.prefix_cache_stats.preempted_requests += 1
        self.prefix_cache_stats.preempted_queries += request.num_tokens
        self.prefix_cache_stats.preempted_hits += num_new_computed_tokens
```
The refactor in PrefixCacheStats removes the preempted_requests, preempted_queries, and preempted_hits attributes, but KVCacheManager.get_computed_blocks still increments those fields when a request is preempted. Any preempted request with stats logging enabled will now raise AttributeError: 'PrefixCacheStats' object has no attribute 'preempted_requests' before the scheduler returns, breaking request handling whenever preemption occurs. Either keep these attributes on the stats object or stop writing to them.
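A minimal sketch of the first option, restoring the counters on the stats object; the base fields are assumptions and the actual vLLM definition may differ:

```python
from dataclasses import dataclass


@dataclass
class PrefixCacheStats:
    # Assumed base counters kept by the refactor.
    requests: int = 0
    queries: int = 0
    hits: int = 0
    # Restoring these keeps KVCacheManager.get_computed_blocks working
    # when a preempted request is rescheduled; the alternative is to
    # delete the three writes in the manager instead.
    preempted_requests: int = 0
    preempted_queries: int = 0
    preempted_hits: int = 0
```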
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
/gemini review
Code Review
This pull request introduces metrics for multi-modal cache statistics and refactors the testing framework to support both text and multi-modal models, removing obsolete v0 engine logic. The changes are well-structured, particularly the generalization of caching metrics. However, I've identified a critical issue in one of the metric tests where an expected value is calculated incorrectly, which could lead to false positives in testing.
💡 Codex Review
Here are some automated review suggestions for this pull request.
I'm unable to get
/gemini review
💡 Codex Review
vllm/vllm/executor/ray_distributed_executor.py (lines 382 to 395 in 610625d):

```python
# Initialize the actual workers inside worker wrapper.
all_kwargs = []
for rank, (node_id, _) in enumerate(worker_node_and_gpu_ids):
    local_rank = node_workers[node_id].index(rank)
    kwargs = dict(
        vllm_config=self.vllm_config,
        local_rank=local_rank,
        rank=rank,
        distributed_init_method=distributed_init_method,
        is_driver_worker=(not self.parallel_config)
        or (rank % self.parallel_config.tensor_parallel_size == 0),
    )
    all_kwargs.append(kwargs)
self._run_workers("init_worker", all_kwargs)
```
The new WorkerWrapperBase.init_worker requires a shared_worker_lock when mm_processor_cache_type='shm' and raises a ValueError if it is missing. Multiprocess and uniprocess executors were updated to pass the lock, but the ray executor still builds all_kwargs without that argument before calling _run_workers("init_worker", all_kwargs). Any model launched via RayDistributedExecutor with SHM multimodal caching will now fail during worker initialization even though it used to run (albeit without the optimization). Please include a shared_worker_lock in the kwargs for each ray worker so the cache can be created instead of crashing.
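A hedged sketch of the suggested change, pulled out into a standalone helper for clarity. The function name and signature are hypothetical, and how the lock is constructed is deliberately left open: a plain multiprocessing.Lock is not picklable to remote Ray actors, so the real fix has to create something Ray-compatible (e.g. per-node).

```python
from typing import Any


def build_worker_kwargs(
    vllm_config: Any,
    worker_node_and_gpu_ids: list[tuple[str, Any]],
    node_workers: dict[str, list[int]],
    distributed_init_method: str,
    tensor_parallel_size: int,
    shared_worker_lock: Any,
) -> list[dict[str, Any]]:
    """Build per-worker init kwargs that include shared_worker_lock so
    WorkerWrapperBase.init_worker does not raise a ValueError when
    mm_processor_cache_type='shm'. Sketch only; names are illustrative."""
    all_kwargs = []
    for rank, (node_id, _) in enumerate(worker_node_and_gpu_ids):
        local_rank = node_workers[node_id].index(rank)
        all_kwargs.append(
            dict(
                vllm_config=vllm_config,
                local_rank=local_rank,
                rank=rank,
                distributed_init_method=distributed_init_method,
                is_driver_worker=(rank % tensor_parallel_size == 0),
                # New: forward the lock so the SHM cache can be created.
                shared_worker_lock=shared_worker_lock,
            )
        )
    return all_kwargs
```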
Code Review
This pull request introduces metrics for multi-modal cache statistics and refactors the shared memory cache implementation for better code structure. The changes are well-integrated with the existing metrics and logging infrastructure, and new tests have been added to validate the functionality. However, the refactoring introduces critical breaking changes to the V0 engine execution path by modifying shared components like WorkerBase without ensuring backward compatibility. These regressions need to be addressed to prevent breaking existing V0 functionality for users.
Seems the distributed test failure is related.
Thanks for the cc and sorry for the delay - metrics side LGTM!
Purpose
This is a redo of #16478. It is much simplified from the original PR, as we have recently consolidated the code related to the multi-modal cache so that we only need to log the stats from the P0 sender cache (see the sketch below).
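For illustration, a minimal sketch of the kind of per-interval stats record the P0 sender cache could emit. The class name echoes the PR's intent, but the fields and reset behavior shown here are assumptions, not the PR's exact API:

```python
from dataclasses import dataclass


@dataclass
class MultiModalCacheStats:
    """Sketch of a per-interval stats record for the P0 sender cache."""

    queries: int = 0  # multi-modal items looked up in the sender cache
    hits: int = 0     # items that were already cached

    def reset(self) -> "MultiModalCacheStats":
        """Return the current window and start a fresh one."""
        snapshot = MultiModalCacheStats(self.queries, self.hits)
        self.queries = self.hits = 0
        return snapshot


# Example: what a periodic logger could report.
stats = MultiModalCacheStats()
stats.queries += 4
stats.hits += 3
window = stats.reset()
print(f"MM cache hit rate: {window.hits / max(window.queries, 1):.0%}")
```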
This PR also fixes/updates the following:
- `reset_mm_cache` is not run for the SHM cache. (Since `clear_cache` in the SHM receiver cache is currently a no-op, this doesn't really matter for now.)
- `MultiModalBudget` is not cleared after engine startup.
- `WorkerBase.execute_model` is updated to be in line with the V1 code.

cc @markmc
Test Plan
Test Result