
Conversation

@DarkLight1337 (Member) commented Oct 6, 2025

Purpose

This is a redo of #16478. It is much simpler than the original PR because we recently consolidated the multi-modal cache code, so we only need to log the stats from the P0 sender cache.

  • Moved the SHM cache implementation to WorkerWrapperBase so that there is no longer separate logic inside UniProcExecutor and MultiprocExecutor (a rough sketch of the resulting shape follows this list).
  • Fixed a potential issue where reset_mm_cache is not run for the SHM cache. (Since clear_cache in the SHM receiver cache is currently a no-op, this doesn't matter in practice yet.)
  • Fixed the cache inside MultiModalBudget not being cleared after engine startup.
  • Fixed the signature of WorkerBase.execute_model to match the V1 code.
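
As a rough, hypothetical sketch of the consolidated shape described in the first bullet (names and config plumbing here are illustrative; only WorkerWrapperBase, init_worker, shared_worker_lock, and the mm_processor_cache_type == "shm" condition come from this PR's discussion):

from typing import Any, Optional

class WorkerWrapperBase:
    """Hypothetical sketch of the consolidated SHM-cache setup."""

    def __init__(self, rpc_rank: int = 0) -> None:
        self.rpc_rank = rpc_rank
        self.mm_receiver_cache: Optional[dict] = None

    def init_worker(self, all_kwargs: list[dict[str, Any]]) -> None:
        kwargs = all_kwargs[self.rpc_rank]
        shared_worker_lock = kwargs.pop("shared_worker_lock", None)
        cache_type = kwargs.pop("mm_processor_cache_type", None)

        if cache_type == "shm":
            if shared_worker_lock is None:
                # Every executor must hand the lock down to the wrapper;
                # see the ray executor discussion later in this thread.
                raise ValueError(
                    "shared_worker_lock is required when "
                    "mm_processor_cache_type='shm'")
            # Placeholder for constructing the real SHM receiver cache.
            self.mm_receiver_cache = {"lock": shared_worker_lock}

Because the wrapper owns this setup, each executor type only has to pass the same lock in its init_worker kwargs instead of duplicating the cache construction.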

cc @markmc

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces metrics for the multi-modal cache, including query and hit counts. The implementation involves a nice refactoring of the existing prefix caching metrics into a more generic CachingMetrics class, which is then used for both prefix and multi-modal cache statistics. The changes are well-structured and the tests have been updated to cover both text and multimodal models. I found one issue in the test suite where an assertion relies on a coincidence in test data, which I've commented on.
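
As context for the refactor described above, here is a minimal self-contained sketch of what a generic query/hit aggregator in the style of CachingMetrics could look like (field and method names are illustrative, not the exact vLLM API):

from dataclasses import dataclass

@dataclass
class CacheStats:
    """Per-interval stats reported by a cache (prefix or multi-modal)."""

    queries: int = 0
    hits: int = 0

class CachingMetrics:
    """Aggregates CacheStats so one class can serve both cache types."""

    def __init__(self) -> None:
        self.queries_total = 0
        self.hits_total = 0

    def observe(self, stats: CacheStats) -> None:
        self.queries_total += stats.queries
        self.hits_total += stats.hits

    @property
    def hit_rate(self) -> float:
        # Avoid division by zero before any queries are recorded.
        if self.queries_total == 0:
            return 0.0
        return self.hits_total / self.queries_total

Under this scheme, the prefix cache and the P0 multi-modal sender cache would each feed their own CachingMetrics instance.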

@chatgpt-codex-connector bot left a comment

💡 Codex Review

if self.log_stats:
    assert self.prefix_cache_stats is not None
    if request.num_preemptions > 0:
        # Previously preempted request
        self.prefix_cache_stats.preempted_requests += 1
        self.prefix_cache_stats.preempted_queries += request.num_tokens
        self.prefix_cache_stats.preempted_hits += num_new_computed_tokens

P0: Restore preempted prefix cache fields

The refactor in PrefixCacheStats removes the preempted_requests, preempted_queries, and preempted_hits attributes, but KVCacheManager.get_computed_blocks still increments those fields when a request is preempted. Any preempted request with stats logging enabled will now raise AttributeError: 'PrefixCacheStats' object has no attribute 'preempted_requests' before the scheduler returns, breaking request handling whenever preemption occurs. Either keep these attributes on the stats object or stop writing to them.
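
The first of the two options above (keeping the attributes) could look like the following sketch; the field names are taken from the snippet above, while the dataclass layout is an assumption:

from dataclasses import dataclass

@dataclass
class PrefixCacheStats:
    requests: int = 0
    queries: int = 0
    hits: int = 0
    # Retained so the preemption path shown above in
    # KVCacheManager.get_computed_blocks can keep incrementing these
    # counters without raising AttributeError.
    preempted_requests: int = 0
    preempted_queries: int = 0
    preempted_hits: int = 0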

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

@DarkLight1337 marked this pull request as draft October 6, 2025 08:57
@DarkLight1337 (Member Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces metrics for multi-modal cache statistics and refactors the testing framework to support both text and multi-modal models, removing obsolete v0 engine logic. The changes are well-structured, particularly the generalization of caching metrics. However, I've identified a critical issue in one of the metric tests where an expected value is calculated incorrectly, which could lead to false positives in testing.

@DarkLight1337 marked this pull request as ready for review October 6, 2025 09:32
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@DarkLight1337 (Member Author) commented:

I'm unable to get tests/entrypoints/openai/test_metrics.py to pass locally, even on the main branch. @markmc, can you try it?

@DarkLight1337 (Member Author) commented:

/gemini review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

# Initialize the actual workers inside worker wrapper.
all_kwargs = []
for rank, (node_id, _) in enumerate(worker_node_and_gpu_ids):
    local_rank = node_workers[node_id].index(rank)
    kwargs = dict(
        vllm_config=self.vllm_config,
        local_rank=local_rank,
        rank=rank,
        distributed_init_method=distributed_init_method,
        is_driver_worker=(not self.parallel_config)
        or (rank % self.parallel_config.tensor_parallel_size == 0),
    )
    all_kwargs.append(kwargs)
self._run_workers("init_worker", all_kwargs)

P1: Ray workers never receive shared_worker_lock for SHM cache

The new WorkerWrapperBase.init_worker requires a shared_worker_lock when mm_processor_cache_type='shm' and raises a ValueError if it is missing. Multiprocess and uniprocess executors were updated to pass the lock, but the ray executor still builds all_kwargs without that argument before calling _run_workers("init_worker", all_kwargs). Any model launched via RayDistributedExecutor with SHM multimodal caching will now fail during worker initialization even though it used to run (albeit without the optimization). Please include a shared_worker_lock in the kwargs for each ray worker so the cache can be created instead of crashing.
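
A sketch of the suggested fix, mirroring the snippet above: include the lock in each ray worker's kwargs. (How the lock object is created and shared across ray workers is not shown here and may differ in the actual fix.)

kwargs = dict(
    vllm_config=self.vllm_config,
    local_rank=local_rank,
    rank=rank,
    distributed_init_method=distributed_init_method,
    is_driver_worker=(not self.parallel_config)
    or (rank % self.parallel_config.tensor_parallel_size == 0),
    # Previously missing: without this, init_worker raises ValueError
    # whenever mm_processor_cache_type='shm'.
    shared_worker_lock=shared_worker_lock,
)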


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces metrics for multi-modal cache statistics and refactors the shared memory cache implementation for better code structure. The changes are well-integrated with the existing metrics and logging infrastructure, and new tests have been added to validate the functionality. However, the refactoring introduces critical breaking changes to the V0 engine execution path by modifying shared components like WorkerBase without ensuring backward compatibility. These regressions need to be addressed to prevent breaking existing V0 functionality for users.

@mergify bot added the tpu (Related to Google TPUs) label Oct 9, 2025
@DarkLight1337 changed the title from "[Metrics] Log multi-modal cache stats" to "[Metrics] Log multi-modal cache stats and fix reset" Oct 9, 2025
@DarkLight1337 requested review from Isotr0py and removed the request for ywang96 October 9, 2025 10:02
@Isotr0py (Member) commented Oct 9, 2025

Seems the distributed test failure is related.

@vllm-bot merged commit ad430a6 into vllm-project:main Oct 10, 2025
50 of 52 checks passed
@github-project-automation bot moved this from In Progress to Done in Multi-modality Core Oct 10, 2025
@DarkLight1337 deleted the log-mm-cache-stats-new branch October 10, 2025 08:46
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
@markmc (Member) commented Oct 10, 2025

cc @markmc

Thanks for the cc and sorry for the delay; the metrics side LGTM!

Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025

Labels

  • frontend
  • multi-modality: Related to multi-modality (#4194)
  • performance: Performance-related issues
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • tpu: Related to Google TPUs
  • v1

Projects

Status: Done
