[Model] Add num_cached_tokens for PoolingRequestOutput #27378
Conversation
Signed-off-by: wang.yuqi <noooop@126.com>
Force-pushed from 0ee0fd4 to 11f60fc.
Start CI tests to check which CI failures in main still need to be fixed.
vllm_outputs = vllm_model.classify(example_prompts)
...
# First Run
vllm_model.classify(example_prompts)
Should we check that initially the number of cached tokens is zero?
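A hedged sketch of that check, assuming the classify() helper returns per-prompt outputs that expose num_cached_tokens (the fixture names vllm_model and example_prompts come from the diff above; the exact return type is an assumption):

def test_num_cached_tokens_starts_at_zero(vllm_model, example_prompts):
    # First run: nothing is cached yet, so each output should report zero
    # cached tokens.
    first_outputs = vllm_model.classify(example_prompts)
    assert all(out.num_cached_tokens == 0 for out in first_outputs)

    # Second run over the same prompts: the prefix cache is now warm, so a
    # non-zero count is expected.
    second_outputs = vllm_model.classify(example_prompts)
    assert all(out.num_cached_tokens > 0 for out in second_outputs)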
PoolingRequestOutput[Any](
    request_id="",
    outputs=processed_outputs,
    num_cached_tokens=getattr(
Why do we need getattr here? In what case is that not available?
The result of io_processor might not have this value.
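A minimal, self-contained illustration of the fallback being discussed: if the object produced by an IO processor plugin does not define num_cached_tokens, getattr supplies a default instead of raising AttributeError (IOProcessorResult below is a hypothetical stand-in, not a real vLLM class):

from dataclasses import dataclass

@dataclass
class IOProcessorResult:
    # Hypothetical plugin output: note there is no num_cached_tokens field.
    data: list[float]

result = IOProcessorResult(data=[0.1, 0.9])

# Accessing the attribute directly would raise AttributeError; getattr with a
# default degrades gracefully to 0.
num_cached = getattr(result, "num_cached_tokens", 0)
print(num_cached)  # -> 0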
Please unblock Language Models Test (Extended Pooling) and Language Models Test (MTEB) to check for CI failures in the main branch that still need to be fixed.
Hmm... I think we should make this a property of PoolingRequestOutput itself?
Something like
@property
def num_cached_tokens(self) -> int:
    return getattr(self.processed_outputs, "num_cached_tokens", 0)
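For context, a self-contained sketch of how that suggestion could look; the wrapped-output field name follows the snippet above, and the overall shape of the class is an assumption rather than the actual vLLM definition:

from dataclasses import dataclass
from typing import Any

@dataclass
class PoolingRequestOutput:
    request_id: str
    processed_outputs: Any = None

    @property
    def num_cached_tokens(self) -> int:
        # Derived lazily from the wrapped output instead of being passed to
        # the constructor; falls back to 0 when the value is absent.
        return getattr(self.processed_outputs, "num_cached_tokens", 0)

# Call sites then read output.num_cached_tokens without any getattr.
print(PoolingRequestOutput(request_id="demo").num_cached_tokens)  # -> 0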
ok
…27378) Signed-off-by: wang.yuqi <noooop@126.com>
…27378) Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
…o step_forward
* 'step_forward' of https://github.com/raindaywhu/vllm: (148 commits)
  [Model] Add MoE support for NemotronH (vllm-project#25863)
  [Metrics] [KVConnector] Add connector prefix cache hit rate stats (vllm-project#26245)
  [CI] Reorganize entrypoints tests (vllm-project#27403)
  add SLA information into comparison graph for vLLM Benchmark Suite (vllm-project#25525)
  [CI/Build] Fix AMD CI: test_cpu_gpu.py (vllm-project#27388)
  [Bugfix] Fix args settings for guided decoding args (vllm-project#27375)
  [CI/Build] Fix Prithvi plugin test (vllm-project#27393)
  [Chore] Remove duplicate `has_` functions in vllm.utils (vllm-project#27372)
  [Model] Add num_cached_tokens for PoolingRequestOutput (vllm-project#27378)
  [V1][spec decode] return logprobs for spec decoding (vllm-project#26060)
  [CORE] Support Prefix Caching with Prompt Embeds (vllm-project#27219)
  [Bugfix][Core] running queue index leakage exception (vllm-project#26754)
  [Bugfix] Fix incorrect kv cache metrics in grafana.json (vllm-project#27133)
  [Bugfix] Fix SLA tuner initialization (vllm-project#27355)
  [Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_field_config=True` with tensor schema support (vllm-project#27361)
  [MLA] Bump FlashMLA (vllm-project#27354)
  [Chore] Separate out system utilities from vllm.utils (vllm-project#27201)
  [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (vllm-project#27128)
  [Feature] publisher default set zmq in kv_event config (vllm-project#26915)
  [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (vllm-project#27211)
  ...
…27378) Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.