
Conversation

@KuntaiDu (Collaborator) commented Sep 25, 2025

Purpose

Refactor of #25363. This PR enables combining the hybrid allocator with the KV cache connector in a backward-compatible way.

Test Script



import os

# Set token chunk size to 256
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
# Enable CPU memory backend
os.environ["LMCACHE_LOCAL_CPU"] = "True"
# Set CPU memory limit to 20GB
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20.0"
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
os.environ["LMCACHE_USE_LAYERWISE"] = "True"


from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Configure KV cache transfer to use LMCache
ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",
)

# Initialize LLM with LMCache configuration
# Adjust gpu_memory_utilization based on your GPU memory
llm = LLM(model="google/gemma-3-4b-it",
          kv_transfer_config=ktc,
          max_model_len=75000,
          gpu_memory_utilization=0.18,
          enforce_eager=True)

# Define sampling parameters
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=10)

# Run inference
outputs = llm.generate("hi" * 70000 + "\nhow are you?", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

# This requires loading KV cache and will succeed
outputs = llm.generate("hi" * 10000 + "\nTell me a story.", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

# Flush out the prefix cache on the GPU
outputs = llm.generate("1" + "hi" * 70000 + "\nhow are you?", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

# This requires loading KV cache,
# but the request cannot be executed because vLLM cannot allocate blocks
# for the long prefix stored by LMCache
outputs = llm.generate("hi" * 70000 + "\nTell me a story.", sampling_params)
generated_text = outputs[0].outputs[0].text
print(f"Generated text: {generated_text!r}")

Test Result

Success.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request successfully enables the use of the hybrid allocator with the KV cache connector by removing the explicit restriction and adding the necessary logic to handle multiple KV cache groups. The changes are well-structured, introducing a SupportsHMA interface to check for compatibility. My review focuses on improving code quality and performance. I've identified an opportunity to refactor duplicated code for better maintainability and two instances where an expensive deepcopy operation can be replaced with a more efficient shallow copy, which should improve initialization performance.
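For context, a minimal sketch of the shallow-copy pattern the review refers to; the config class and field names here are placeholders, not vLLM types:

import copy
from dataclasses import dataclass, field

@dataclass
class SomeConfig:
    # Placeholder standing in for a large config object.
    model: str
    extra: dict = field(default_factory=dict)

base = SomeConfig(model="m", extra={"k": "v"})

# copy.deepcopy would recursively duplicate every nested object, which is
# expensive for large configs. copy.copy duplicates only the top-level object,
# which is enough when we only reassign a top-level attribute on the copy.
holder = copy.copy(base)
holder.extra = {"k2": "v2"}  # reassign on the copy; base.extra is untouched
assert base.extra == {"k": "v"}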

@KuntaiDu (Collaborator Author)

@NickLucche @njhill This is the refactored version of #25363, PTAL

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
mergify bot commented Oct 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @KuntaiDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 1, 2025
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@markmc (Member) left a comment

Apologies for arriving late with these comments

SupportsHMA interface instead of list[int] from
KVConnectorBase_V1 to support hybrid memory allocation.
"""
return self._lmcache_engine.request_finished(request, block_ids)
Member

How do we know whether the new argument format is supported by the lmcache code that's installed? I'd expect a version check or some sort of capability check here
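For illustration, such a guard could look roughly like this (a sketch, not actual vLLM/LMCache code; the minimum version string is made up):

from importlib.metadata import PackageNotFoundError, version
from packaging.version import Version

def lmcache_supports_hma(min_version: str = "0.3.0") -> bool:
    # Best-effort capability check: require an lmcache release that accepts
    # the new block_ids format. The real threshold would be whichever
    # LMCache release merged LMCache/LMCache#1436.
    try:
        return Version(version("lmcache")) >= Version(min_version)
    except PackageNotFoundError:
        return False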

Member

Ok, I see LMCache/LMCache#1436 now - you need to require this version somehow? The old version will blow up if you pass it the tuple?

Collaborator Author

LMCache is not relying on request_finished, so changing the signature is OK.

Member

I really don't follow - this PR will work with older lmcache versions?

Collaborator Author

Currently request_finished is a placeholder function that simply does nothing in LMCache, so it is OK to pass block_ids as tuple[list[int], ...] to LMCache because LMCache won't process it anyway.
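In other words (a simplified sketch of the point being made, not the actual LMCache code):

# A placeholder that never touches block_ids tolerates either argument shape,
# so passing tuple[list[int], ...] instead of list[int] cannot break it.
def request_finished(self, request, block_ids):
    # No-op today: block_ids is ignored entirely.
    return False, None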

def request_finished(
self,
request: "Request",
block_ids: tuple[list[int], ...],
Member

I understand why it's tempting to do this, but I think this sort of overloading can cause unnecessary confusion - how about making this more explicit by calling the method something like request_finished_all_groups() ?

Collaborator Author

I still need to think about this for a while, but it might be a good idea.

Collaborator Author

@heheda12345 @NickLucche I have no preference personally. Do you have a preference on adding a new function request_finished_all_groups for the case where block_ids is passed as a tuple of lists of ints?

# all connectors to support HMA.
return self.connector.request_finished(request, block_ids[0])
else:
return self.connector.request_finished(request, block_ids)
Member

I'd prefer to see this logic in the connectors module ...

def request_finished(
        connector: KVConnectorBase_V1,
        request: "Request",
        block_ids: tuple[list[int], ...],
    ) -> tuple[bool, dict[str, Any] | None]:
    if isinstance(connector, SupportsHMA):
        return connector.request_finished_all_groups(request, block_ids)
    else:  # for backwards compatibility
        return connector.request_finished(request, block_ids[0])

Collaborator Author

This function _connector_finished is already a small wrapper function that contains < 10 LoC besides comments. Building one more wrapper on top of it may feel a bit over-abstracted.

if self.vllm_config.kv_transfer_config is not None:
assert len(self.kv_cache_config.kv_cache_groups) == 1, (
"Multiple KV cache groups are not currently supported "
"with KV connectors"
Member

It would be good to move this assertion into the backwards compat code

def request_finished(
        connector: KVConnectorBase_V1,
        request: "Request",
        block_ids: tuple[list[int], ...],
    ) -> tuple[bool, dict[str, Any] | None]:
    if isinstance(connector, SupportsHMA):
        return connector.request_finished_all_groups(request, block_ids)
    else:  # for backwards compatibility
        assert len(connector.kv_cache_config.kv_cache_groups) == 1
        return connector.request_finished(request, block_ids[0])

Collaborator Author

A similar assertion is already done when initializing the connector. It is better to assert during initialization rather than in request_finished, so the user does not see the vLLM server launch successfully and then hit the assertion failure during inference.
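A rough sketch of that startup-time check, assuming it runs wherever the connector is constructed (the function name and wiring are illustrative; SupportsHMA is the interface introduced by this PR):

def check_connector_compatibility(connector, kv_cache_config) -> None:
    # Fail fast at startup: a connector that does not implement the HMA
    # interface can only handle a single KV cache group, so reject the
    # combination before any request is served.
    if not isinstance(connector, SupportsHMA):
        assert len(kv_cache_config.kv_cache_groups) == 1, (
            "This connector does not support hybrid memory allocation; "
            "multiple KV cache groups require a SupportsHMA connector."
        )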

# because `initialize_kv_cache` will inject kv cache groups not
# related to kv cache connector (e.g. kv cache sharing layers).
connector_vllm_config = copy.copy(self.vllm_config)
connector_vllm_config.kv_cache_config = copy.copy(kv_cache_config)
Member

We're using a copy of VllmConfig as a holder to send KVCacheConfig down to the connector? And VllmConfig doesn't ordinarily have a kv_cache_config member? That seems extremely brittle?

Why not just make KVCacheConfig a constructor parameter for KVConnector?

Or could we instead supply the connector the layer ID/name to KV cache group ID mapping? Will all connectors need this mapping to support HMA?

@KuntaiDu (Collaborator Author) commented Oct 21, 2025

Two reasons for not initializing with vllm_config and kv_cache_config as separate arguments:

  • Having kv_cache_config as an extra arg in the constructor breaks backward compatibility: an old connector may fail to initialize if we pass vllm_config and kv_cache_config as two separate args.
  • Putting KVCacheConfig into vllm_config aligns with @heheda12345 's future refactoring direction.
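On the connector side, the injected field can then be read defensively; a sketch (the connector class and constructor signature are illustrative):

class MyConnector:
    def __init__(self, vllm_config, role):
        # kv_cache_config is attached to the copied VllmConfig before the
        # connector is constructed (see the diff above); fall back to None
        # for call sites that never inject it.
        self.kv_cache_config = getattr(vllm_config, "kv_cache_config", None)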

Member

Backwards compat is tricky, but we can't allow ourselves to be trapped in a situation where any new data for KVConnector must be stuffed into VllmConfig

e.g. we could add an init_kv_cache_config() method and detect whether a connector implements the method
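i.e. something along these lines (a sketch of the suggestion, not existing vLLM code):

def maybe_init_kv_cache_config(connector, kv_cache_config) -> None:
    # New-style connectors opt in by implementing init_kv_cache_config();
    # old connectors simply never receive the extra state.
    init_fn = getattr(connector, "init_kv_cache_config", None)
    if callable(init_fn):
        init_fn(kv_cache_config)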

Collaborator Author

It should be stuffed into vllm_config because the design goal of vllm_config is to centralize all configuration in one place. Also, according to @heheda12345, the extra init argument kv_cache_config will be merged into vllm_config soon. So for now I would still prefer inserting kv_cache_config directly into vllm_config and then removing the injection once that refactor is done.

I understand that allowing arbitrary field injection is generally not ideal, but it aligns with the design goal of vllm_config and the refactoring direction of kv_cache_config, so I would still prefer the current approach.

Member

I'm only just seeing this now but agree with @markmc this is hacky and I don't think we should have merged it.

Either include the changes to add that field to VllmConfig or we could use introspection on the connector for backwards compatibility.
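The introspection route could look roughly like this (an illustrative sketch; the parameter name and positional arguments are assumptions):

import inspect

def construct_connector(connector_cls, vllm_config, role, kv_cache_config):
    # Pass kv_cache_config only if the connector's __init__ declares it,
    # so external connectors with the old constructor signature keep working.
    params = inspect.signature(connector_cls.__init__).parameters
    if "kv_cache_config" in params:
        return connector_cls(vllm_config, role, kv_cache_config=kv_cache_config)
    return connector_cls(vllm_config, role)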

parallel_config.decode_context_parallel_size,
)

ensure_kv_transfer_initialized(vllm_config)
Member

Is it possible that other connectors (out of tree perhaps) might break by initializing earlier?

Collaborator Author

It should be fine because both locations are still before actual model execution and CUDA graph capture. So in terms of the ability to add extra GPU operations before/after attention and before/after the forward pass, the two locations are equivalent.

@markmc (Member) commented Oct 21, 2025

Just noticed this:

def _update_requests_with_invalid_blocks():
    ...
    # TODO (davidb): add support for hybrid memory allocator
    (req_block_ids,) = self.kv_cache_manager.get_block_ids(req_id)

from the PR (#19330) which adds logic to retry requests locally in D if KV fetching from P fails
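(For clarity, that unpacking assumes exactly one KV cache group; with the hybrid allocator there can be several, e.g.:)

# Works with a single KV cache group:
(req_block_ids,) = ([1, 2, 3],)
# With multiple groups the same unpacking raises
# "ValueError: too many values to unpack (expected 1)":
# (req_block_ids,) = ([1, 2, 3], [4, 5, 6])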

@KuntaiDu (Collaborator Author)

Just noticed this:

def _update_requests_with_invalid_blocks():
    ...
    # TODO (davidb): add support for hybrid memory allocator
    (req_block_ids,) = self.kv_cache_manager.get_block_ids(req_id)

from the PR (#19330) which adds logic to retry requests locally in D if KV fetching from P fails

Got it. I'd prefer to fix it in a future PR though.

mergify bot commented Oct 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @KuntaiDu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 22, 2025
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
@mergify mergify bot removed the needs-rebase label Oct 23, 2025
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@simon-mo simon-mo disabled auto-merge October 25, 2025 06:34
@simon-mo simon-mo merged commit b853540 into vllm-project:main Oct 25, 2025
50 of 52 checks passed
rohin-garg pushed a commit to rohin-garg/vllm that referenced this pull request Oct 25, 2025
… KV cache connector (vllm-project#25712)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
… KV cache connector (vllm-project#25712)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
markmc added a commit to markmc/vllm that referenced this pull request Oct 31, 2025
Follow on from vllm-project#25712

`VllmConfig` is explicitly designed as a dataclass containing
user-provided configuration and model metadata. It is a global
configuration object that lives throughout the entire engine lifetime
and is meant to be immutable after `__post_init__()`.

`KVCacheConfig` is worker-specific, runtime-computed state. It has
limited lifetime, and its purpose is limited to initializing the KV
Cache in the model runner.

Even if we add KV cache hints to model config.json in future, this
would be parsed into `ModelConfig`, used as input to the
`get_kv_cache_configs()` computation, and the resulting
`KVCacheConfig` would still be runtime state.

We are currently creating per-worker copies of VllmConfig in order
to attach the runtime `KVCacheConfig` state. But instead we should
just explicitly pass `KVCacheConfig` to the connector.

Make sure to handle backwards compatibility for external connector
implementations (loaded via module path) that have the old style
constructor signature.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1


7 participants