[Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector #25712
Conversation
Code Review
This pull request successfully enables the use of the hybrid allocator with the KV cache connector by removing the explicit restriction and adding the necessary logic to handle multiple KV cache groups. The changes are well-structured, introducing a SupportsHMA interface to check for compatibility. My review focuses on improving code quality and performance. I've identified an opportunity to refactor duplicated code for better maintainability and two instances where an expensive deepcopy operation can be replaced with a more efficient shallow copy, which should improve initialization performance.
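For reference, a small standalone illustration of the deepcopy-versus-shallow-copy distinction the review points at; the `Inner`/`Outer` classes are hypothetical stand-ins, not vLLM types:

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Inner:
    values: list = field(default_factory=list)

@dataclass
class Outer:
    inner: Inner = field(default_factory=Inner)

outer = Outer()
shallow = copy.copy(outer)    # new Outer, but it still shares the same Inner object
deep = copy.deepcopy(outer)   # recursively clones Inner and its list as well

assert shallow.inner is outer.inner
assert deep.inner is not outer.inner
```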
@NickLucche @njhill This is the refactored version of #25363, PTAL
This pull request has merge conflicts that must be resolved before it can be merged.
Apologies for arriving late with these comments
        SupportsHMA interface instead of list[int] from
        KVConnectorBase_V1 to support hybrid memory allocation.
        """
        return self._lmcache_engine.request_finished(request, block_ids)
How do we know whether the new argument format is supported by the lmcache code that's installed? I'd expect a version check or some sort of capability check here
Ok, I see LMCache/LMCache#1436 now - you need to require this version somehow? The old version will blow up if you pass it the tuple?
LMCache is not relying on request_finished, so changing the signature is OK.
I really don't follow - this PR will work with older lmcache versions?
Currently request_finished is a placeholder function that simply does nothing in LMCache. So it is OK even if we pass block_ids as tuple[list[int]] to LMCache because LMCache won't process it anyway.
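For what it's worth, the version-gate markmc asks about could look roughly like the sketch below. The distribution name `lmcache` and the placeholder minimum version are assumptions, and the discussion above concludes such a gate isn't strictly needed because the old request_finished is a no-op:

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

# Placeholder threshold: the real minimum would be whichever LMCache release
# first includes the change from LMCache/LMCache#1436.
_MIN_LMCACHE_VERSION = Version("0.0.0")


def lmcache_supports_hma() -> bool:
    """Best-effort check that the installed lmcache package is new enough."""
    try:
        return Version(version("lmcache")) >= _MIN_LMCACHE_VERSION
    except PackageNotFoundError:
        return False
```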
    def request_finished(
        self,
        request: "Request",
        block_ids: tuple[list[int], ...],
I understand why it's tempting to do this, but I think this sort of overloading can cause unnecessary confusion - how about making this more explicit by calling the method something like request_finished_all_groups() ?
I still need to think about this for a while, but it might be a good idea
@heheda12345 @NickLucche I have no preference personally. Do you have a preference on adding a new function request_finished_all_groups for passing block_ids as a tuple of lists of ints?
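As a rough illustration of the naming discussion in this thread, the explicit-method variant could be a runtime-checkable protocol along these lines. The SupportsHMA name and the request_finished_all_groups signature follow the snippets quoted in this conversation; whether the real interface is a Protocol, an ABC, or a plain mixin is an assumption:

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsHMA(Protocol):
    """Connectors that accept per-KV-cache-group block IDs.

    The method name mirrors the request_finished_all_groups suggestion
    above; the exact signature is assumed for illustration.
    """

    def request_finished_all_groups(
        self,
        request: "Request",
        block_ids: tuple[list[int], ...],
    ) -> tuple[bool, dict[str, Any] | None]:
        ...
```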
            # all connectors to support HMA.
            return self.connector.request_finished(request, block_ids[0])
        else:
            return self.connector.request_finished(request, block_ids)
I'd prefer to see this logic in the connectors module ...
def request_finished(
    connector: KVConnectorBase_V1,
    request: "Request",
    block_ids: tuple[list[int], ...],
) -> tuple[bool, dict[str, Any] | None]:
    if isinstance(connector, SupportsHMA):
        return connector.request_finished_all_groups(request, block_ids)
    else:  # for backwards compatibility
        return connector.request_finished(request, block_ids[0])
This function _connector_finished is already a small wrapper function that contains < 10 LoC besides comments. Building one more wrapper on top of it may feel a bit over-abstracted.
        if self.vllm_config.kv_transfer_config is not None:
            assert len(self.kv_cache_config.kv_cache_groups) == 1, (
                "Multiple KV cache groups are not currently supported "
                "with KV connectors"
It would be good to move this assertion into the backwards compat code
def request_finished(
    connector: KVConnectorBase_V1,
    request: "Request",
    block_ids: tuple[list[int], ...],
) -> tuple[bool, dict[str, Any] | None]:
    if isinstance(connector, SupportsHMA):
        return connector.request_finished_all_groups(request, block_ids)
    else:  # for backwards compatibility
        assert len(connector.kv_cache_config.kv_cache_groups) == 1
        return connector.request_finished(request, block_ids[0])
A similar assertion is done when initializing the connector. It is better to assert during initialization rather than in request_finished, to avoid the case where the user sees the vLLM server launch successfully but the assertion then fails during inference.
        # because `initialize_kv_cache` will inject kv cache groups not
        # related to kv cache connector (e.g. kv cache sharing layers).
        connector_vllm_config = copy.copy(self.vllm_config)
        connector_vllm_config.kv_cache_config = copy.copy(kv_cache_config)
We're using a copy of VllmConfig as a holder to send KVCacheConfig down to the connector? And VllmConfig doesn't ordinarily have a kv_cache_config member? That seems extremely brittle?
Why not just make KVCacheConfig a constructor parameter for KVConnector?
Or could we instead supply the connector the layer ID/name to KV cache group ID mapping? Will all connectors need this mapping to support HMA?
Two reasons for not initializing with vllm_config and kv_cache_config separately:
- Having `kv_cache_config` as an extra arg in the constructor breaks backward compatibility, because an old connector may fail to initialize if we pass `vllm_config` and `kv_cache_config` as two separate args.
- Putting `KVCacheConfig` into `vllm_config` aligns with @heheda12345's future refactoring direction.
Backwards compat is tricky, but we can't allow ourselves to be trapped in a situation where any new data for KVConnector must be stuffed into VllmConfig
e.g. we could add an init_kv_cache_config() method and detect whether a connector implements the method
It should be stuffed into vllm_config because the design goal of vllm_config is to centralize all configuration in one place. Also, according to @heheda12345, the extra init argument kv_cache_config will be merged into vllm_config soon. So for now I would still prefer inserting kv_cache_config directly into vllm_config and then removing the injection once that work is done.
I understand that allowing arbitrary field injection is generally not ideal, but it aligns with the design goal of vllm_config and the refactoring direction of kv_cache_config, so I would still prefer the current approach.
I'm only just seeing this now, but I agree with @markmc that this is hacky and I don't think we should have merged it.
Either include the changes to add that field to VllmConfig or we could use introspection on the connector for backwards compatibility.
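For illustration only, the capability detection that markmc and njhill describe might look like the sketch below; the init_kv_cache_config method name comes from the comment above, and everything else is assumed:

```python
from __future__ import annotations

from typing import Any


def maybe_init_kv_cache_config(connector: Any, kv_cache_config: Any) -> None:
    """Call the optional hook only if the connector defines it.

    Old (possibly out-of-tree) connectors that lack the hook keep working
    through whatever legacy path the engine already uses.
    """
    init_hook = getattr(connector, "init_kv_cache_config", None)
    if callable(init_hook):
        init_hook(kv_cache_config)
```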
        parallel_config.decode_context_parallel_size,
    )

    ensure_kv_transfer_initialized(vllm_config)
Is it possible that other connectors (out of tree perhaps) might break by initializing earlier?
It should be fine because both locations are still before real model execution and CUDA graph capturing, so in terms of the ability to add extra GPU operations before/after attention and before/after the forward pass, the two locations are equivalent.
Just noticed this: from the PR (#19330) which adds logic to retry requests locally in D if KV fetching from P fails
Got it. I'd prefer to fix it in a future PR though.
This pull request has merge conflicts that must be resolved before it can be merged.
Follow on from vllm-project#25712

`VllmConfig` is explicitly designed as a dataclass containing user-provided configuration and model metadata. It is a global configuration object that lives throughout the entire engine lifetime and is meant to be immutable after `__post_init__()`.

`KVCacheConfig` is worker-specific, runtime-computed state. It has a limited lifetime, and its purpose is limited to initializing the KV cache in the model runner. Even if we add KV cache hints to the model's config.json in the future, those would be parsed into `ModelConfig`, used as input to the `get_kv_cache_configs()` computation, and the resulting `KVCacheConfig` would still be runtime state.

We are currently creating per-worker copies of `VllmConfig` in order to attach the runtime `KVCacheConfig` state. Instead, we should just explicitly pass `KVCacheConfig` to the connector.

Make sure to handle backwards compatibility for external connector implementations (loaded via module path) that have the old-style constructor signature.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
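A minimal sketch of the constructor-signature handling described in that follow-up; the (vllm_config, role) ordering and the kv_cache_config keyword are assumptions about the eventual API, not the merged vLLM interface:

```python
import inspect


def construct_connector(connector_cls, vllm_config, role, kv_cache_config):
    """Instantiate a connector while tolerating the old constructor signature.

    New-style connectors accept a kv_cache_config parameter; external
    connectors loaded via module path may only accept (vllm_config, role).
    """
    params = inspect.signature(connector_cls.__init__).parameters
    if "kv_cache_config" in params:
        return connector_cls(vllm_config, role, kv_cache_config=kv_cache_config)
    return connector_cls(vllm_config, role)
```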
Purpose
Refactor of #25363. This PR enables the combination of hybrid allocator + KV cache connector in a backward-compatible way.
Test Script
Test Result
Success.