Description
🚀 The feature, motivation and pitch
Motivation
vLLM v1 provides the KVConnectorBase_V1 class, which lets the KV cache be stored in and retrieved from external storage, trading storage for computation. This feature has not yet been integrated into vllm-ascend. Because the connector interface supports multiple storage backends in a general way, adding it is worth considering.
Proposed Change
We can follow gpu_model_runner.py and layer.py in vLLM v1 and add the corresponding calls to the connector functions in vllm-ascend. Additionally, because we want to return each request's successfully dumped blocks once the worker-side save completes, we have added a return value to the wait_for_save function. Upstream vLLM does not have this return value.
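To illustrate the proposed return value, here is a minimal sketch that does not import vLLM. `ToyKVConnector`, `bind_requests`, `run_step`, and the dict-based "storage" are hypothetical stand-ins, and the signatures are simplified compared with KVConnectorBase_V1. The sketch assumes a block counts as dumped only when every layer has been saved for it, which may differ from the actual implementation.

```python
from typing import Dict, List, Set, Tuple


class ToyKVConnector:
    """Hypothetical worker-side connector; not the real KVConnectorBase_V1."""

    def __init__(self) -> None:
        # Stand-in for external storage: (block_id, layer_name) -> payload.
        self.store: Dict[Tuple[int, str], bytes] = {}
        # Blocks owned by each request in the current step.
        self._pending: Dict[str, List[int]] = {}
        # Layers already saved for each block in the current step.
        self._saved_layers: Dict[int, Set[str]] = {}

    def bind_requests(self, req_to_blocks: Dict[str, List[int]]) -> None:
        """Record which blocks belong to which request (assumed helper)."""
        self._pending = {r: list(b) for r, b in req_to_blocks.items()}
        self._saved_layers = {}

    def save_kv_layer(self, layer_name: str, kv_layer: Dict[int, bytes]) -> None:
        """Save one layer's KV data, keyed by block id (simplified signature)."""
        for block_id, payload in kv_layer.items():
            self.store[(block_id, layer_name)] = payload
            self._saved_layers.setdefault(block_id, set()).add(layer_name)

    def wait_for_save(self, num_layers: int) -> Dict[str, List[int]]:
        """Return, per request, the blocks saved for all layers.

        The return value is the proposed addition; upstream wait_for_save
        returns nothing.
        """
        dumped = {
            req_id: [
                b for b in block_ids
                if len(self._saved_layers.get(b, set())) == num_layers
            ]
            for req_id, block_ids in self._pending.items()
        }
        self._pending = {}
        return dumped


def run_step(
    connector: ToyKVConnector,
    req_to_blocks: Dict[str, List[int]],
    layers: Dict[str, Dict[int, bytes]],
) -> Dict[str, List[int]]:
    """Mimic a model-runner step: save layer by layer, then wait for the save."""
    connector.bind_requests(req_to_blocks)
    for layer_name, kv_layer in layers.items():
        connector.save_kv_layer(layer_name, kv_layer)
    return connector.wait_for_save(num_layers=len(layers))
```

With a return value like this, the caller could report to the scheduler which blocks are safe to reuse from external storage, rather than assuming every save succeeded.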
Code
Alternatives
No response
Additional context
No response