[RFC]: native kvcache offloading #3241

### Motivation.

Currently, vLLM v1 has an in-house solution for offloading KV cache data from GPU memory to another medium (in particular, CPU memory). It is not yet available in vllm-ascend, so I propose adding this feature to vllm-ascend.

There is a proposed RFC from vLLM ([#19854](https://github.com/vllm-project/vllm/issues/19854)) and related PRs ([#19848](https://github.com/vllm-project/vllm/pull/19848), [#20075](https://github.com/vllm-project/vllm/pull/19848), [#21448](https://github.com/vllm-project/vllm/pull/21448), [#22595](https://github.com/vllm-project/vllm/pull/22595), and [#24251](https://github.com/vllm-project/vllm/pull/24251)).

### Proposed Change.

To enable this feature, we need to add a `CPUOffloadingSpec` compatible with vllm-ascend, and a `CpuNpuOffloadingHandler` that performs KV cache offloading (NPU -> CPU) and loading (CPU -> NPU).
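
To make the handler's role concrete, here is a minimal, self-contained sketch of the two copy directions. It is not the vLLM or vllm-ascend interface: `ToyBlockPool` and `ToyCpuNpuOffloadingHandler` are hypothetical names, and both the NPU and CPU KV caches are modelled as plain host `bytearray` pools so the example runs without any device. A real handler would presumably issue asynchronous device copies and report completion to the scheduler, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlockPool:
    """Hypothetical fixed-size pool of KV blocks, backed by one flat bytearray."""
    num_blocks: int
    block_bytes: int
    data: bytearray = field(init=False)

    def __post_init__(self) -> None:
        self.data = bytearray(self.num_blocks * self.block_bytes)

    def view(self, block_id: int) -> memoryview:
        """Return a writable view over a single block."""
        if not 0 <= block_id < self.num_blocks:
            raise IndexError(f"block {block_id} out of range")
        start = block_id * self.block_bytes
        return memoryview(self.data)[start:start + self.block_bytes]


class ToyCpuNpuOffloadingHandler:
    """Hypothetical stand-in that copies blocks between two pools."""

    def __init__(self, npu_pool: ToyBlockPool, cpu_pool: ToyBlockPool) -> None:
        if npu_pool.block_bytes != cpu_pool.block_bytes:
            raise ValueError("pools must use the same block size")
        self.npu_pool = npu_pool
        self.cpu_pool = cpu_pool

    @staticmethod
    def _copy(src: ToyBlockPool, dst: ToyBlockPool,
              pairs: list[tuple[int, int]]) -> None:
        # Each pair is (source block id, destination block id).
        for src_id, dst_id in pairs:
            dst.view(dst_id)[:] = src.view(src_id)

    def offload(self, pairs: list[tuple[int, int]]) -> None:
        """NPU -> CPU: pairs are (npu_block_id, cpu_block_id)."""
        self._copy(self.npu_pool, self.cpu_pool, pairs)

    def load(self, pairs: list[tuple[int, int]]) -> None:
        """CPU -> NPU: pairs are (cpu_block_id, npu_block_id)."""
        self._copy(self.cpu_pool, self.npu_pool, pairs)
```

In this toy model, offloading block 2 of a 4-block NPU pool into slot 5 of a larger CPU pool is `handler.offload([(2, 5)])`, and bringing it back into NPU block 0 is `handler.load([(5, 0)])`. Block IDs are mapped explicitly because the two pools generally have different sizes.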

### Feedback Period.

_No response_

### CC List.

_No response_

### Any Other Things.

_No response_
