
Conversation

@intelgaoxiong

Details:

  • item1
  • ...

Tickets:

  • ticket-id

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Comment on lines 260 to 266
const auto& prefill_compiled = m_npuw_llm_compiled_model->m_prefill_compiled;
for (std::size_t idx = 0; idx < prefill_compiled->m_compiled_submodels.size(); ++idx) {
if (prefill_compiled->submodel_device(idx) == "NPU") {
pre_alloc_on_npu = true;
break;
}
}
Owner


So what's the reason for this check? I believe the idea is to.. guarantee kvcache will be allocated on NPU to make sure the 2nd model reads it with no overhead - but shouldn't we check the 2nd model devices instead?
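For illustration, a minimal sketch of what checking the 2nd (generate) model could look like, assuming the compiled model exposes it via a member analogous to m_prefill_compiled (the name m_kvcache_compiled below is an assumption):

    // Sketch: check the generate-stage (kvcache) model instead, since it is the one
    // that will read the pre-allocated past KV tensors.
    bool pre_alloc_on_npu = false;
    const auto& kvcache_compiled = m_npuw_llm_compiled_model->m_kvcache_compiled;  // assumed member name
    for (std::size_t idx = 0; idx < kvcache_compiled->m_compiled_submodels.size(); ++idx) {
        if (kvcache_compiled->submodel_device(idx) == "NPU") {
            pre_alloc_on_npu = true;
            break;
        }
    }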

Author


Yes, you're right.
Modified.


// Record that past_kv has already been bound; a data copy will be needed when updating
// past KV in the infer requests to ensure the correct data layout
m_past_kv_binded = true;
Owner


Suggested change
m_past_kv_binded = true;
m_past_kv_bound = true;

Comment on lines 278 to 280
if (input_name.find("past_key_values") == std::string::npos) {
continue;
}
Owner


I believe there were some predefined constants for this
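For illustration, a sketch of what that could look like; the exact constant name (layer_names::past_key_values below) is an assumption:

    // Sketch: use a predefined constant for the past KV input prefix instead of a
    // raw string literal (layer_names::past_key_values is an assumed name).
    if (input_name.find(layer_names::past_key_values) == std::string::npos) {
        continue;
    }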

Comment on lines +291 to +294
auto origTensor = m_prefill_request->get_tensor(input_port);
auto new_tensor =
ov::get_tensor_impl(ov::Tensor(origTensor->get_element_type(), origTensor->get_shape(), data));
m_prefill_request->set_tensor(input_port, new_tensor);
Owner


What you set here as an input is a larger tensor than it is supposed to be, e.g. for prefill with prompt 1024 we'll in fact set 1152 (assuming MIN_RESPONSE_LEN 128).

While it certainly works for the case when the attention block and the range selector are properly identified (we're taking the necessary views of that tensor), it will break if:

  1. the attention block fails to be detected, so the prefill model actually has all static shapes
  2. the range is set to "ALL" (where we just do "set_tensor"); the past k/v tensors won't be compatible with the attention mask passed to the subgraph.

So I'd recommend taking a view [0..PROMPT_SIZE] over the kv-dim here.
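For illustration, a minimal sketch of that suggestion using the ov::Tensor ROI constructor; kv_dim and prompt_size are assumed to be known at this point:

    // Sketch: wrap the full-size buffer, then take a view [0..PROMPT_SIZE] over the kv-dim
    // so the prefill request only sees the prompt-sized region (no data copy here).
    auto origTensor = m_prefill_request->get_tensor(input_port);
    ov::Tensor full_tensor(origTensor->get_element_type(), origTensor->get_shape(), data);
    ov::Coordinate begin(full_tensor.get_shape().size(), 0);
    ov::Coordinate end(full_tensor.get_shape());
    end[kv_dim] = prompt_size;  // assumed: kv-dim index and prompt length
    ov::Tensor view(full_tensor, begin, end);  // ROI view over the kv-dim
    m_prefill_request->set_tensor(input_port, ov::get_tensor_impl(view));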

Author

@intelgaoxiong Sep 22, 2025


auto new_tensor =
            ov::get_tensor_impl(ov::Tensor(origTensor->get_element_type(), origTensor->get_shape(), data));

The new_tensor reuses the data pointer, but its shape is still origTensor->get_shape(), which is already PROMPT_SIZE on the kv-dim.

I think taking a view [0..PROMPT_SIZE] over the kv-dim may not work right now? The NPU does not support strided parameters at this moment.

Owner


But in prefill, the NPU won't access it either?

Yes, strided input is not supported, but we can pipeline it with the host-side copy; it shouldn't impact prefill much.

Author


Chunk prefill will access past KV as well.

Comment on lines +546 to +547
// Create backup of past KV tensor when buffer sharing is enabled to prevent data corruption
// This is necessary because subsequent copy operations would overwrite the shared buffer
Owner


When sharing is in place, we don't need to copy anything, is that right?

We only need to copy the last chunk's results.

Author


Initially, I shared the same perspective.
However, after hitting some accuracy issues and debugging, I realized that copying is needed.

This is because, during the prefill and decoding phases, although the same buffer is shared, the past KV tensor shapes differ between the two phases.

For instance, consider an input length of 8K and an output length of 128:
In the prefill phase, the past k tensor shape is 1x8x7168x1.
In the decoding phase, the past k tensor shape is 1x8x8319x1.
These differing tensor shapes imply that the tensors have different strides.
Consequently, the past KV stored in a 1x8x7168x1 tensor cannot be directly used for decoding.
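To make the layout mismatch concrete, here's a purely illustrative sketch (shapes taken from the example above; buf is a hypothetical element pointer into the shared past-k buffer) of why data written with the prefill strides must be repacked on the host before the decode phase can read it:

    #include <cstring>  // std::memmove

    // Sketch only: repack data written with the prefill layout (1x8x7168x1) so it can be
    // read with the decode layout (1x8x8319x1) in the same shared buffer.
    inline void repack_past_k(float* buf) {
        const std::size_t num_heads = 8, prefill_len = 7168, decode_len = 8319;
        // Prefill layout: head h starts at element h * 7168; the decode layout expects it
        // at h * 8319, so every head except head 0 must be moved. Go back-to-front so a
        // head's source data is not overwritten before it has been copied.
        for (std::size_t h = num_heads; h-- > 1;) {
            std::memmove(buf + h * decode_len, buf + h * prefill_len, prefill_len * sizeof(float));
        }
    }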

Owner


With strided reads/writes it should still work, but as we don't have those at the moment, the host-side copy is needed again.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>