
Conversation

@intelgaoxiong

Details:

  • item1
  • ...

Tickets:

  • ticket-id

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Comment on lines 260 to 266
const auto& prefill_compiled = m_npuw_llm_compiled_model->m_prefill_compiled;
for (std::size_t idx = 0; idx < prefill_compiled->m_compiled_submodels.size(); ++idx) {
if (prefill_compiled->submodel_device(idx) == "NPU") {
pre_alloc_on_npu = true;
break;
}
}
Owner


So what's the reason for this check? I believe the idea is to.. guarantee kvcache will be allocated on NPU to make sure the 2nd model reads it with no overhead - but shouldn't we check the 2nd model devices instead?
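For illustration, a minimal sketch of what checking the 2nd (generate) model could look like, assuming the compiled model exposes it via a member analogous to m_prefill_compiled (the name m_kvcache_compiled below is an assumption):

    // Sketch: check the generate-stage (kvcache) model instead, since it is the one
    // that will read the pre-allocated past KV tensors.
    bool pre_alloc_on_npu = false;
    const auto& kvcache_compiled = m_npuw_llm_compiled_model->m_kvcache_compiled;  // assumed member name
    for (std::size_t idx = 0; idx < kvcache_compiled->m_compiled_submodels.size(); ++idx) {
        if (kvcache_compiled->submodel_device(idx) == "NPU") {
            pre_alloc_on_npu = true;
            break;
        }
    }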

Author


Yes, you're right.
Modified.


// Record that past_kv has already been bound; a data copy will be needed when updating
// past KV in the infer requests to ensure the correct data layout
m_past_kv_binded = true;
Owner


Suggested change
m_past_kv_binded = true;
m_past_kv_bound = true;

Comment on lines 278 to 280
if (input_name.find("past_key_values") == std::string::npos) {
continue;
}
Owner


I believe there were some predefined constants for this
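For illustration, a sketch of what that could look like; the exact constant name (layer_names::past_key_values below) is an assumption:

    // Sketch: use a predefined constant for the past KV input prefix instead of a
    // raw string literal (layer_names::past_key_values is an assumed name).
    if (input_name.find(layer_names::past_key_values) == std::string::npos) {
        continue;
    }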

Comment on lines +291 to +294
auto origTensor = m_prefill_request->get_tensor(input_port);
auto new_tensor =
ov::get_tensor_impl(ov::Tensor(origTensor->get_element_type(), origTensor->get_shape(), data));
m_prefill_request->set_tensor(input_port, new_tensor);
Owner


What you set here as an input is a larger tensor than it is supposed to be, e.g. for prefill with prompt 1024 we'll in fact set 1152 (assuming MIN_RESPONSE_LEN 128).

While it certainly works for the case when the attention block and the range selector are properly identified (we're taking the necessary views of that tensor), it will break if:

  1. the attention block fails to be detected, so the prefill model actually has all static shapes
  2. the range is set to "ALL" (where we just do "set_tensor"); the past k/v tensors won't be compatible with the attention mask passed to the subgraph.

So I'd recommend taking a view [0..PROMPT_SIZE] over the kv-dim here.
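For illustration, a minimal sketch of that suggestion using the ov::Tensor ROI constructor; kv_dim and prompt_size are assumed to be known at this point:

    // Sketch: wrap the full-size buffer, then take a view [0..PROMPT_SIZE] over the kv-dim
    // so the prefill request only sees the prompt-sized region (no data copy here).
    auto origTensor = m_prefill_request->get_tensor(input_port);
    ov::Tensor full_tensor(origTensor->get_element_type(), origTensor->get_shape(), data);
    ov::Coordinate begin(full_tensor.get_shape().size(), 0);
    ov::Coordinate end(full_tensor.get_shape());
    end[kv_dim] = prompt_size;  // assumed: kv-dim index and prompt length
    ov::Tensor view(full_tensor, begin, end);  // ROI view over the kv-dim
    m_prefill_request->set_tensor(input_port, ov::get_tensor_impl(view));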

Author

@intelgaoxiong Sep 22, 2025


auto new_tensor =
            ov::get_tensor_impl(ov::Tensor(origTensor->get_element_type(), origTensor->get_shape(), data));

The new_tensor reuses the data pointer, but its shape is still origTensor->get_shape(), which is already PROMPT_SIZE on the kv-dim.

I think taking a view [0..PROMPT_SIZE] over the kv-dim may not work right now? The NPU does not support strided parameters at this moment.

Owner


But in prefill, the NPU won't access it either?

Yes, strided input is not supported, but we can pipeline it with the host-side copy; it shouldn't impact prefill much.

Author


Chunk prefill will access past KV as well.

Comment on lines +546 to +547
// Create backup of past KV tensor when buffer sharing is enabled to prevent data corruption
// This is necessary because subsequent copy operations would overwrite the shared buffer
Owner


When sharing is in place, we don't need to copy anything, is that right?

We only need to copy the last chunk's results.

Author


Initially, I shared the same perspective.
However, after hitting some accuracy issues and debugging, I realized that copying is needed.

This is because, during the prefill and decoding phases, although the same buffer is shared, the past KV tensor shapes differ between the two phases.

For instance, consider an input length of 8K and an output length of 128:
In the prefill phase, the past k tensor shape is 1x8x7168x1.
In the decoding phase, the past k tensor shape is 1x8x8319x1.
These differing tensor shapes imply that the tensors have different strides.
Consequently, the past KV stored in a 1x8x7168x1 tensor cannot be directly used for decoding.
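To make the layout mismatch concrete, here's a purely illustrative sketch (shapes taken from the example above; buf is a hypothetical element pointer into the shared past-k buffer) of why data written with the prefill strides must be repacked on the host before the decode phase can read it:

    #include <cstring>  // std::memmove

    // Sketch only: repack data written with the prefill layout (1x8x7168x1) so it can be
    // read with the decode layout (1x8x8319x1) in the same shared buffer.
    inline void repack_past_k(float* buf) {
        const std::size_t num_heads = 8, prefill_len = 7168, decode_len = 8319;
        // Prefill layout: head h starts at element h * 7168; the decode layout expects it
        // at h * 8319, so every head except head 0 must be moved. Go back-to-front so a
        // head's source data is not overwritten before it has been copied.
        for (std::size_t h = num_heads; h-- > 1;) {
            std::memmove(buf + h * decode_len, buf + h * prefill_len, prefill_len * sizeof(float));
        }
    }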

Owner


With strided reads/writes it should still work, but as we don't have those at the moment, the host-side copy is needed again.

Signed-off-by: intelgaoxiong <xiong.gao@intel.com>