Share past kv between prefill and kvcache #19
base: dm/dynctx
Conversation
Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
const auto& prefill_compiled = m_npuw_llm_compiled_model->m_prefill_compiled;
for (std::size_t idx = 0; idx < prefill_compiled->m_compiled_submodels.size(); ++idx) {
    if (prefill_compiled->submodel_device(idx) == "NPU") {
        pre_alloc_on_npu = true;
        break;
    }
}
So what's the reason for this check? I believe the idea is to guarantee the kvcache will be allocated on NPU so that the 2nd model reads it with no overhead - but shouldn't we check the 2nd model's devices instead?
Yes, you're right.
Modified.
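For reference, a minimal sketch of what the agreed change could look like, assuming the kvcache model is exposed as m_kvcache_compiled (the exact member name may differ): iterate over the kvcache (2nd) model's submodels instead of the prefill ones.

// Sketch only: check the devices of the kvcache (2nd) model, since it is the one
// that has to read the pre-allocated past KV tensors without extra overhead.
const auto& kvcache_compiled = m_npuw_llm_compiled_model->m_kvcache_compiled;  // assumed member name
for (std::size_t idx = 0; idx < kvcache_compiled->m_compiled_submodels.size(); ++idx) {
    if (kvcache_compiled->submodel_device(idx) == "NPU") {
        pre_alloc_on_npu = true;
        break;
    }
}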
// Record that we have already bind past_kv, will need data copy when update past kv in infer requests to
// ensure correct data layout
m_past_kv_binded = true;
Suggested change:
- m_past_kv_binded = true;
+ m_past_kv_bound = true;
if (input_name.find("past_key_values") == std::string::npos) {
    continue;
}
I believe there were some predefined constants for this
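For illustration, a hedged sketch of using a named constant instead of the string literal; the constant name below is hypothetical, and the codebase may already define one elsewhere that should be reused instead.

// Hypothetical constant; prefer the existing definition if the codebase already has one.
static constexpr const char* PAST_KEY_VALUES = "past_key_values";

if (input_name.find(PAST_KEY_VALUES) == std::string::npos) {
    continue;
}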
auto origTensor = m_prefill_request->get_tensor(input_port);
auto new_tensor =
    ov::get_tensor_impl(ov::Tensor(origTensor->get_element_type(), origTensor->get_shape(), data));
m_prefill_request->set_tensor(input_port, new_tensor);
What you set here as an input is a larger tensor than it is supposed to be, e.g. for prefill with a 1024-token prompt we'll in fact set 1152 (assuming MIN_RESPONSE_LEN 128).
While it certainly works for the case when the attention block and the range selector are properly identified (we take the necessary views of that tensor), it will break if:
- the attention block fails to be detected, so the prefill model actually has all static shapes;
- the range is set to "ALL" (where we just do "set_tensor"). The past k/v tensors won't be compatible with the attention mask passed to the subgraph.
So I'd recommend taking a view [0..PROMPT_SIZE] over the kv-dim here; see the sketch below.
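A sketch of what the suggested view could look like using an ROI tensor over the shared buffer; the variable names, the kv_dim index, and PROMPT_SIZE are assumptions here, not code from the PR.

// Sketch only: wrap the shared buffer, then take a [0..PROMPT_SIZE) ROI over the kv dimension.
ov::Tensor full(origTensor->get_element_type(), origTensor->get_shape(), data);
ov::Coordinate begin(full.get_shape().size(), 0);
ov::Coordinate end(full.get_shape());
end[kv_dim] = PROMPT_SIZE;  // assumed kv-dim index and prompt length
ov::Tensor view(full, begin, end);  // ROI view, no data copy
m_prefill_request->set_tensor(input_port, ov::get_tensor_impl(view));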
auto new_tensor =
    ov::get_tensor_impl(ov::Tensor(origTensor->get_element_type(), origTensor->get_shape(), data));
The new_tensor reuses the data pointer, but its shape is still origTensor->get_shape(), which is already PROMPT_SIZE on the kv-dim.
I think taking a view [0..PROMPT_SIZE] over the kv-dim may not work right now? The NPU does not support strided parameters at this moment.
But in prefill the NPU won't access it either?
Yes, strided input is not supported, but we can pipeline it with the host-side copy; it shouldn't impact prefill much.
Chunk prefill will access past KV as well.
// Create backup of past KV tensor when buffer sharing is enabled to prevent data corruption
// This is necessary because subsequent copy operations would overwrite the shared buffer
When sharing is in place, we don't need to copy anything, is that right?
We only need to copy the last chunk's results.
Initially, I shared the same perspective.
However, after hitting an accuracy issue and debugging it, I realized that copying is needed.
This is because, although the same buffer is shared between the prefill and decoding phases, the past KV tensor shapes differ between the two phases.
For instance, consider an input length of 8K and an output length of 128:
In the prefill phase, the past k tensor shape is 1x8x7168x1.
In the decoding phase, the past k tensor shape is 1x8x8319x1.
These differing tensor shapes imply that the tensors have different strides.
Consequently, the past KV stored in a 1x8x7168x1 tensor cannot be directly used for decoding.
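To make the stride mismatch concrete, here is a small self-contained sketch that computes row-major strides for the two shapes from the example above; it is illustrative only, not code from the PR.

#include <cstddef>
#include <iostream>
#include <vector>

// Row-major element strides for a given shape.
static std::vector<std::size_t> row_major_strides(const std::vector<std::size_t>& shape) {
    std::vector<std::size_t> s(shape.size(), 1);
    for (std::size_t i = shape.size() - 1; i-- > 0;) {
        s[i] = s[i + 1] * shape[i + 1];
    }
    return s;
}

int main() {
    const auto prefill = row_major_strides({1, 8, 7168, 1});  // prefill past k shape
    const auto decode  = row_major_strides({1, 8, 8319, 1});  // decode past k shape
    std::cout << "prefill head stride: " << prefill[1] << "\n"   // 7168
              << "decode  head stride: " << decode[1]  << "\n";  // 8319
    // Head h written at offset h*7168 during prefill is expected at offset h*8319 during
    // decoding, so reusing the raw buffer without a re-layout copy corrupts every head > 0.
    return 0;
}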
With strided reads/writes it should still work, but since we don't have them at the moment, a host-side copy is needed again.
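A minimal sketch of such a host-side re-layout copy, assuming fp16 (2-byte) elements, the shapes from the example above, and hypothetical source/destination pointers (prefill_past_kv_data, decode_past_kv_data) that are not part of the PR.

// Sketch only: copy each head's prefill rows to the offsets expected by the decode layout.
// Requires <cstring> for std::memcpy and <cstdint> for uint8_t.
const std::size_t heads = 8, prefill_len = 7168, decode_len = 8319, elem_size = 2;  // fp16 assumed
const auto* src = static_cast<const uint8_t*>(prefill_past_kv_data);  // 1x8x7168x1 layout (hypothetical)
auto* dst = static_cast<uint8_t*>(decode_past_kv_data);               // 1x8x8319x1 layout (hypothetical)
for (std::size_t h = 0; h < heads; ++h) {
    std::memcpy(dst + h * decode_len * elem_size,
                src + h * prefill_len * elem_size,
                prefill_len * elem_size);
}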
Signed-off-by: intelgaoxiong <xiong.gao@intel.com>
Details:
Tickets: