
StaticLLMPipeline: Optimize kvcache copy #1199

Open · wants to merge 1 commit into base: master

Conversation

@yviansu commented Nov 12, 2024

No description provided.

Signed-off-by: Su Yihan <yihan.su@intel.com>
@github-actions bot added the category: LLM (LLM pipeline (stateful, static)) and category: sampling (Sampling / Decoding algorithms) labels on Nov 12, 2024
@dmatveev commented:

Thanks @yviansu for your contribution! @AsyaPronina, can you please provide early feedback on this?

@ilya-lavrenov added the category: NPU label and removed the category: sampling (Sampling / Decoding algorithms) label on Nov 20, 2024
@AsyaPronina (Contributor) commented:

Hello, dear Su Yihan (@yviansu)! Could you provide performance numbers for this optimization?

const auto& input_name = kvcache_compiled.inputs()[kStartInputKVCacheLayers + i].get_any_name();
auto kvcache_in_tensor = m_kvcache_request.get_tensor(input_name);
Contributor:

Please don't forget to remove the trailing spaces.


prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Contributor:

Right now the models are indeed fp16, but in case something changes in this direction in the future, I propose to put this code under an if, so the optimized code is launched when its assumptions are met and the previous code is launched in all other cases.
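A minimal sketch of such a guard, assuming the element type can be checked on the tensors themselves; the fast path stands for the memcpy-based copy from this PR, the fallback is the existing slice copy:

// Take the optimized path only when its fp16 assumption holds.
if (prefill_out_tensor.get_element_type() == ov::element::f16 &&
    kvcache_in_tensor.get_element_type() == ov::element::f16) {
    // ... optimized memcpy-based copy introduced by this PR ...
} else {
    // Previous, type-agnostic path.
    prefill_out_slice.copy_to(kvcache_in_slice);
}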

@@ -237,6 +237,14 @@ void merge_config_with(ov::AnyMap& lhs, const ov::AnyMap& rhs) {
}
}

int get_data_size(ov::Shape shape, int dim) {
Contributor:

I think some comments would be of help here, to grasp this logic more quickly.

Contributor:

We also need to understand whether it will always be row-major order, or whether it can also be column-major. We either need to somehow launch this code only for row-major ordered tensors, or create two get_data_size() methods.
We also need to check whether there is a possibility of different ordering for different tensor layouts.

Contributor:

I think the method could also be renamed to something like get_step_for() or get_dim_step() to better reflect its purpose.

Contributor:

We also need to take strided tensors into account: will they appear in some use cases? If there is no use case for strided tensors, we should reflect that in a comment explaining why it is always safe to use exactly this function; otherwise, rewrite the function in another way.
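For reference, a commented reconstruction of what get_data_size() appears to compute, given how the diff uses it (a sketch under the dense row-major assumption discussed above, not necessarily the PR's exact body):

// Row-major stride of `dim`, in elements: the product of all dimensions
// to the right of `dim`, i.e. the distance between two consecutive
// indices along `dim` for a dense, unstrided tensor. Called with
// dim - 1 == -1 (when dim == 0) it degenerates to the total element count.
int get_data_size(ov::Shape shape, int dim) {
    int size = 1;
    for (int i = dim + 1; i < static_cast<int>(shape.size()); ++i) {
        size *= static_cast<int>(shape[i]);
    }
    return size;
}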


prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Contributor:

Please split this line so it fits within a line width of 100 (or 80) characters.

prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
uint16_t* dst_ptr = (uint16_t*)kvcache_in_tensor.data();
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
Contributor:

I propose to make something like this:

auto dim_step = get_step_for(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
auto start_offset = dim_step * (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens);
auto full_dim_size = dim_step * prefill_out_tensor.get_shape()[m_kvcache_desc.dim];

And do the same for the kvcache (dst) tensor data.
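Applied to the dst side, that might look like the following sketch (dst_dim_step and dst_full_dim_size are illustrative names; get_step_for() is the rename suggested above):

auto dst_dim_step = get_step_for(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim);
auto dst_full_dim_size = dst_dim_step * kvcache_in_tensor.get_shape()[m_kvcache_desc.dim];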

int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
for(int k = 0; k < (m_kvcache_desc.dim > 0 ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1) : 1); k++) {
Contributor:

Could you please precompute the limit for k outside of the loop?
For example, we can do something like this:
auto num_dim_repeats = ...;
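Filling in the bound from the loop above, that hoisting might look like this sketch (num_dim_repeats is the reviewer's suggested name):

// Compute the loop bound once instead of re-evaluating it per iteration.
const size_t num_dim_repeats = m_kvcache_desc.dim > 0
    ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1)
    : 1;
for (size_t k = 0; k < num_dim_repeats; k++) {
    // ... memcpy body as in the diff ...
}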

uint16_t* dst_ptr = (uint16_t*)kvcache_in_tensor.data();
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Contributor:

Please define it a bit above and reuse it for the src_ptr calculation. You can name it dim_step, as proposed in the comments above.

int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
for(int k = 0; k < (m_kvcache_desc.dim > 0 ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1) : 1); k++) {
memcpy(dst_ptr + k * dst_gap_size, src_ptr + k * src_gap_size, copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
@AsyaPronina (Contributor) commented Nov 20, 2024:

  1. It might make sense to do all the calculations in bytes right from the start and not multiply by the data type size here.
  2. It seems the logic is correct for m_kvcache_desc.dim == 1, and perhaps for m_kvcache_desc.dim == 0 (if such a case exists). However, it seems we won't cover the full tensor for m_kvcache_desc.dim == 2, since here we only traverse shape[1] * get_step_for(..., 1) (k = 0..shape[1], dst_gap_size = get_step_for(..., 1)), which is only the step for dim == 0 and not the full tensor size. What do you think?
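To illustrate point 2, one possible fix, sketched: run the loop over the product of all dimensions before dim rather than over shape[dim - 1] alone. Since src_gap_size and dst_gap_size are the strides of dim - 1, k times the gap then walks every outer slice for any dim. This assumes src and dst agree on all dimensions before dim; num_outer_slices is an illustrative name, not from the PR.

// Cover the full tensor for any m_kvcache_desc.dim by iterating the
// flattened product of all outer dimensions, not just shape[dim - 1].
size_t num_outer_slices = 1;
for (int i = 0; i < m_kvcache_desc.dim; ++i) {
    num_outer_slices *= kvcache_in_tensor.get_shape()[i];
}
for (size_t k = 0; k < num_outer_slices; ++k) {
    memcpy(dst_ptr + k * dst_gap_size,
           src_ptr + k * src_gap_size,
           copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
}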

Labels: category: LLM (LLM pipeline (stateful, static)), category: NPU
4 participants