StaticLLMPipeline: Optimize kvcache copy #1199
base: master
Conversation
Signed-off-by: Su Yihan <yihan.su@intel.com>
Thanks @yviansu for your contribution! @AsyaPronina, can you please provide early feedback on this?
Hello, dear Su Yihan (@yviansu)! Could you provide performance numbers for this optimization?
const auto& input_name = kvcache_compiled.inputs()[kStartInputKVCacheLayers + i].get_any_name();
auto kvcache_in_tensor = m_kvcache_request.get_tensor(input_name);
please don't forget to remove trailing spaces
prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Right now the models are indeed of fp16 type, but if something changes in this direction in the future, I propose to put this code under an `if`, so that the optimized code is launched when its assumptions are met and the previous code is launched in all other cases.
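For illustration, a minimal sketch of how such a guard could look, assuming `ov::element::f16` is the only type the optimized path supports (variable names are taken from the diff below, and the fallback is the original `copy_to` call):

```cpp
// Sketch only: run the raw-pointer copy only when the type assumptions hold
// (fp16 elements on both sides); otherwise fall back to the generic slice copy.
if (prefill_out_tensor.get_element_type() == ov::element::f16 &&
    kvcache_in_tensor.get_element_type() == ov::element::f16) {
    // ... optimized memcpy-based copy from this PR ...
} else {
    prefill_out_slice.copy_to(kvcache_in_slice);
}
```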
@@ -237,6 +237,14 @@ void merge_config_with(ov::AnyMap& lhs, const ov::AnyMap& rhs) {
    }
}
int get_data_size(ov::Shape shape, int dim) { |
I think some comments would be of help here, to grasp this logic more quickly.
We also need to understand whether it will always be row-major order, or whether it can be column-major. We either need to somehow launch this code only for row-major tensors, or create two `get_data_size()` methods.
We also need to check whether there is a possibility of different ordering for different tensor layouts.
I think the method could also be renamed to something like `get_step_for()` or `get_dim_step()` to better reflect its purpose.
We also need to take strided tensors into account: will they appear in some use cases? If there is no use case for strided tensors, we should reflect that in a comment explaining why it is always safe to use exactly this function; otherwise the function should be rewritten.
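Putting these points together, one possible shape for the helper is sketched below. The `get_step_for` name follows the rename suggested above; the row-major and dense-tensor assumptions are stated in a comment rather than verified, which is exactly the open question here:

```cpp
// Number of elements between two consecutive indices along `dim`, i.e. the
// product of all dimensions to the right of `dim`.
// Assumes a dense (non-strided), row-major ov::Tensor; if strided or
// column-major tensors can reach this code, this helper must not be used as-is.
std::size_t get_step_for(const ov::Shape& shape, int dim) {
    std::size_t step = 1u;
    for (std::size_t i = dim + 1; i < shape.size(); ++i) {
        step *= shape[i];
    }
    return step;
}
```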
prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Please split the line so it fits into a line width of 100 (or 80) characters.
prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
uint16_t* dst_ptr = (uint16_t*)kvcache_in_tensor.data();
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
I propose to do something like this:

auto dim_step = get_step_for(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
auto start_offset = dim_step * (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens);
auto full_dim_size = dim_step * prefill_out_tensor.get_shape()[m_kvcache_desc.dim];

And do the same with the kvcache (`dst`) tensor data.
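Spelled out for both tensors, the suggestion might read roughly as follows. Names follow the comment above; `get_step_for` is the renamed helper sketched earlier and is an assumption, not code that exists in the PR yet:

```cpp
// Sketch: the same derivation applied to the prefill (src) and kvcache (dst) tensors.
const auto src_shape = prefill_out_tensor.get_shape();
const auto dst_shape = kvcache_in_tensor.get_shape();

const auto dim_step     = get_step_for(src_shape, m_kvcache_desc.dim);
const auto start_offset = dim_step * (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens);

// Full extent of `dim` in elements; for a dense tensor this doubles as the
// per-block stride when stepping over the leading dimensions.
const auto src_full_dim_size = dim_step * src_shape[m_kvcache_desc.dim];
const auto dst_full_dim_size = get_step_for(dst_shape, m_kvcache_desc.dim) * dst_shape[m_kvcache_desc.dim];

auto* src_ptr = prefill_out_tensor.data<ov::float16>() + start_offset;
auto* dst_ptr = kvcache_in_tensor.data<ov::float16>();
```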
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
for(int k = 0; k < (m_kvcache_desc.dim > 0 ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1) : 1); k++) {
Could you please predefine the limit for `k` outside of the loop? For example, we can do something like this:

auto num_dim_repeats = ...;
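A small sketch of hoisting the bound, continuing the naming from the comment; the variables around the loop are the ones from the diff:

```cpp
// Sketch: compute the loop bound once instead of re-evaluating it on every iteration.
// `num_dim_repeats` is the size of the dimension preceding m_kvcache_desc.dim
// (or 1 when dim == 0).
const std::size_t num_dim_repeats = m_kvcache_desc.dim > 0
    ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1)
    : 1u;
for (std::size_t k = 0; k < num_dim_repeats; ++k) {
    std::memcpy(dst_ptr + k * dst_gap_size,
                src_ptr + k * src_gap_size,
                copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
}
```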
uint16_t* dst_ptr = (uint16_t*)kvcache_in_tensor.data();
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Please predefine it a bit above and reuse it for the `src_ptr` calculation. You can name it `dim_step`, as proposed in the comments above.
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
for(int k = 0; k < (m_kvcache_desc.dim > 0 ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1) : 1); k++) {
    memcpy(dst_ptr + k * dst_gap_size, src_ptr + k * src_gap_size, copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
- It might make sense to do all the calculations in bytes from the start, and not multiply by the data type size here.
- It seems the logic is correct for `m_kvcache_desc.dim == 1`, and might be for `m_kvcache_desc.dim == 0` (if such a case exists). However, it seems we won't cover the full tensor for `m_kvcache_desc.dim == 2`, as here we only walk through `shape[1] * get_step_for(..., 1)` (`k = 0..shape[1]`, `dst_gap_size = get_step_for(..., 1)`), which is only the step for `dim == 0` and not the full tensor size. What do you think?
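For reference, one hedged way to address this last point is to iterate over the product of all dimensions before `dim`, so every outer slice is visited regardless of the value of `m_kvcache_desc.dim`. Names follow the sketches above; this only illustrates the concern, it is not a change the author has committed to:

```cpp
// Sketch: the number of contiguous blocks to copy is the product of all
// dimensions *before* `dim`, not just the immediately preceding one.
// For a dense tensor the per-block strides remain get_step_for(shape, dim - 1).
std::size_t num_blocks = 1u;
for (int d = 0; d < m_kvcache_desc.dim; ++d) {
    num_blocks *= kvcache_in_tensor.get_shape().at(d);
}
for (std::size_t k = 0; k < num_blocks; ++k) {
    std::memcpy(dst_ptr + k * dst_gap_size,
                src_ptr + k * src_gap_size,
                copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
}
```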