StaticLLMPipeline: Optimize kvcache copy #1199
base: master
Conversation
Signed-off-by: Su Yihan <yihan.su@intel.com>
Thanks @yviansu for your contribution! @AsyaPronina, can you please provide early feedback on this?
Hello, dear Su Yihan (@yviansu)! Could you provide performance numbers for this optimization?
const auto& input_name = kvcache_compiled.inputs()[kStartInputKVCacheLayers + i].get_any_name();
auto kvcache_in_tensor = m_kvcache_request.get_tensor(input_name);
please don't forget to remove trailing spaces
prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Right now the models are indeed of fp16 type, but if something changes in this direction in the future, I propose to put this code under an `if`, so that the optimized code is launched when its assumptions are met and the previous code is launched in all other cases.
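For illustration, a minimal sketch of how such a guard could look, assuming `ov::element::f16` is the only type the optimized path supports (variable names are taken from the diff below, and the fallback is the original `copy_to` call):

```cpp
// Sketch only: run the raw-pointer copy only when the type assumptions hold
// (fp16 elements on both sides); otherwise fall back to the generic slice copy.
if (prefill_out_tensor.get_element_type() == ov::element::f16 &&
    kvcache_in_tensor.get_element_type() == ov::element::f16) {
    // ... optimized memcpy-based copy from this PR ...
} else {
    prefill_out_slice.copy_to(kvcache_in_slice);
}
```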
@@ -237,6 +237,14 @@ void merge_config_with(ov::AnyMap& lhs, const ov::AnyMap& rhs) {
    }
}
int get_data_size(ov::Shape shape, int dim) { |
I think some comments would be of help here, to grasp this logic more quickly.
We also need to understand whether it will always be row-major order, or whether it can be column-major. We either need to somehow launch this code only for row-major tensors, or create two `get_data_size()` methods.
We also need to check whether there is a possibility of different ordering for different tensor layouts.
I think the method could also be renamed to something like `get_step_for()` or `get_dim_step()` to better reflect its purpose.
We also need to take strided tensors into account: will they appear in some use cases? If there is no use case for strided tensors, we should reflect that in a comment explaining why it is always safe to use exactly this function; otherwise the function should be rewritten.
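Putting these points together, one possible shape for the helper is sketched below. The `get_step_for` name follows the rename suggested above; the row-major and dense-tensor assumptions are stated in a comment rather than verified, which is exactly the open question here:

```cpp
// Number of elements between two consecutive indices along `dim`, i.e. the
// product of all dimensions to the right of `dim`.
// Assumes a dense (non-strided), row-major ov::Tensor; if strided or
// column-major tensors can reach this code, this helper must not be used as-is.
std::size_t get_step_for(const ov::Shape& shape, int dim) {
    std::size_t step = 1u;
    for (std::size_t i = dim + 1; i < shape.size(); ++i) {
        step *= shape[i];
    }
    return step;
}
```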
prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Please split the line so it fits into a line width of 100 (or 80) characters.
prefill_out_slice.copy_to(kvcache_in_slice);
uint16_t* src_ptr = (uint16_t*)prefill_out_tensor.data() + (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens) * get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
uint16_t* dst_ptr = (uint16_t*)kvcache_in_tensor.data();
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
I propose to do something like this:

auto dim_step = get_step_for(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
auto start_offset = dim_step * (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens);
auto full_dim_size = dim_step * prefill_out_tensor.get_shape()[m_kvcache_desc.dim];

And do the same with the kvcache (`dst`) tensor data.
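Spelled out for both tensors, the suggestion might read roughly as follows. Names follow the comment above; `get_step_for` is the renamed helper sketched earlier and is an assumption, not code that exists in the PR yet:

```cpp
// Sketch: the same derivation applied to the prefill (src) and kvcache (dst) tensors.
const auto src_shape = prefill_out_tensor.get_shape();
const auto dst_shape = kvcache_in_tensor.get_shape();

const auto dim_step     = get_step_for(src_shape, m_kvcache_desc.dim);
const auto start_offset = dim_step * (m_kvcache_desc.max_prompt_size - m_kvcache_desc.num_stored_tokens);

// Full extent of `dim` in elements; for a dense tensor this doubles as the
// per-block stride when stepping over the leading dimensions.
const auto src_full_dim_size = dim_step * src_shape[m_kvcache_desc.dim];
const auto dst_full_dim_size = get_step_for(dst_shape, m_kvcache_desc.dim) * dst_shape[m_kvcache_desc.dim];

auto* src_ptr = prefill_out_tensor.data<ov::float16>() + start_offset;
auto* dst_ptr = kvcache_in_tensor.data<ov::float16>();
```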
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
for(int k = 0; k < (m_kvcache_desc.dim > 0 ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1) : 1); k++) {
Could you please predefine the limit for `k` outside of the loop? For example, we can do something like this:

auto num_dim_repeats = ...;
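A small sketch of hoisting the bound, continuing the naming from the comment; the variables around the loop are the ones from the diff:

```cpp
// Sketch: compute the loop bound once instead of re-evaluating it on every iteration.
// `num_dim_repeats` is the size of the dimension preceding m_kvcache_desc.dim
// (or 1 when dim == 0).
const std::size_t num_dim_repeats = m_kvcache_desc.dim > 0
    ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1)
    : 1u;
for (std::size_t k = 0; k < num_dim_repeats; ++k) {
    std::memcpy(dst_ptr + k * dst_gap_size,
                src_ptr + k * src_gap_size,
                copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
}
```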
uint16_t* dst_ptr = (uint16_t*)kvcache_in_tensor.data();
int src_gap_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim - 1);
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
Please predefine it a bit above and reuse it for the `src_ptr` calculation. You can name it `dim_step`, as proposed in the comments above.
int dst_gap_size = get_data_size(kvcache_in_tensor.get_shape(), m_kvcache_desc.dim - 1);
int copy_size = get_data_size(prefill_out_tensor.get_shape(), m_kvcache_desc.dim);
for(int k = 0; k < (m_kvcache_desc.dim > 0 ? kvcache_in_tensor.get_shape().at(m_kvcache_desc.dim - 1) : 1); k++) {
    memcpy(dst_ptr + k * dst_gap_size, src_ptr + k * src_gap_size, copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
- It might make sense to do all the calculations in bytes from the start, and not multiply by the data type size here.
- It seems the logic is correct for `m_kvcache_desc.dim == 1`, and might be for `m_kvcache_desc.dim == 0` (if such a case exists). However, it seems we won't cover the full tensor for `m_kvcache_desc.dim == 2`, as here we only walk through `shape[1] * get_step_for(..., 1)` (`k = 0..shape[1]`, `dst_gap_size = get_step_for(..., 1)`), which is only the step for `dim == 0` and not the full tensor size. What do you think?
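For reference, one hedged way to address this last point is to iterate over the product of all dimensions before `dim`, so every outer slice is visited regardless of the value of `m_kvcache_desc.dim`. Names follow the sketches above; this only illustrates the concern, it is not a change the author has committed to:

```cpp
// Sketch: the number of contiguous blocks to copy is the product of all
// dimensions *before* `dim`, not just the immediately preceding one.
// For a dense tensor the per-block strides remain get_step_for(shape, dim - 1).
std::size_t num_blocks = 1u;
for (int d = 0; d < m_kvcache_desc.dim; ++d) {
    num_blocks *= kvcache_in_tensor.get_shape().at(d);
}
for (std::size_t k = 0; k < num_blocks; ++k) {
    std::memcpy(dst_ptr + k * dst_gap_size,
                src_ptr + k * src_gap_size,
                copy_size * sizeof(ov::float16) * m_kvcache_desc.num_stored_tokens);
}
```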