
llm: use stateful model #8

Closed
wants to merge 6 commits into from

Conversation

Wovchena
Owner

No description provided.

@@ -234,7 +234,7 @@ struct GroupBeamSearcher {
group->is_done(parameters);
if (!group->done) {
for (const Beam& beam : group->ongoing) {
next_tokens.push_back({beam.tokens.back(), beam.global_beam_idx});
next_tokens.push_back({beam.tokens.back(), int32_t(beam.global_beam_idx)});

Why do we need to transpose tokens/beams to pack it into TokenToBeam? Can we just return two vectors: tokens and corresponding beam global indices? Later in the code we can just use elements in vectors as a host memory for tensors input_ids and beam_idx without need to manually copy each element in the loop. It is negligible from the performance side, but the code would be simpler. What do you think?

Wovchena (Owner, Author)

Updated

shape.at(0) = 1;
ireq.get_input_tensor(idx).set_shape(shape);
}
ireq.get_tensor("beam_idx").set_shape({1});

You can get the tensor once and use it in all the places below; it would be shorter.

Wovchena (Owner, Author)

It's no longer applicable for beam_idx, but I did it for position_ids

llm/cpp/llm.cpp Outdated
Comment on lines 64 to 68
int64_t* input_tokens = ireq.get_tensor("input_ids").data<int64_t>();
int32_t* beam_idx = ireq.get_tensor("beam_idx").data<int32_t>();
for (size_t batch_idx = 0; batch_idx < batch_size; ++batch_idx) {
ireq.get_tensor("input_ids").data<int64_t>()[batch_idx] = next_tokens.at(batch_idx).token_idx;
for (size_t tensor_idx = 3; tensor_idx < inputs.size(); ++tensor_idx) {
ov::Tensor present = ireq.get_output_tensor(tensor_idx - 2);
ov::Shape present_begin = {next_tokens.at(batch_idx).beam_idx, 0, 0, 0};
ov::Shape present_end = present.get_shape();
present_end.at(0) = next_tokens.at(batch_idx).beam_idx + 1;
ov::Tensor past = ireq.get_input_tensor(tensor_idx);
ov::Shape past_begin = {batch_idx, 0, 0, 0};
ov::Shape past_end = past.get_shape();
past_end.at(0) = batch_idx + 1;
ov::Tensor{present, present_begin, present_end}.copy_to(ov::Tensor{past, past_begin, past_end});
}
input_tokens[batch_idx] = next_tokens.at(batch_idx).token_idx;
beam_idx[batch_idx] = next_tokens.at(batch_idx).beam_idx;

Can we extend the Tensor API to add some sugar: ireq.get_tensor("beam_idx").copy_from(beam_idx), where beam_idx is a std::vector<int32_t> in this case? Setting shapes is still required; the number of elements in the tensor and the vector should match.

It would allow us to write something like the following:

beam_idx_tensor.copy_from(beams)
input_ids_tensor.copy_from(tokens)

if the changes suggested in other comments are applied.


We can create more generic ctor which accepts std::vector:

template <typename T>
Tensor(const std::vector<T>& vec); // creates ov::Tensor on pre-allocated memory

And then:

ov::Tensor(beams).copy_to(beam_idx_tensor);

What do you think?


When you accept a vector as a ctor argument, you need to think about whether it is copied or shared. The existing ctors in Tensor wrap memory passed as a pointer. If we do that for a vector, it would be error-prone, because users will start passing temporary vector objects right in the ctor since it looks convenient.

If you copy the content instead of sharing, it looks safer, but it brings another copy if copy_to is used. We could use set_tensor, but it looks like something that intentionally has more copies underneath, which is probably OK in this particular case. This is not a performance-critical part of this sample, so we can choose either way, but I am afraid we could introduce bad practices that don't scale well with bigger tensors.


llm/cpp/llm.cpp Outdated
Comment on lines 59 to 62
attention_mask.set_shape(mask_shape);
std::fill_n(attention_mask.data<int64_t>(), shape_size(mask_shape), 1);
ireq.get_tensor("position_ids").set_shape({batch_size, 1});
std::fill_n(ireq.get_tensor("position_ids").data<int64_t>(), batch_size, mask_shape.at(1) - 1);

Candidates to be moved under another text-generation building block.

Wovchena added a commit that referenced this pull request Dec 12, 2023
…ck_strings() (#8)

* Replace complex loops with index access, reuse pack_strings() and unpack_strings()

* shape.at(0)->shape[0]

* {1, std::numeric_limits<ov::Dimension::value_type>::max()}->-1

* Move pack_strings()

* Upgrade optimum

* int32_t->int64_t
}
}
position_ids.set_shape({batch_size, 1});
std::fill_n(position_ids.data<int64_t>(), batch_size, mask_shape.at(1) - 1);
@ilya-lavrenov Dec 12, 2023

maybe we can also create a Tensor with a ctor which initializes all elements with a given value?

ireq.set_tensor("position_ids", ov::Tensor(ov::element::i64, {batch_size, 1}, length_count + prompt_length));
ireq.set_tensor("attention_mask", ov::Tensor(ov::element::i64, {batch_size, length_count + prompt_length + 1}, 1));

instead of:

ov::Tensor attention_mask = ireq.get_tensor("attention_mask");
ov::Shape mask_shape{batch_size, attention_mask.get_shape().at(1) + 1};
attention_mask.set_shape(mask_shape);
std::fill_n(attention_mask.data<int64_t>(), shape_size(mask_shape), 1);
position_ids.set_shape({batch_size, 1});
std::fill_n(position_ids.data<int64_t>(), batch_size, mask_shape.at(1) - 1);

@slyalin @Wovchena

Wovchena (Owner, Author)

Ideally ov::Tensor should have functions similar to torch.Tensor: broadcasting and all that. Sure, you can use cases like this to prioritize function implementation, but my perception was that ov tries to keep the Tensor API minimal.


But we also want to avoid boilerplate code in samples, so we can extend ov::Tensor with more helpers / constructors


The proposed method is similar to https://pytorch.org/docs/stable/generated/torch.Tensor.new_full.html#torch.Tensor.new_full
Maybe we can create static methods ov::Tensor::new_full, ov::Tensor::new_ones

@Wovchena (Owner, Author)

Included in openvinotoolkit#81

@Wovchena closed this Dec 18, 2023
Wovchena added a commit that referenced this pull request May 27, 2024
* enable

* libtbb-dev

* move

* slash

* install

* core_genai_dev

* remove export

* rreorganaise components

* add SOVERSION, and requirements-build.txt

* repalce SKBUILD with EXCLUDE_FROM_ALL because the effect is the same

* fix NAMELINK_COMPONENT

* remove extraline

* add soft restrictions

* Fix build to unblock packaging

* improve naming

* install samples

* remove quotes

* use main target name because an alias can't be specified in cmake --target

* define CMAKE_BUILD_PARALLEL_LEVEL

* Ensure ./requirements-build.txt won't outdate

* Use ./requirements-build.txt in python lib build

* Add missing &&

* Test Debug

* add matrix for windows_genai_package

* openvino_tokenizers from form

* update openvino_tokenizers

* update openvino_tokenizers

* update openvino_tokenizers

* revert openvino_tokenizers

* tokenizers from fork

* update tokenizers

* centos7_2024.2.0.dev

* copy target

* revert tokenizers

* reapply useful changes

* copy so only

* fix CMAKE_BUILD_PARALLEL_LEVEL
Wovchena pushed a commit that referenced this pull request Jun 19, 2024
3 participants