
llm: use stateful model #8

Closed
wants to merge 6 commits into from

Conversation

Wovchena
Owner

No description provided.

@@ -234,7 +234,7 @@ struct GroupBeamSearcher {
group->is_done(parameters);
if (!group->done) {
for (const Beam& beam : group->ongoing) {
next_tokens.push_back({beam.tokens.back(), beam.global_beam_idx});
next_tokens.push_back({beam.tokens.back(), int32_t(beam.global_beam_idx)});

Why do we need to transpose tokens/beams to pack it into TokenToBeam? Can we just return two vectors: tokens and corresponding beam global indices? Later in the code we can just use elements in vectors as a host memory for tensors input_ids and beam_idx without need to manually copy each element in the loop. It is negligible from the performance side, but the code would be simpler. What do you think?

Wovchena (Owner, Author)

Updated

shape.at(0) = 1;
ireq.get_input_tensor(idx).set_shape(shape);
}
ireq.get_tensor("beam_idx").set_shape({1});

You can get the tensor once and use it in all the places below; it would be shorter.

Wovchena (Owner, Author)

It's no longer applicable for beam_idx, but I did it for position_ids

llm/cpp/llm.cpp Outdated
Comment on lines 64 to 68
int64_t* input_tokens = ireq.get_tensor("input_ids").data<int64_t>();
int32_t* beam_idx = ireq.get_tensor("beam_idx").data<int32_t>();
for (size_t batch_idx = 0; batch_idx < batch_size; ++batch_idx) {
ireq.get_tensor("input_ids").data<int64_t>()[batch_idx] = next_tokens.at(batch_idx).token_idx;
for (size_t tensor_idx = 3; tensor_idx < inputs.size(); ++tensor_idx) {
ov::Tensor present = ireq.get_output_tensor(tensor_idx - 2);
ov::Shape present_begin = {next_tokens.at(batch_idx).beam_idx, 0, 0, 0};
ov::Shape present_end = present.get_shape();
present_end.at(0) = next_tokens.at(batch_idx).beam_idx + 1;
ov::Tensor past = ireq.get_input_tensor(tensor_idx);
ov::Shape past_begin = {batch_idx, 0, 0, 0};
ov::Shape past_end = past.get_shape();
past_end.at(0) = batch_idx + 1;
ov::Tensor{present, present_begin, present_end}.copy_to(ov::Tensor{past, past_begin, past_end});
}
input_tokens[batch_idx] = next_tokens.at(batch_idx).token_idx;
beam_idx[batch_idx] = next_tokens.at(batch_idx).beam_idx;

Can we extend the Tensor API to add some sugar: ireq.get_tensor("beam_idx").copy_from(beam_idx), where beam_idx is a std::vector<int32_t> in this case? Setting shapes is still required; the number of elements in the tensor and the vector should match.

It would allow us to write something like the following:

beam_idx_tensor.copy_from(beams)
input_ids_tensor.copy_from(tokens)

if the changes suggested in other comments are applied.


We can create more generic ctor which accepts std::vector:

template <typename T>
Tensor(const std::vector<T>& vec); // creates ov::Tensor on pre-allocated memory

And then:

ov::Tensor(beams).copy_to(beam_idx_tensor);

What do you think?


When you accept a vector as a ctor argument, you need to think about whether it is copied or shared. The existing ctors in Tensor wrap memory passed as a pointer. If we do that for a vector, it would be error-prone, because users will start passing temporary vector objects right in the ctor since it looks convenient.

If you copy the content instead of sharing, it looks safer, but it brings another copy if copy_to is used. We could use set_tensor, but it looks like something that intentionally has more copies underneath, which is probably OK in this particular case. This is not a performance-critical part of this sample, so we can choose either way, but I am afraid we could introduce bad practices that don't scale well with bigger tensors.


llm/cpp/llm.cpp Outdated
Comment on lines 59 to 62
attention_mask.set_shape(mask_shape);
std::fill_n(attention_mask.data<int64_t>(), shape_size(mask_shape), 1);
ireq.get_tensor("position_ids").set_shape({batch_size, 1});
std::fill_n(ireq.get_tensor("position_ids").data<int64_t>(), batch_size, mask_shape.at(1) - 1);

Candidates to be moved under another text-generation building block.

Wovchena added a commit that referenced this pull request Dec 12, 2023
…ck_strings() (#8)

* Replace complex loops with index access, reuse pack_strings() and unpack_strings()

* shape.at(0)->shape[0]

* {1, std::numeric_limits<ov::Dimension::value_type>::max()}->-1

* Move pack_strings()

* Upgrade optimum

* int32_t->int64_t
}
}
position_ids.set_shape({batch_size, 1});
std::fill_n(position_ids.data<int64_t>(), batch_size, mask_shape.at(1) - 1);
@ilya-lavrenov Dec 12, 2023

maybe we can also create a Tensor with a ctor which initializes all elements with a given value?

ireq.set_tensor("position_ids", ov::Tensor(ov::element::i64, {batch_size, 1}, length_count + prompt_length));
ireq.set_tensor("attention_mask", ov::Tensor(ov::element::i64, {batch_size, length_count + prompt_length + 1}, 1));

instead of:

ov::Tensor attention_mask = ireq.get_tensor("attention_mask");
ov::Shape mask_shape{batch_size, attention_mask.get_shape().at(1) + 1};
attention_mask.set_shape(mask_shape);
std::fill_n(attention_mask.data<int64_t>(), shape_size(mask_shape), 1);
position_ids.set_shape({batch_size, 1});
std::fill_n(position_ids.data<int64_t>(), batch_size, mask_shape.at(1) - 1);

@slyalin @Wovchena

Wovchena (Owner, Author)

Ideally ov::Tensor should have functions similar to torch.Tensor: broadcasting and all that. Sure, you can use cases like this to prioritize function implementation, but my perception was that ov tries to keep the Tensor API minimal.


But we also want to avoid boilerplate code in samples, so we can extend ov::Tensor with more helpers / constructors


The proposed method is similar to https://pytorch.org/docs/stable/generated/torch.Tensor.new_full.html#torch.Tensor.new_full
Maybe we can create static methods ov::Tensor::new_full, ov::Tensor::new_ones

@Wovchena (Owner, Author)

Included in openvinotoolkit#81

@Wovchena closed this Dec 18, 2023
Wovchena added a commit that referenced this pull request May 27, 2024
* enable

* libtbb-dev

* move

* slash

* install

* core_genai_dev

* remove export

* rreorganaise components

* add SOVERSION, and requirements-build.txt

* repalce SKBUILD with EXCLUDE_FROM_ALL because the effect is the same

* fix NAMELINK_COMPONENT

* remove extraline

* add soft restrictions

* Fix build to unblock packaging

* improve naming

* install samples

* remove quotes

* use main target name because an alias can't be specified in cmake --target

* define CMAKE_BUILD_PARALLEL_LEVEL

* Ensure ./requirements-build.txt won't outdate

* Use ./requirements-build.txt in python lib build

* Add missing &&

* Test Debug

* add matrix for windows_genai_package

* openvino_tokenizers from form

* update openvino_tokenizers

* update openvino_tokenizers

* update openvino_tokenizers

* revert openvino_tokenizers

* tokenizers from fork

* update tokenizers

* centos7_2024.2.0.dev

* copy target

* revert tokenizers

* reapply useful changes

* copy so only

* fix CMAKE_BUILD_PARALLEL_LEVEL
Wovchena pushed a commit that referenced this pull request Jun 19, 2024
3 participants