llm: use stateful model #8
Conversation
llm/cpp/group_beam_searcher.hpp
Outdated
@@ -234,7 +234,7 @@ struct GroupBeamSearcher {
     group->is_done(parameters);
     if (!group->done) {
         for (const Beam& beam : group->ongoing) {
-            next_tokens.push_back({beam.tokens.back(), beam.global_beam_idx});
+            next_tokens.push_back({beam.tokens.back(), int32_t(beam.global_beam_idx)});
Why do we need to transpose tokens/beams to pack them into TokenToBeam? Can we just return two vectors: tokens and the corresponding global beam indices? Later in the code we can use the vectors' elements directly as host memory for the input_ids and beam_idx tensors, without manually copying each element in a loop. It is negligible on the performance side, but the code would be simpler. What do you think?
Updated
    shape.at(0) = 1;
    ireq.get_input_tensor(idx).set_shape(shape);
}
ireq.get_tensor("beam_idx").set_shape({1});
You can get the tensor once and reuse it in all the places below; it would be shorter.
It's no longer applicable for beam_idx, but I did it for position_ids.
llm/cpp/llm.cpp
Outdated
int64_t* input_tokens = ireq.get_tensor("input_ids").data<int64_t>();
int32_t* beam_idx = ireq.get_tensor("beam_idx").data<int32_t>();
for (size_t batch_idx = 0; batch_idx < batch_size; ++batch_idx) {
    ireq.get_tensor("input_ids").data<int64_t>()[batch_idx] = next_tokens.at(batch_idx).token_idx;
    for (size_t tensor_idx = 3; tensor_idx < inputs.size(); ++tensor_idx) {
        ov::Tensor present = ireq.get_output_tensor(tensor_idx - 2);
        ov::Shape present_begin = {next_tokens.at(batch_idx).beam_idx, 0, 0, 0};
        ov::Shape present_end = present.get_shape();
        present_end.at(0) = next_tokens.at(batch_idx).beam_idx + 1;
        ov::Tensor past = ireq.get_input_tensor(tensor_idx);
        ov::Shape past_begin = {batch_idx, 0, 0, 0};
        ov::Shape past_end = past.get_shape();
        past_end.at(0) = batch_idx + 1;
        ov::Tensor{present, present_begin, present_end}.copy_to(ov::Tensor{past, past_begin, past_end});
    }
    input_tokens[batch_idx] = next_tokens.at(batch_idx).token_idx;
    beam_idx[batch_idx] = next_tokens.at(batch_idx).beam_idx;
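The inner loop in this snippet reorders each "present" KV-cache tensor into the "past" input according to the beam index chosen for each batch slot. Stripped of the OpenVINO ROI machinery, the operation is a gather along the batch dimension; a minimal sketch with plain vectors standing in for tensors (names hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each element models one beam's slice of a KV-cache tensor (dim 0 = batch).
using BeamState = std::vector<float>;

// Reorders "present" states into "past" according to the chosen beam
// indices, mirroring the per-batch ROI copy in the snippet above.
inline std::vector<BeamState> gather_past(const std::vector<BeamState>& present,
                                          const std::vector<int32_t>& beam_idx) {
    std::vector<BeamState> past;
    past.reserve(beam_idx.size());
    for (int32_t idx : beam_idx) {
        past.push_back(present.at(static_cast<size_t>(idx)));
    }
    return past;
}
```

With a stateful model, this gather is exactly what feeding beam_idx to the model lets the runtime do internally instead.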
Can we extend the API of Tensor to add some sugar: ireq.get_tensor("beam_idx").copy_from(beam_idx), where beam_idx is a std::vector<int32_t> in this case? Setting shapes would still be required; the number of elements in the tensor and the vector should match.
It would allow us to write something like the following, if the changes suggested in other comments are applied:
beam_idx_tensor.copy_from(beams);
input_ids_tensor.copy_from(tokens);
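The proposed copy_from sugar is not in the ov::Tensor API; as a sketch of its intended contract, here is a hypothetical free-function version copying a vector into a pre-shaped destination buffer, with the element-count check the comment asks for:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the proposed Tensor::copy_from: copy a vector's
// contents into an already-shaped destination buffer, insisting that the
// element counts match (shapes must still be set beforehand).
template <typename T>
void copy_from(T* dst, size_t dst_count, const std::vector<T>& src) {
    if (dst_count != src.size()) {
        throw std::invalid_argument("element count mismatch");
    }
    std::memcpy(dst, src.data(), src.size() * sizeof(T));
}
```

In the real API, `dst`/`dst_count` would come from the tensor's data pointer and element count, so the caller writes just one line per tensor.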
We can create a more generic ctor which accepts a std::vector:
template <typename T>
Tensor(const std::vector<T>& vec); // creates ov::Tensor on pre-allocated memory
And then:
ov::Tensor(beams).copy_to(beam_idx_tensor);
What do you think?
When you accept a vector as a ctor argument, you need to decide whether it is copied or shared. The existing ctors in Tensor wrap memory passed as a pointer. If we do that for a vector it would be error-prone, because users will start passing temporary vector objects right in the ctor, since it looks convenient.
If you copy the content instead of sharing, it looks safer, but it brings another copy if copy_to is used. We could use set_tensor, but that looks like something that intentionally has more copies underneath, which is probably OK in this particular case. This is not a performance-critical part of the sample and we can choose either way, but I am afraid we could introduce bad practices that don't scale well to bigger tensors.
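The ownership trade-off described above can be made concrete with two hypothetical wrapper types: one that shares the vector's memory (and therefore dangles if constructed from a temporary) and one that copies it (safe, but one extra copy). Neither is OpenVINO code; they just illustrate the two ctor semantics being weighed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sharing wrapper: keeps only a pointer into the vector's storage.
// Constructing it from a temporary vector leaves the pointer dangling,
// which is exactly the pitfall described in the comment above.
template <typename T>
struct SharingView {
    const T* data;
    size_t size;
    explicit SharingView(const std::vector<T>& vec)
        : data(vec.data()), size(vec.size()) {}
};

// Copying wrapper: owns its storage, so temporaries are fine,
// at the cost of one extra copy of the data.
template <typename T>
struct OwningBuffer {
    std::vector<T> storage;
    explicit OwningBuffer(const std::vector<T>& vec) : storage(vec) {}
};
```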
llm/cpp/llm.cpp
Outdated
attention_mask.set_shape(mask_shape);
std::fill_n(attention_mask.data<int64_t>(), shape_size(mask_shape), 1);
ireq.get_tensor("position_ids").set_shape({batch_size, 1});
std::fill_n(ireq.get_tensor("position_ids").data<int64_t>(), batch_size, mask_shape.at(1) - 1);
Candidates to be moved under another text-generation building block.
…ck_strings() (#8)
* Replace complex loops with index access, reuse pack_strings() and unpack_strings()
* shape.at(0)->shape[0]
* {1, std::numeric_limits<ov::Dimension::value_type>::max()}->-1
* Move pack_strings()
* Upgrade optimum
* int32_t->int64_t
    }
}
position_ids.set_shape({batch_size, 1});
std::fill_n(position_ids.data<int64_t>(), batch_size, mask_shape.at(1) - 1);
Maybe we can also create a Tensor ctor which initializes all elements with a given value?
ireq.set_tensor("position_ids", ov::Tensor(ov::element::i64, {batch_size, 1}, length_count + prompt_length));
ireq.set_tensor("attention_mask", ov::Tensor(ov::element::i64, {batch_size, length_count + prompt_length + 1}, 1));
instead of:
ov::Tensor attention_mask = ireq.get_tensor("attention_mask");
ov::Shape mask_shape{batch_size, attention_mask.get_shape().at(1) + 1};
attention_mask.set_shape(mask_shape);
std::fill_n(attention_mask.data<int64_t>(), shape_size(mask_shape), 1);
position_ids.set_shape({batch_size, 1});
std::fill_n(position_ids.data<int64_t>(), batch_size, mask_shape.at(1) - 1);
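The fill-constructing ctor proposed above does not exist in ov::Tensor; its core behavior, however, is easy to sketch with a plain buffer. A hypothetical make_full helper allocating shape_size(shape) elements all set to one value (the same thing set_shape plus std::fill_n achieves in two steps):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the proposed ov::Tensor(type, shape, value)
// ctor: compute the element count as the product of the shape's dimensions,
// then allocate a buffer with every element initialized to `value`.
template <typename T>
std::vector<T> make_full(const std::vector<size_t>& shape, T value) {
    size_t count = std::accumulate(shape.begin(), shape.end(), size_t{1},
                                   [](size_t a, size_t b) { return a * b; });
    return std::vector<T>(count, value);
}
```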
Ideally ov::Tensor should have functions similar to torch.Tensor: broadcasting and all that. Sure, use cases like this can be used to prioritize function implementation, but my perception was that OV tries to keep the Tensor API minimal.
But we also want to avoid boilerplate code in samples, so we can extend ov::Tensor with more helpers / constructors
The proposed method is similar to https://pytorch.org/docs/stable/generated/torch.Tensor.new_full.html#torch.Tensor.new_full
Maybe we can create static methods ov::Tensor::new_full, ov::Tensor::new_ones.
Included in openvinotoolkit#81
* enable
* libtbb-dev
* move
* slash
* install
* core_genai_dev
* remove export
* reorganise components
* add SOVERSION, and requirements-build.txt
* replace SKBUILD with EXCLUDE_FROM_ALL because the effect is the same
* fix NAMELINK_COMPONENT
* remove extra line
* add soft restrictions
* Fix build to unblock packaging
* improve naming
* install samples
* remove quotes
* use main target name because an alias can't be specified in cmake --target
* define CMAKE_BUILD_PARALLEL_LEVEL
* Ensure ./requirements-build.txt won't outdate
* Use ./requirements-build.txt in python lib build
* Add missing &&
* Test Debug
* add matrix for windows_genai_package
* openvino_tokenizers from fork
* update openvino_tokenizers
* update openvino_tokenizers
* update openvino_tokenizers
* revert openvino_tokenizers
* tokenizers from fork
* update tokenizers
* centos7_2024.2.0.dev
* copy target
* revert tokenizers
* reapply useful changes
* copy so only
* fix CMAKE_BUILD_PARALLEL_LEVEL
No description provided.