Conversation

Contributor

@smirnov-alexey smirnov-alexey commented Oct 2, 2025

E-186667
#32025 should allow an easy additional change to set the kvcache from the generate model to prefill's input. Even if it will be a strided view, all PRs combined (including the follow-up) should reduce overall memory consumption, since the kvcache copy is done asynchronously.
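
A minimal sketch of the idea described above (editor's illustration; the function name and exact view mechanics are assumptions, not this PR's code): take a strided ROI view over the generate model's kvcache and run the copy into prefill's input in the background.

```cpp
#include <future>
#include <openvino/core/coordinate.hpp>
#include <openvino/runtime/tensor.hpp>

// Hypothetical illustration: the ROI constructor yields a strided view
// (no data is copied here), and the actual copy into the prefill input
// overlaps with other work via std::async.
std::future<void> schedule_kvcache_copy(const ov::Tensor& generate_kvcache,
                                        ov::Tensor prefill_input,
                                        const ov::Coordinate& begin,
                                        const ov::Coordinate& end) {
    ov::Tensor view(generate_kvcache, begin, end);  // strided view, zero-copy
    return std::async(std::launch::async, [view, prefill_input]() mutable {
        view.copy_to(prefill_input);  // handles the strides during the copy
    });
}
```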

@smirnov-alexey smirnov-alexey requested review from a team as code owners October 2, 2025 13:53
@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Oct 2, 2025
Contributor

@esmirno esmirno left a comment

Overall good; please review the comments and address them if they make sense.

return;
}

bool ov::npuw::IBaseInferRequest::is_not_stored_io(const ov::Output<const ov::Node>& port) const {
Contributor

Usually it is more readable to phrase these as direct questions: is_io_stored, or is_stored.
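
To illustrate the naming point (hypothetical call sites, not this PR's code):

```cpp
// Double negative: the reader has to invert twice.
if (!request.is_not_stored_io(port)) { /* ... */ }

// Direct question: reads straight through.
if (request.is_stored_io(port)) { /* ... */ }
```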

Contributor Author

Done, thanks!

break;
}
}
for (std::size_t i = 0; i < m_npuw_model->outputs().size(); ++i) {
Contributor

is_io is already set - it's better to use a lambda where you can just return immediately, or even here you can return inside each loop and leave OV_THROW or NPUW_ASSERT at the end if you need to log an error.
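
A rough sketch of the suggested shape (editor's illustration; assumes the loop structure from this PR and the codebase's NPUW_ASSERT macro): fold the scan into a lambda so each match returns immediately, and assert in the fallthrough.

```cpp
// Hypothetical restructuring: early returns instead of break + flag.
const auto port_is_io = [&](const ov::Output<const ov::Node>& port) {
    for (const auto& in : m_npuw_model->inputs()) {
        if (in == port) {
            return true;
        }
    }
    for (const auto& out : m_npuw_model->outputs()) {
        if (out == port) {
            return true;
        }
    }
    return false;
};
NPUW_ASSERT(port_is_io(port));  // fail loudly if the port is unknown
```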

Contributor Author

Reworked with a separate simple is_io function, thanks

ov::npuw::TensorPtr ov::npuw::IBaseInferRequest::allocMem(const ov::element::Type type,
                                                          const ov::Shape& shape,
-                                                         const std::string& device) {
+                                                         const std::string& device) const {
Contributor

alloc is usually not const? Why is this needed?

Contributor

Maybe just make m_footprint mutable, since it kind of does not change the state of the infer request?
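
A minimal sketch of the mutable alternative (editor's illustration; the class and field shapes are assumed, not taken from this PR): the footprint counter is bookkeeping, not observable request state, so const methods may update it.

```cpp
#include <cstddef>
#include <openvino/core/shape.hpp>
#include <openvino/core/type/element_type.hpp>

// Hypothetical: 'mutable' lets logically-const methods keep the
// accounting field up to date without a const_cast.
class FootprintExample {
    mutable std::size_t m_footprint = 0;

public:
    void account(const ov::element::Type& type, const ov::Shape& shape) const {
        m_footprint += ov::shape_size(shape) * type.size();  // bytes allocated
    }
    std::size_t bytes() const {
        return m_footprint;
    }
};
```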

Contributor Author

It's needed for get_tensor() const

@dmatveev dmatveev added this to the 2025.4 milestone Oct 10, 2025
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Oct 13, 2025
@github-actions github-actions bot removed the category: build OpenVINO cmake script / infra label Oct 13, 2025
@dmatveev dmatveev removed this from the 2025.4 milestone Oct 17, 2025
@dmatveev
Contributor

Added the do-not-merge label to wait for 2025.4 CF.

Contributor

@dmatveev dmatveev left a comment

The changes are okay-ish, but please get rid of is_io(). It is a quadratic way to say true (which it always is).

Comment on lines +202 to +204
if (is_io(port)) {
m_port_to_tensor.at(port).persistent = true;
}
Contributor

set_tensor is always IO. It is the external API; it is never called for the internal (cross-subgraph) connections. So I believe this code here is not necessary and the above assignment should be TensorStorage{.., true}?
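
In other words (editor's sketch; TensorStorage's exact fields are assumed from the diff context), the flag would be set at construction instead of being patched afterwards:

```cpp
// Hypothetical one-liner replacing the post-hoc patch: set_tensor() only
// ever sees IO ports, so mark the entry persistent up front.
m_port_to_tensor[port] = TensorStorage{tensor, /*persistent=*/true};
```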

Comment on lines +232 to +244
bool ov::npuw::IBaseInferRequest::is_io(const ov::Output<const ov::Node>& port) const {
for (std::size_t i = 0; i < m_npuw_model->inputs().size(); ++i) {
if (m_npuw_model->inputs()[i] == port) {
return true;
}
}
for (std::size_t i = 0; i < m_npuw_model->outputs().size(); ++i) {
if (m_npuw_model->outputs()[i] == port) {
return true;
}
}
return false;
}
Contributor

That's 128 checks for a regular language model, so to say.
And this method will be called roughly the same number of times as part of the pipeline setup.

So 128*128 = 16K checks for nothing.

Even if it is only called for the prefill's input kvcache, that's 64*128 = 8K checks for nothing.
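
If such a lookup were ever genuinely needed (the review above asks to drop it entirely), a hypothetical way to avoid the repeated linear scans (editor's sketch; helper names assumed) is to collect the IO ports once and query a set per call:

```cpp
#include <memory>
#include <set>
#include <openvino/core/model.hpp>

// Hypothetical caching variant: build the IO-port set once at setup
// (ov::Output defines operator<), then each query is O(log n) instead
// of a linear scan over every model input and output.
using PortSet = std::set<ov::Output<const ov::Node>>;

PortSet collect_io_ports(const std::shared_ptr<const ov::Model>& model) {
    PortSet ports;
    for (const auto& in : model->inputs()) {
        ports.insert(in);
    }
    for (const auto& out : model->outputs()) {
        ports.insert(out);
    }
    return ports;
}

bool is_io_cached(const PortSet& ports, const ov::Output<const ov::Node>& port) {
    return ports.count(port) != 0;
}
```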

allocOut(iport, m_npuw_model->funcall_mem_device(real_idx));
m_spatial_io[real_idx].input_tails[p.idx] = allocOut(
iport,
m_npuw_model->funcall_mem_device(real_idx)); // should it be handled lazy way as well?
Contributor

no

allocOut(oport, m_npuw_model->funcall_mem_device(real_idx));
m_spatial_io[real_idx].output_tails[out_idx] = allocOut(
oport,
m_npuw_model->funcall_mem_device(real_idx)); // should it be handled lazy way as well?
Contributor

no


Labels

category: NPU (OpenVINO NPU plugin), category: NPUW (NPUW plugin), do not merge
