[NPUW] Introduce lazy I/O allocation for infer requests #32277
Conversation
esmirno left a comment
Overall good; please review the comments and address them where they make sense.
```cpp
    return;
}

bool ov::npuw::IBaseInferRequest::is_not_stored_io(const ov::Output<const ov::Node>& port) const {
```
Usually it is more readable to use direct questions: is_io_stored, or just is_stored.
Done, thanks!
```cpp
        break;
    }
}
for (std::size_t i = 0; i < m_npuw_model->outputs().size(); ++i) {
```
is_io is already set here. Better to use a lambda so you can just return immediately, or you can even return inside each loop here and leave an OV_THROW or NPUW_ASSERT at the end if you need to log an error.
Reworked with a separate simple is_io function, thanks. (The lambda pattern is sketched below.)
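For illustration, the lambda pattern suggested above could look like the sketch below; the Port type and function signature are placeholders, not the actual NPUW code (the final rework used a separate is_io helper instead):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for a model port.
struct Port {
    int id;
    bool operator==(const Port& other) const { return id == other.id; }
};

// Instead of setting an is_io flag and break-ing out of each loop, an
// immediately-invoked lambda (or a small helper function) can simply
// return on the first match.
bool is_io(const std::vector<Port>& inputs, const std::vector<Port>& outputs, const Port& port) {
    return [&]() {
        for (std::size_t i = 0; i < inputs.size(); ++i) {
            if (inputs[i] == port) {
                return true;
            }
        }
        for (std::size_t i = 0; i < outputs.size(); ++i) {
            if (outputs[i] == port) {
                return true;
            }
        }
        return false;
    }();
}
```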
```diff
 ov::npuw::TensorPtr ov::npuw::IBaseInferRequest::allocMem(const ov::element::Type type,
                                                           const ov::Shape& shape,
-                                                          const std::string& device) {
+                                                          const std::string& device) const {
```
alloc is usually not const, so why is this needed?
Maybe just make m_footprint mutable, since it doesn't really change the state of the infer request?
It's needed for get_tensor() const
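For context, the pattern under discussion might look like this minimal sketch: get_tensor() is const in the public API, so a lazily-allocating path reached from it must be const as well, which is what mutable bookkeeping members enable. All names here are illustrative, not the actual NPUW types:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-ins, not the real NPUW types.
struct Tensor {};
using TensorPtr = std::shared_ptr<Tensor>;

class Request {
public:
    // get_tensor() is const in the public API, so everything it calls,
    // including the lazy allocation below, must be const too.
    TensorPtr get_tensor(const std::string& port) const {
        auto it = m_cache.find(port);
        if (it == m_cache.end()) {
            it = m_cache.emplace(port, alloc_mem()).first;
        }
        return it->second;
    }

private:
    TensorPtr alloc_mem() const {
        // m_footprint is mutable: it is allocation bookkeeping, not the
        // logical state of the infer request, so updating it from a
        // const method is the conventional trade-off.
        m_footprint += sizeof(Tensor);
        return std::make_shared<Tensor>();
    }

    mutable std::unordered_map<std::string, TensorPtr> m_cache;
    mutable std::size_t m_footprint = 0;
};
```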
Force-pushed 804c548 to b258ed0 (…into as/npuw_lazy_io_alloc)
Added the do-not-merge label to wait for the 2025.4 CF.
src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp (outdated; resolved)
dmatveev left a comment
The changes are okay-ish, but please get rid of is_io(): it is a quadratic way to say true (which is what it always is).
```cpp
if (is_io(port)) {
    m_port_to_tensor.at(port).persistent = true;
}
```
set_tensor is always I/O: it is the external API, and it is never called for the internal (cross-subgraph) connections. So I believe this code here is not necessary, and the assignment above should be TensorStorage{.., true}?
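A minimal sketch of the suggested simplification, assuming a TensorStorage aggregate with a persistent flag as discussed above; the Port and Tensor types here are placeholders, not the actual NPUW code:

```cpp
#include <map>
#include <memory>

// Illustrative stand-ins for the real NPUW types.
struct Tensor {};
using TensorPtr = std::shared_ptr<Tensor>;
struct Port {
    int id;
    bool operator<(const Port& other) const { return id < other.id; }
};

struct TensorStorage {
    TensorPtr tensor;
    bool persistent = false;
};

class Request {
public:
    // set_tensor() is the external API and is only ever called on I/O
    // ports, so the entry can be marked persistent at construction time
    // and no is_io() lookup is needed afterwards.
    void set_tensor(const Port& port, const TensorPtr& tensor) {
        m_port_to_tensor[port] = TensorStorage{tensor, /*persistent=*/true};
    }

private:
    std::map<Port, TensorStorage> m_port_to_tensor;
};
```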
```cpp
bool ov::npuw::IBaseInferRequest::is_io(const ov::Output<const ov::Node>& port) const {
    for (std::size_t i = 0; i < m_npuw_model->inputs().size(); ++i) {
        if (m_npuw_model->inputs()[i] == port) {
            return true;
        }
    }
    for (std::size_t i = 0; i < m_npuw_model->outputs().size(); ++i) {
        if (m_npuw_model->outputs()[i] == port) {
            return true;
        }
    }
    return false;
}
```
That's ~128 checks for a regular language model, so to say. And this method will be called roughly the same number of times as part of the pipeline setup, so 128 × 128 = 16K checks for nothing. Even if it is only called for the prefill's input kvcache, that's 64 × 128 = 8K checks for nothing.
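For reference, if such a membership test were ever genuinely needed (the review above suggests it can simply be dropped here), the usual fix is to precompute the port set once so each query is an O(1) hash lookup instead of a linear scan. A sketch with illustrative types:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Illustrative stand-in for a model port; the real code uses
// ov::Output<const ov::Node>.
struct Port {
    const void* node;
    std::size_t index;
    bool operator==(const Port& o) const { return node == o.node && index == o.index; }
};

struct PortHash {
    std::size_t operator()(const Port& p) const {
        return std::hash<const void*>()(p.node) ^ (p.index << 1);
    }
};

class Request {
public:
    Request(const std::vector<Port>& inputs, const std::vector<Port>& outputs) {
        // Build the set once at construction: O(inputs + outputs).
        m_io_ports.insert(inputs.begin(), inputs.end());
        m_io_ports.insert(outputs.begin(), outputs.end());
    }

    // Each call is now an average O(1) hash lookup instead of scanning
    // all ports, turning 128 x 128 checks into 128 lookups.
    bool is_io(const Port& port) const { return m_io_ports.count(port) > 0; }

private:
    std::unordered_set<Port, PortHash> m_io_ports;
};
```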
```diff
-    allocOut(iport, m_npuw_model->funcall_mem_device(real_idx));
+    m_spatial_io[real_idx].input_tails[p.idx] = allocOut(
+        iport,
+        m_npuw_model->funcall_mem_device(real_idx));  // should it be handled lazy way as well?
```
no
```diff
-    allocOut(oport, m_npuw_model->funcall_mem_device(real_idx));
+    m_spatial_io[real_idx].output_tails[out_idx] = allocOut(
+        oport,
+        m_npuw_model->funcall_mem_device(real_idx));  // should it be handled lazy way as well?
```
no
E-186667
#32025 should allow an easy follow-up change to set the kvcache from the generate model to the prefill's input. Even if it ends up being a strided view, all PRs combined (including the follow-up) should reduce overall memory consumption, since the kvcache copy is done asynchronously.