[NPUW] Introduce lazy I/O allocation for infer requests #32277
Conversation
esmirno left a comment
Overall good; please review the comments and address them where they make sense.
```cpp
    return;
}

bool ov::npuw::IBaseInferRequest::is_not_stored_io(const ov::Output<const ov::Node>& port) const {
```
Usually it is more readable to use direct questions: is_io_stored, or just is_stored.
Done, thanks!
```cpp
        break;
    }
}
for (std::size_t i = 0; i < m_npuw_model->outputs().size(); ++i) {
```
is_io is already set here. Better to use a lambda so you can just return immediately, or you can even return inside each loop here and leave an OV_THROW or NPUW_ASSERT at the end if you need to log an error.
Reworked with a separate simple is_io function, thanks. (The lambda pattern is sketched below.)
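For illustration, the lambda pattern suggested above could look like the sketch below; the Port type and function signature are placeholders, not the actual NPUW code (the final rework used a separate is_io helper instead):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for a model port.
struct Port {
    int id;
    bool operator==(const Port& other) const { return id == other.id; }
};

// Instead of setting an is_io flag and break-ing out of each loop, an
// immediately-invoked lambda (or a small helper function) can simply
// return on the first match.
bool is_io(const std::vector<Port>& inputs, const std::vector<Port>& outputs, const Port& port) {
    return [&]() {
        for (std::size_t i = 0; i < inputs.size(); ++i) {
            if (inputs[i] == port) {
                return true;
            }
        }
        for (std::size_t i = 0; i < outputs.size(); ++i) {
            if (outputs[i] == port) {
                return true;
            }
        }
        return false;
    }();
}
```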
```diff
 ov::npuw::TensorPtr ov::npuw::IBaseInferRequest::allocMem(const ov::element::Type type,
                                                           const ov::Shape& shape,
-                                                          const std::string& device) {
+                                                          const std::string& device) const {
```
alloc is usually not const, so why is this needed?
Maybe just make m_footprint mutable, since it doesn't really change the state of the infer request?
It's needed for get_tensor() const
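For context, the pattern under discussion might look like this minimal sketch: get_tensor() is const in the public API, so a lazily-allocating path reached from it must be const as well, which is what mutable bookkeeping members enable. All names here are illustrative, not the actual NPUW types:

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-ins, not the real NPUW types.
struct Tensor {};
using TensorPtr = std::shared_ptr<Tensor>;

class Request {
public:
    // get_tensor() is const in the public API, so everything it calls,
    // including the lazy allocation below, must be const too.
    TensorPtr get_tensor(const std::string& port) const {
        auto it = m_cache.find(port);
        if (it == m_cache.end()) {
            it = m_cache.emplace(port, alloc_mem()).first;
        }
        return it->second;
    }

private:
    TensorPtr alloc_mem() const {
        // m_footprint is mutable: it is allocation bookkeeping, not the
        // logical state of the infer request, so updating it from a
        // const method is the conventional trade-off.
        m_footprint += sizeof(Tensor);
        return std::make_shared<Tensor>();
    }

    mutable std::unordered_map<std::string, TensorPtr> m_cache;
    mutable std::size_t m_footprint = 0;
};
```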
Force-pushed 804c548 to b258ed0 (…into as/npuw_lazy_io_alloc)
Added the do-not-merge label to wait for the 2025.4 CF.
src/plugins/intel_npu/src/plugin/npuw/base_sync_infer_request.cpp (outdated; resolved)
dmatveev left a comment
The changes are okay-ish, but please get rid of is_io(): it is a quadratic way to say true (which is what it always is).
```cpp
if (is_io(port)) {
    m_port_to_tensor.at(port).persistent = true;
}
```
set_tensor is always I/O: it is the external API, and it is never called for the internal (cross-subgraph) connections. So I believe this code here is not necessary, and the assignment above should be TensorStorage{.., true}?
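A minimal sketch of the suggested simplification, assuming a TensorStorage aggregate with a persistent flag as discussed above; the Port and Tensor types here are placeholders, not the actual NPUW code:

```cpp
#include <map>
#include <memory>

// Illustrative stand-ins for the real NPUW types.
struct Tensor {};
using TensorPtr = std::shared_ptr<Tensor>;
struct Port {
    int id;
    bool operator<(const Port& other) const { return id < other.id; }
};

struct TensorStorage {
    TensorPtr tensor;
    bool persistent = false;
};

class Request {
public:
    // set_tensor() is the external API and is only ever called on I/O
    // ports, so the entry can be marked persistent at construction time
    // and no is_io() lookup is needed afterwards.
    void set_tensor(const Port& port, const TensorPtr& tensor) {
        m_port_to_tensor[port] = TensorStorage{tensor, /*persistent=*/true};
    }

private:
    std::map<Port, TensorStorage> m_port_to_tensor;
};
```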
```cpp
bool ov::npuw::IBaseInferRequest::is_io(const ov::Output<const ov::Node>& port) const {
    for (std::size_t i = 0; i < m_npuw_model->inputs().size(); ++i) {
        if (m_npuw_model->inputs()[i] == port) {
            return true;
        }
    }
    for (std::size_t i = 0; i < m_npuw_model->outputs().size(); ++i) {
        if (m_npuw_model->outputs()[i] == port) {
            return true;
        }
    }
    return false;
}
```
That's ~128 checks for a regular language model, so to say. And this method will be called roughly the same number of times as part of the pipeline setup, so 128 × 128 = 16K checks for nothing. Even if it is only called for the prefill's input kvcache, that's 64 × 128 = 8K checks for nothing.
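For reference, if such a membership test were ever genuinely needed (the review above suggests it can simply be dropped here), the usual fix is to precompute the port set once so each query is an O(1) hash lookup instead of a linear scan. A sketch with illustrative types:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Illustrative stand-in for a model port; the real code uses
// ov::Output<const ov::Node>.
struct Port {
    const void* node;
    std::size_t index;
    bool operator==(const Port& o) const { return node == o.node && index == o.index; }
};

struct PortHash {
    std::size_t operator()(const Port& p) const {
        return std::hash<const void*>()(p.node) ^ (p.index << 1);
    }
};

class Request {
public:
    Request(const std::vector<Port>& inputs, const std::vector<Port>& outputs) {
        // Build the set once at construction: O(inputs + outputs).
        m_io_ports.insert(inputs.begin(), inputs.end());
        m_io_ports.insert(outputs.begin(), outputs.end());
    }

    // Each call is now an average O(1) hash lookup instead of scanning
    // all ports, turning 128 x 128 checks into 128 lookups.
    bool is_io(const Port& port) const { return m_io_ports.count(port) > 0; }

private:
    std::unordered_set<Port, PortHash> m_io_ports;
};
```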
```diff
-    allocOut(iport, m_npuw_model->funcall_mem_device(real_idx));
+    m_spatial_io[real_idx].input_tails[p.idx] = allocOut(
+        iport,
+        m_npuw_model->funcall_mem_device(real_idx));  // should it be handled lazy way as well?
```
no
```diff
-    allocOut(oport, m_npuw_model->funcall_mem_device(real_idx));
+    m_spatial_io[real_idx].output_tails[out_idx] = allocOut(
+        oport,
+        m_npuw_model->funcall_mem_device(real_idx));  // should it be handled lazy way as well?
```
no
E-186667
#32025 should allow an easy follow-up change to set the kvcache from the generate model to the prefill's input. Even if it ends up being a strided view, all PRs combined (including the follow-up) should reduce overall memory consumption, since the kvcache copy is done asynchronously.