[CPU]whisper readvalue optimize #26130
base: master
Conversation
Profile each node execute time. Support Static and Dynamic infer. Signed-off-by: xipingya <xiping.yan@intel.com>
If reset is not called, these marked nodes also don't need to be executed. Signed-off-by: xipingya <xiping.yan@intel.com>
decoder network: 20 ms -> 5 ms. Signed-off-by: xipingya <xiping.yan@intel.com>
Compare 569f53f to dd5dd8a
…r this. 2: Fix ReadValueAssignTest fail issue; just make sure "initOptimalPrimitiveDescriptor" doesn't change the original primitive. Signed-off-by: xipingya <xiping.yan@intel.com>
…toolkit#26819) ### Details: - *Pattern: QKV_Reshape -> QKV_Transpose -> SDPA -> OUT_Transpose -> OUT_Reshape* - *Fuse this pattern into: SDPA* - *This hotspot can be observed after openvinotoolkit#26130; this PR's implementation doesn't depend on it.* ### Tickets: - *153616* --------- Signed-off-by: xipingya <xiping.yan@intel.com>
…aph input memory. avoid data corruption. Signed-off-by: xipingya <xiping.yan@intel.com>
MemoryInput(const std::shared_ptr<ov::Node>& op, const GraphContext::CPtr ctx);
MemoryInput(const std::string id,
            const std::string& name,
            const std::string& type,
            const Shape& output_shape,
            const ov::element::Type& output_prc,
            const GraphContext::CPtr context,
            const ov::optional<std::vector<Shape>>& input_shape,
            const ov::optional<std::vector<ov::element::Type>>& input_prc,
            std::shared_ptr<ov::Model> func,
            mode mode = mode::read_value_assign);
Since you define MemoryInput constructors, please remove the constructor inheritance (the line above).
Removed
using MemoryInputBase::MemoryInputBase;
private:
    std::shared_ptr<ov::Model> body = nullptr;
    ov::intel_cpu::Graph subGraph;
Suggest using unique_ptr to reduce the size of the MemoryInput object, as we don't always need to allocate memory for subGraph.
I don't think so, because body is also used in other places, and getSubGraph() also returns it to the outside.
If we use unique_ptr, it cannot be copied or assigned.
I'm asking not about the body but about the subGraph.
But I found the CPU plugin only supports C++11, and std::make_unique is only available since C++14.
Upgrading from C++11 to C++14 would have a big impact.
So I use shared_ptr for now.
if (haveSubgraph() && isDynamic) {
    // Update to MemInpSingleShapeInfer
    shapeInference = PassThroughShapeInferFactory(body).makeShapeInfer();
Are you sure that this is correct? The input subgraph being fused into MemoryInput may produce shapes different from the Parameters attached to MemoryInput. In this case we should use InternalDynShapeInferFactory.
As you said: "The input subgraph being fused into MemoryInput may produce shapes different from the Parameters attached to MemoryInput."
That is my reason for adding this shape infer:
https://github.com/xipingyan/openvino/blob/b621364e611bf7304a2d286fb914d296f06a4ae5/src/plugins/intel_cpu/src/shape_inference/shape_inference_pass_through.hpp#L36-L49
We need to run m_body->validate_nodes_and_infer_types(); in order to get the final real shape.
We actually don't. We can postpone this step until execute; this is how the internal dynamic nodes work. Please take a look at the Composite node as well. The additional shape inference you introduced adds unnecessary overhead, as it's being run on every execute call, while we really need the output shapes from the subgraph only when its processing is required (only when the state is reset); in all other cases we retrieve the output shape from the state itself.
I know it introduces an unnecessary shape inference if there is no reset state. I will try to move it to runtime.
subGraph.Init(body, context, graphInputConfig, graphOutputConfig);
}
Please add a sanity check that the subgraph's actual input/output primitive descriptors are compatible with the ones from the MemoryInput node config.
I also did another test, and I found that I should not add a sanity check for the input and output primitives.
In order to avoid inserting a reorder outside of ReadValueWithSubgraph, the inner subgraph input uses the ReadValueWithSubgraph parent primitives, and the inner subgraph output uses the ReadValueWithSubgraph child primitives. They may legitimately be incompatible.
INTERNAL_OP_SCOPE(intel_cpu_ReadValueWithSubgraphNode_clone_with_new_inputs);
check_new_args_count(this, new_args);
auto op = std::make_shared<ov::intel_cpu::ReadValueWithSubgraph>();
Shouldn't we pass the variable here?
Resolved (outdated) review threads:
...ns/intel_cpu/src/transformations/cpu_opset/common/pass/move_readvalue_inputs_to_subgraph.hpp
...ns/intel_cpu/src/transformations/cpu_opset/common/pass/move_readvalue_inputs_to_subgraph.cpp
...ns/intel_cpu/src/transformations/cpu_opset/common/pass/move_readvalue_inputs_to_subgraph.cpp
if (node->get_output_target_inputs(0).size() == 0u) {
    found_output = true;
    return;
}
It does make sense to store the whole DFS subtree, since this is a path to the output node; checking whether a child node is part of such a subtree may speed up subsequent DFS calls.
In the current implementation we store only the nodes visited in reverse_dfs. In the worst case, when we have N/2 nodes on the ReadValue path, all of them parents of the same subgraph leading to an output, we will have N/2 * N/2 iterations, which is still O(N^2). But if we store the output path as well, all the nodes will be visited once, giving O(N) complexity.
Yes, all visited nodes should also be flagged. Updated.
…n the future, it may be deleted. Signed-off-by: xipingya <xiping.yan@intel.com>
2: Adopt the parent configuration to avoid inserting a reorder before the MemoryInput. 3: Move prepare params to runDynamic, because it is not called each time.
It is the responsibility of MemoryInputSingle or MemoryOutput.
Add: CPU_GRAPH_OPTIMIZER_SCOPE(DropRedundantMemoryOutput_SubGraph); before creating the edge, call graph.RemoveEdge(parentEdge); Signed-off-by: xipingya <xiping.yan@intel.com>
…litply parents edges. Signed-off-by: xipingya <xiping.yan@intel.com>
Update comments: // Flag: find Output node
Details:
- New ReadValueWithSubgraph node.
- Move ReadValue's initial subgraph nodes to ReadValueWithSubgraph.
- Map ReadValueWithSubgraph to MemoryInput.
- Reuse Init and Activate of ov::intel_cpu::Graph to avoid memory copy. Refer: [CPU] Introduce SubModel op and Composite node #25385
Tickets: