[CPU]whisper readvalue optimize #26130
base: master
Conversation
Profile each node execute time. Support Static and Dynamic infer. Signed-off-by: xipingya <xiping.yan@intel.com>
If reset is not called, these marked nodes also don't need to be executed. Signed-off-by: xipingya <xiping.yan@intel.com>
decoder network: 20 ms -> 5 ms. Signed-off-by: xipingya <xiping.yan@intel.com>
Compare 569f53f to dd5dd8a
…r this. 2: Fix ReadValueAssignTest fail issue; just make sure "initOptimalPrimitiveDescriptor" doesn't change the original primitive. Signed-off-by: xipingya <xiping.yan@intel.com>
…toolkit#26819) ### Details: - *Pattern: QKV_Reshape -> QKV_Transpose -> SDPA -> OUT_Transpose -> OUT_Reshape* - *Fuse this pattern into: SDPA* - *This hotspot can be observed after openvinotoolkit#26130; this PR's implementation doesn't depend on it.* ### Tickets: - *153616* --------- Signed-off-by: xipingya <xiping.yan@intel.com>
…aph input memory. avoid data corruption. Signed-off-by: xipingya <xiping.yan@intel.com>
MemoryInput(const std::shared_ptr<ov::Node>& op, const GraphContext::CPtr ctx);
MemoryInput(const std::string id,
            const std::string& name,
            const std::string& type,
            const Shape& output_shape,
            const ov::element::Type& output_prc,
            const GraphContext::CPtr context,
            const ov::optional<std::vector<Shape>>& input_shape,
            const ov::optional<std::vector<ov::element::Type>>& input_prc,
            std::shared_ptr<ov::Model> func,
            mode mode = mode::read_value_assign);
Since you define MemoryInput constructors, please remove the constructor inheritance (the line above).
Removed
using MemoryInputBase::MemoryInputBase;
private:
    std::shared_ptr<ov::Model> body = nullptr;
    ov::intel_cpu::Graph subGraph;
Suggest using unique_ptr to reduce the size of the MemoryInput object, as we don't always need to allocate memory for subGraph.
I don't think so, because body is also used in other places, and getSubGraph() also returns it to the outside.
If we use unique_ptr, it cannot be copied or assigned.
I'm asking not about the body but about the subGraph.
But I found the CPU plugin only supports C++11, and std::make_unique is only available since C++14.
Upgrading from C++11 to C++14 would have a big impact.
So I use shared_ptr for now.
if (haveSubgraph() && isDynamic) {
    // Update to MemInpSingleShapeInfer
    shapeInference = PassThroughShapeInferFactory(body).makeShapeInfer();
Are you sure that this is correct? The input subgraph being fused into MemoryInput may produce shapes different from the Parameters attached to MemoryInput. In this case we should use InternalDynShapeInferFactory.
As you said: "The input subgraph being fused into MemoryInput may produce shapes different from the Parameters attached to MemoryInput."
That is my reason for adding this shape infer:
https://github.com/xipingyan/openvino/blob/b621364e611bf7304a2d286fb914d296f06a4ae5/src/plugins/intel_cpu/src/shape_inference/shape_inference_pass_through.hpp#L36-L49
We need to run m_body->validate_nodes_and_infer_types(); in order to get the final real shape.
We actually don't. We can postpone this step until execute; this is how the internal dynamic nodes work. Please take a look at the Composite node as well. The additional shape inference you introduced adds unnecessary overhead, as it's being run on every execute call, while we really need the output shapes from the subgraph only when its processing is required (only when the state is reset); in all other cases we retrieve the output shape from the state itself.
I know it introduces an unnecessary shape inference if there is no reset state. I will try to move it to runtime.
subGraph.Init(body, context, graphInputConfig, graphOutputConfig);
}
Please add a sanity check that the subgraph's actual input/output primitive descriptors are compatible with the ones from the MemoryInput node config.
I also did another test, and I found that I should not add a sanity check for the input and output primitives.
In order to avoid inserting a reorder outside of ReadValueWithSubgraph, the inner subgraph input uses the ReadValueWithSubgraph parent primitives, and the inner subgraph output uses the ReadValueWithSubgraph child primitives. They may legitimately be incompatible.
INTERNAL_OP_SCOPE(intel_cpu_ReadValueWithSubgraphNode_clone_with_new_inputs);
check_new_args_count(this, new_args);
auto op = std::make_shared<ov::intel_cpu::ReadValueWithSubgraph>();
Shouldn't we pass the variable here?
Resolved (outdated) review threads:
...ns/intel_cpu/src/transformations/cpu_opset/common/pass/move_readvalue_inputs_to_subgraph.hpp
...ns/intel_cpu/src/transformations/cpu_opset/common/pass/move_readvalue_inputs_to_subgraph.cpp
...ns/intel_cpu/src/transformations/cpu_opset/common/pass/move_readvalue_inputs_to_subgraph.cpp
if (node->get_output_target_inputs(0).size() == 0u) {
    found_output = true;
    return;
}
It does make sense to store the whole DFS subtree, since this is a path to the output node; checking whether a child node is part of such a subtree may speed up subsequent DFS calls.
In the current implementation we store only the nodes visited in reverse_dfs. In the worst case, when we have N/2 nodes on the ReadValue path, all of them parents of the same subgraph leading to an output, we will have N/2 * N/2 iterations, which is still O(N^2). But if we store the output path as well, all the nodes will be visited once, giving O(N) complexity.
Yes, all visited nodes should also be flagged. Updated.
…n the future, it may be deleted. Signed-off-by: xipingya <xiping.yan@intel.com>
2: Adopt the parent configuration to avoid inserting a reorder before the MemoryInput. 3: Move prepare params to runDynamic, because it is not called each time.
It is the responsibility of MemoryInputSingle or MemoryOutput.
Add: CPU_GRAPH_OPTIMIZER_SCOPE(DropRedundantMemoryOutput_SubGraph); before creating the edge, call graph.RemoveEdge(parentEdge); Signed-off-by: xipingya <xiping.yan@intel.com>
…litply parents edges. Signed-off-by: xipingya <xiping.yan@intel.com>
Update comments: // Flag: find Output node
Details:
- New ReadValueWithSubgraph node.
- Move ReadValue's initial subgraph nodes to ReadValueWithSubgraph.
- Map ReadValueWithSubgraph to MemoryInput.
- Reuse Init and Activate of ov::intel_cpu::Graph to avoid memory copy. Refer: [CPU] Introduce SubModel op and Composite node #25385
Tickets: