[PASS] Layout transform pass #233
  return false;
}

inline LayoutInfo GetLayout(const nnvm::OpMap<FTVMLayoutInfo>& layouts,
Could we have a function called `CombineLayout`? (`vector -> vector`)
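The suggestion above could look something like the following sketch. `CombineLayout` and its signature are hypothetical (the reviewer only proposed the name), and `LayoutInfo` is a minimal stand-in for the struct used in this pass:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal stand-in for the pass's LayoutInfo struct.
struct LayoutInfo {
  std::string src;
  std::string dst;
};

// Hypothetical CombineLayout: pair producer output layouts with consumer
// input layouts, emitting one LayoutInfo per edge that actually needs a
// transform (i.e. where the layouts differ).
std::vector<LayoutInfo> CombineLayout(const std::vector<std::string>& produced,
                                      const std::vector<std::string>& consumed) {
  assert(produced.size() == consumed.size());
  std::vector<LayoutInfo> transforms;
  for (size_t i = 0; i < produced.size(); ++i) {
    if (produced[i] != consumed[i]) {
      // Layouts disagree on this edge: record the needed transform.
      transforms.push_back({produced[i], consumed[i]});
    }
  }
  return transforms;
}
```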
LayoutInfo olayout = GetLayout(olayouts, in, e.index);
LayoutInfo ilayout = GetLayout(ilayouts, n, idx);
if (IsPair(olayout, ilayout)) {
  break;
What about other inputs that might need a layout change? Is `break` the right approach here?
Sorry, that should be `continue`.
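The difference the reviewers are pointing at can be shown with a small control-flow sketch (simplified stand-in logic, not the pass's actual code): with `break`, the first input whose layouts already pair up stops the scan entirely, so later inputs are never checked; with `continue`, every input is still visited.

```cpp
#include <cassert>
#include <vector>

// Count how many inputs still need a layout transform. `is_pair[i]`
// stands in for IsPair(olayout, ilayout) on input i.
int CountMismatches(const std::vector<bool>& is_pair) {
  int mismatches = 0;
  for (bool paired : is_pair) {
    if (paired) continue;  // layouts already match: skip this input,
                           // but keep scanning the remaining ones
    ++mismatches;          // this input still needs a transform
  }
  return mismatches;
}
```

Had the loop used `break` instead of `continue`, the two mismatched inputs after the first paired one would be silently skipped.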
if (olayouts.count(e.node->op())) {
  LayoutInfo layout = GetLayout(olayouts, e.node, e.index);
  nnvm::NodePtr tnode =
      CreateLayoutTransformNode(layout.src, layout.dst);
Always assert that the output is transformed back to NCHW.
This logic does not take advantage of the transform cache. Consider changing it so that the output transform node is always eagerly created (from its producer), and each consumer then chooses whether to consume it.
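The caching scheme suggested above could be sketched as follows. All names here (`TransformCache`, `GetOrCreate`) are hypothetical illustrations, and `Node` is a stand-in for `nnvm::Node`; the point is only that a transform node is created once per (producer, output index) and reused by every consumer:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for nnvm::Node.
struct Node { std::string op; };
using NodePtr = std::shared_ptr<Node>;

// Cache key: producer node plus output index, since each output of a
// node may carry its own layout and thus its own transform node.
using Key = std::pair<const Node*, int>;

struct TransformCache {
  std::map<Key, NodePtr> cache;
  int created = 0;  // for illustration: counts actual node creations

  // Return the (single) transform node for this producer output,
  // creating it eagerly on first request and reusing it afterwards.
  NodePtr GetOrCreate(const NodePtr& producer, int index) {
    Key key{producer.get(), index};
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;  // reuse, don't duplicate
    NodePtr tnode = std::make_shared<Node>(Node{"__layout_transform__"});
    ++created;
    cache[key] = tnode;
    return tnode;
  }
};
```

With this shape, two consumers of the same producer output share one transform node instead of each inserting their own.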
tnode->inputs.emplace_back(e);
transformed.emplace(e, nnvm::NodeEntry{tnode, 0, 0});
}
new_node->inputs[idx] = transformed.at(e);
An if/else here might be faster and clearer.
CreateLayoutTransformNode(layout.src, layout.dst);
tnode->inputs.emplace_back(nnvm::NodeEntry{new_node, i, 0});
transformed.emplace(
    nnvm::NodeEntry{n, i, 0}, nnvm::NodeEntry{tnode, 0, 0});
One problem is the version field: if we create the transformed entry eagerly, we have to assume it is 0.
We don't have to cache the `NodeEntry`; only the `Node` is needed. You can still copy the node and copy the entry index over.
One node can have multiple output layouts, so a single node corresponds to multiple transform nodes. I therefore think it should be a mapping from `NodeEntry` to `NodeEntry`.
Never mind, I got it.
* add simplify
* remove simplify in auto complete
It may be useful for some passes to collapse chains of definitions, particularly after other compiler transformations that may reduce or simplify some expressions. This pass will take chains of definitions and replace references to later definitions to the original one. It works by checking `LookupBinding` for each var use-site and replacing the var with its definition if the definition was another var. (Note: This required updating `BlockBuilder` to also update its binding map for `MatchShape` nodes; that was arguably a bug.) Additionally, `MatchShape` bindings where the `LHS` and the `RHS` are guaranteed to match at compile time are canonicalized into ordinary `VarBinding`s.
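The chain-collapsing idea described above can be sketched with stand-in types (the real pass works on Relax vars via `BlockBuilder`'s `LookupBinding`; the `CollapseChain` name and the string-keyed binding map here are illustrative only). A var whose definition is itself another var is followed back until the earliest definition is reached:

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in binding map: entries exist only for vars whose definition is
// itself another var (mirroring the "definition was another var" check).
// Follow the chain until we reach a var with no such binding, i.e. the
// original definition.
std::string CollapseChain(const std::map<std::string, std::string>& bindings,
                          std::string var) {
  auto it = bindings.find(var);
  while (it != bindings.end()) {  // definition is another var: follow it
    var = it->second;
    it = bindings.find(var);
  }
  return var;
}
```

So for a chain `b = a; c = b;`, every use of `c` is rewritten to refer to `a` directly.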
The current codegen output `(half4)*(device uint*)A` creates an `int32` value and then casts it to `half4`, which is not the expected behavior. Since Metal supports the `uchar4` and `char4` types, we can use them directly to solve this problem. (cherry picked from commit 6198c7f)
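The bug can be illustrated in plain C++ (not Metal; `uchar4` is redefined here as a stand-in for Metal's builtin): casting a loaded integer to a vector type converts its *value*, whereas the intended behavior is to reinterpret the four packed bytes as four separate lanes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for Metal's builtin uchar4 vector type.
struct uchar4 { uint8_t x, y, z, w; };

// What the buggy codegen effectively does: load one 32-bit integer and
// convert its value. All lane information is lost in one big scalar.
float value_cast(const uint8_t* a) {
  uint32_t v;
  std::memcpy(&v, a, 4);           // single 32-bit load
  return static_cast<float>(v);    // value conversion, not a lane split
}

// What is actually wanted: reinterpret the same four bytes as four
// independent lanes, which a uchar4 pointer load gives directly.
uchar4 reinterpret_lanes(const uint8_t* a) {
  uchar4 v;
  std::memcpy(&v, a, 4);  // same bits, now viewed as four lanes
  return v;
}
```

Switching the generated pointer type to `device uchar4*` gives the lane-wise view without the intermediate integer value.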
* Merge TL Update
* submodule update
* Re-implement macro with sub function.
* lint fix
* Refactor tensor core memory allocation in MatmulFineGrainScheduler
  - Adjusted the local fragment sizes for tensor core memory allocation in the MatmulFineGrainScheduler class.
  - Updated the allocation sizes for A_local, B_local, and C_local variables based on the new fragment sizes.
  - The changes ensure efficient memory utilization and improve performance.
* Refactor tensor core memory allocation in MatmulDequantizeFineGrainedScheduler
  - Modified the fragment sizes for tensor core memory allocation in the MatmulDequantizeFineGrainedScheduler class.
  - Updated the allocation sizes for A_frag, B_frag, and C_frag variables based on the new fragment sizes.
  - The changes optimize memory usage and enhance the efficiency of the dequantization process.
* Refactor tensor core memory allocation in MatmulDequantizeWeightPropagationScheduler
  - Adjusted the fragment sizes for tensor core memory allocation in the MatmulDequantizeWeightPropagationScheduler class.
  - Updated the allocation sizes for A_frag, B_frag, B_dequantize_frag, and C_frag variables based on the new fragment sizes.
  - The changes improve memory utilization and optimize the weight propagation process.
* Implement int4 tensorcore
* lint fix
* support uint2->uint4 fast dequantize
* Support int4 tensorcore decoding
* lint fix