Switched to scheduleReduction instead of the naive scheduleFusion for reduction fusions;
Updated FusionExecutorCache to reuse kernels with ReductionParamsHash.
Note:
This would fail the CI test because of #273; luckily, the other PR that disabled broadcasting has already been merged, so CI is green.
🐛 Bug
This issue arises when you have two tensors:
A real example:
Without broadcast:
With broadcast, the optimized for loop is gated by the equation:
This will be less than the outer dim size of the output tensor T4. However, given that the nested for-loop should be executed exactly once (ceilDiv(ceilDiv(4, 4), 4) == 1), it requires more predication, as is shown in the case without broadcast.
My guess is that the broadcast tensor's predication is getting chosen for all tensors on the optimized path because it happens to come first.
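The arithmetic behind the "executed exactly once" claim can be checked with a small sketch. It assumes, purely for illustration, that a dimension of size 4 is split twice by a factor of 4; the actual schedule is not shown here, and `ceil_div` and `split_twice_indices` are hypothetical helper names, not part of the codebase. The outermost loop runs once, but the full nest still visits 16 linear indices, so an in-bounds predicate is needed to keep only the 4 valid ones.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, mirroring ceilDiv in the text."""
    return -(-a // b)


def split_twice_indices(extent: int, factor: int) -> list:
    """Linear indices visited by an unpredicated loop nest.

    Assumption: the nest comes from splitting `extent` twice by `factor`,
    which yields an outer loop plus two inner loops of size `factor`.
    """
    outer = ceil_div(ceil_div(extent, factor), factor)
    return [
        (o * factor + m) * factor + i
        for o in range(outer)
        for m in range(factor)
        for i in range(factor)
    ]


extent, factor = 4, 4
outer_trips = ceil_div(ceil_div(extent, factor), factor)
visited = split_twice_indices(extent, factor)

# The predicate: only indices inside the real extent may touch memory.
in_bounds = [idx for idx in visited if idx < extent]

print(outer_trips, len(visited), len(in_bounds))
```

If the predicate taken from a smaller (broadcast) tensor is applied to every tensor in the loop body, the check no longer matches the extent of the output, which is consistent with the guess above.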
To Reproduce
You will need the file changes found in this PR: #272
Test case with broadcast:
Test case without broadcast: