I believe we could at least do block copy operations instead of computing the position of every element individually. Instead of constructing an array of scalar indices, construct an array of ranges, one per element, to handle `inner`, initially building only one of the `outer` copies. That data then just needs to be block-copied to handle `outer`. The case where `inner` is all 1s can be optimized as well.
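To make the idea concrete, here is a rough sketch for the one-dimensional case only, assuming the function under discussion is Julia's `repeat(A; inner, outer)`. The name `repeat_blocks` is hypothetical and not part of Base. It has not been benchmarked, so treat it as an illustration of the approach rather than a proposed implementation.

```julia
# Hypothetical helper (not in Base): block-copy repeat for vectors.
function repeat_blocks(v::AbstractVector{T}; inner::Integer=1, outer::Integer=1) where {T}
    tile_len = length(v) * inner              # length of one outer copy
    out = Vector{T}(undef, tile_len * outer)
    isempty(out) && return out                # nothing to fill

    # Step 1: build a single outer copy, writing one contiguous
    # range of length `inner` per source element.
    for (k, x) in enumerate(v)
        start = (k - 1) * inner + 1
        fill!(view(out, start:start + inner - 1), x)
    end

    # Step 2: replicate that first copy with block copies.
    for j in 1:outer-1
        copyto!(out, j * tile_len + 1, out, 1, tile_len)
    end
    return out
end

repeat_blocks([1, 2, 3]; inner=2, outer=2)
# 12-element Vector: 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3
```

With `inner=1`, step 1 reduces to a plain copy of `v`, which is the special case mentioned above. The N-dimensional case would need ranges along each dimension and is not covered here.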
Writing the loops manually is 100x faster.
Unfortunately I can't think of any easy way to fix the general case that doesn't involve generated functions.