[ARM] support new udot/sdot patterns #7800
Conversation
LGTM pending green.
We should figure out how/where to document this; the work on apps/hannk showed that using the dotprod ops on ARM could be a huge win, but they were tricky to generate (i.e., the IR had to be just right) -- it would be good to document what to do to get these.
Additionally: maybe we should add a dot_prod() intrinsic to IROperator.h, along with widening_mul and friends? Recognizing the patterns is highly desirable of course, but having something that means "always use the best ops for this specifically, regardless of architecture" seems like it could be useful in pathological cases.
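To make that suggestion concrete, here is a purely hypothetical sketch of what such a declaration might look like. Nothing like this exists in IROperator.h today; the name, signature, and `factor` parameter are all invented for illustration:

```cpp
// HYPOTHETICAL -- not part of Halide today. One possible shape for the
// suggested helper, sitting next to widening_mul and friends in IROperator.h.
namespace Halide {
// Accumulate into `init` the sums of pairwise products of the lanes of
// `a` and `b`, reduced in groups of `factor` (e.g. 4 for ARM udot/sdot,
// 2 for x86-style pmaddwd). Widening behavior would mirror widening_mul.
Expr dot_product(Expr init, Expr a, Expr b, int factor);
}  // namespace Halide
```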
@steven-johnson I completely agree - unfortunately, it's often quite difficult to express exactly the pattern that will make it through the simplifier and trigger these instructions (that's why I've included so many pattern variants). @abadams and I have discussed having a strong normalization pass to ease the job of the pattern matcher (and of the programmer aiming for the dot product instruction). I think this is something we need to do (but I don't currently have bandwidth for it). With regard to a new intrinsic - the main difficulty there is that we need a variadic dot product, and there isn't good consistency across backends (e.g., ARM has a 4-way matching-sign dot product, x86 has a 2-way mixed-sign dot product, HVX has many). I worry that an intrinsic like that might be too hard to handle across backends.
Yeah, I hear you, but telling people that they have to look at the generated assembly code to verify they got it right is an unreasonably onerous burden.
That's completely fair! I think the best solution is a powerful normalizer, but I could be convinced about the intrinsic.
Dot product instructions reduce a vector horizontally, and our front-end language doesn't have vectors, so there's no intrinsic we could add that would be guaranteed to hit them in the way you want. We do have vectors in the scheduling language, so the way to get a dot product instruction for sure is by using atomic().vectorize(some_rvar). The kind of dot product AJ is targeting here is an opportunistic instruction selection trick where you save a few instructions by interleaving four different widening multiply-adds and using udot instead of running them separately. Whether or not it's a win is very, very architecture- and type-dependent.
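For reference, a minimal sketch of that scheduling approach, assuming an AArch64 target with the arm_dot_prod feature. The function, buffers, and sizes are illustrative, and whether udot is actually selected still depends on the pattern surviving lowering:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Illustrative names only, not taken from this PR.
    ImageParam a(UInt(8), 2, "a"), b(UInt(8), 2, "b");
    Var x("x");
    RDom r(0, 16, "r");

    Func f("f");
    f(x) = cast<uint32_t>(0);
    // Widen to 32 bits before multiplying so the accumulator is u32.
    f(x) += cast<uint32_t>(a(r, x)) * cast<uint32_t>(b(r, x));

    // atomic() asserts the reduction can be reordered, which allows
    // vectorizing across the reduction variable itself; that produces the
    // horizontal VectorReduce that can lower to a dot product instruction.
    // In practice you would typically also vectorize the pure var x.
    f.update().atomic().vectorize(r, 4);

    // Inspect the generated assembly to confirm udot was actually used.
    f.compile_to_assembly("f.s", {a, b}, "f",
                          Target("arm-64-linux-arm_dot_prod"));
    return 0;
}
```

Grepping the emitted f.s for udot is the quickest way to confirm it worked, which ties back to the point above about having to inspect assembly.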
Failures appear unrelated.
udot and sdot can still be used even if we need to interleave the arguments (and this is faster than a string of smulls/saddws). This PR adds those patterns + tests (and a few fly-by FindIntrinsics fixes).