Description
reproducer: https://gcc.godbolt.org/z/xsTP44xvn
The two functions compute the same result, a 64x64-to-128-bit unsigned multiply. The portable version (which avoids __int128) does not fold to good code like the non-portable one on either x86 or aarch64.
The portable version is modeled on the usual by-parts decomposition of a 64x64 multiply into 32-to-64-bit multiplies.
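Roughly, with illustrative names (a sketch; the exact functions are in the Godbolt reproducer above):

```c
#include <stdint.h>

// Non-portable: relies on the compiler's 128-bit integer extension.
uint64_t umulh_int128(uint64_t a, uint64_t b) {
  return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

// Portable: builds the high 64 bits of the 128-bit product from four
// 32-to-64-bit partial products (the classic by-parts scheme).
uint64_t umulh_portable(uint64_t a, uint64_t b) {
  uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
  uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

  uint64_t p0 = a_lo * b_lo;  // contributes bits 0..63
  uint64_t p1 = a_lo * b_hi;  // contributes bits 32..95
  uint64_t p2 = a_hi * b_lo;  // contributes bits 32..95
  uint64_t p3 = a_hi * b_hi;  // contributes bits 64..127

  // Carry out of bits 32..63 when the two middle terms are summed in.
  uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;
  return p3 + (p1 >> 32) + (p2 >> 32) + carry;
}
```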
Since mul and umulh are cheap, there is a special advantage to recognizing and simplifying portable code that performs these operations by parts using 32-to-64-bit multiplies.
For comparison, similar simplifications target rbit and ror.
https://gcc.godbolt.org/z/Ghb3rfP5W
popcnt is likewise combined into the intrinsic from portable code.
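For instance, a classic portable popcount along these lines is already folded to llvm.ctpop (a sketch of the idiom, not necessarily the exact code behind the link):

```c
#include <stdint.h>

// Classic bit-parallel population count; LLVM's popcount recognizer
// (in AggressiveInstCombine) folds this whole nest into llvm.ctpop.i64.
int popcount64(uint64_t x) {
  x -= (x >> 1) & 0x5555555555555555ULL;                                // 2-bit sums
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); // 4-bit sums
  x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // 8-bit sums
  return (int)((x * 0x0101010101010101ULL) >> 56);                      // sum the bytes
}
```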
The existing implementations are in places like InstCombine/InstCombineAndOrXor.cpp and AggressiveInstCombine/AggressiveInstCombine.cpp.
Handling the present use case would require recognizing that a 128-bit multiply is happening, based only on observations of 64-bit values. Both 64-bit halves, built up from 32-bit temporaries, would be replaced by a single full 128-bit product, which would then be split up again (or not, if only one half is used).
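For the case where both halves are consumed, the desired end state would be roughly this (hypothetical function, for illustration):

```c
#include <stdint.h>

// After the fold: one full 128-bit product, split back into the two
// 64-bit values the surrounding code consumes.
void mul64x64_full(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
  unsigned __int128 p = (unsigned __int128)a * b;
  *hi = (uint64_t)(p >> 64);
  *lo = (uint64_t)p;
}
```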
It seems to me that the "bit provenance" algorithm which recognizes bit-reverses might possibly be adaptable into a "scaled term" provenance algorithm, which would enable a pattern matcher to verify that a nest of shift/multiply/mask nodes actually adds up to 128 bits of product.
This is my first bug report. While I realize that "missed optimizations" are not really bugs, I did not find a way in the user guide to record suggestions like this. I hope this is actionable, perhaps after being moved to whatever bucket such things should actually go into.