Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cranelift: Optimize __multi3-style multiplications #8653

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jameysharp
Copy link
Contributor

LLVM's __multi3 function works by splitting a wide multiplication into several narrower ones. This optimization recognizes the algebraic identities involved and merges them back into the original wide multiply.

This is not yet done but illustrates how part of the optimization can work, at least.

Currently, the lower half of the result is optimized into a single imul instruction, but most of the intermediate values that are optimized away there are still used in computing the upper half, so elaboration brings them back later.

Fixes #4077

LLVM's `__multi3` function works by splitting a wide multiplication into
several narrower ones. This optimization recognizes the algebraic
identities involved and merges them back into the original wide
multiply.

This is not yet done but illustrates how part of the optimization can
work, at least.

Currently, the lower half of the result is optimized into a single
`imul` instruction, but most of the intermediate values that are
optimized away there are still used in computing the upper half, so
elaboration brings them back later.

Fixes bytecodealliance#4077
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language labels May 19, 2024
Copy link

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@scottmcm
Copy link
Contributor

I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger muls back to smaller muls and mulhis. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.

@jameysharp
Copy link
Contributor Author

I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger muls back to smaller muls and mulhis. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.

Yeah, this definitely felt similar to that, so I placed my new rules next to those mulhi rules. I don't see a way to simplify these rules in terms of those, though. The hard part is recognizing that four multiplies, two adds with overflow checks, and some other instructions, all together are equivalent to a single mulhi. I suspect that by the time we can recognize all those pieces, it's easier to just emit the umulhi instruction than it is to emit a couple uextends, an imul, and an ireduce. But if you can see a way to do it more simply I'd love that!

Thinking about this comment, though, led me to realize that once we can do that, ISLE is capable enough today to transform a sequence like this:

i = umulhi a_lo, b_lo
j = imul a_lo, b_hi
k = imul a_hi, b_lo
l = imul a_hi, b_hi
iadd i, j, k, l

into something like this:

a = iconcat a_hi, a_lo
b = iconcat b_hi, b_lo
x = imul a, b
y = ushr x, width/2
ireduce y

If our mid-end rules supported producing instructions with multiple results, the ushr/ireduce pair would be better as isplit, since isplit and iconcat produce no instructions for 128-bit values.

On x86 at least, an imul.i128 currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.

@afonso360
Copy link
Contributor

On x86 at least, an imul.i128 currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.

It might not be an improvement by itself, but if we recognize some other i128 ops we can probably start applying some optimization rules at the i128 level which would be really neat. For example, we could now recognize (imul.i128 x (iconst 2)) and turn it into a (iadd.i128 x x) which would be hard to do with the other sequences.

@jameysharp
Copy link
Contributor Author

Good point! Except we currently can't deal with 128-bit constants, so rules like imul x, 2 won't actually fire. We should figure out how to fix that too 😅

The other thing we can't currently do is notice that this sequence has a redundant multiply:

a = iconcat a_hi, a_lo
b = iconcat b_hi, b_lo
x = imul a, b
y = imul a_lo, b_lo

@scottmcm
Copy link
Contributor

scottmcm commented May 22, 2024

@jameysharp
Copy link
Contributor Author

Since there's no 128-bit iconst it's imul.i128 x (uextend i128 (iconst 2)), but with ... hopefully it'll already work 🤞

Ooh, I hadn't looked at those carefully. That's awesome that iconst_u special cases i128! However I see that it doesn't do that for the extractor, only the constructor. Right now it would need an external extractor to match an optional uextend, which is yet more motivation for something like or-patterns (#6128 or #8599).

@scottmcm
Copy link
Contributor

scottmcm commented May 22, 2024

For extend we don't need to wait on or patterns, because it could use

(decl uextend_maybe (Type Value) Value)
(extractor (uextend_maybe ty val) (uextend_maybe_etor ty val))
(rule 0 (uextend_maybe ty val) (uextend ty val))
(rule 1 (uextend_maybe ty val@(value_type ty)) val)

which is a special extractor added in #7710 for exactly the "there might be an extend" problem :)

(Though I'd love to have or patterns to stop needing custom extractors for things like this!)

EDIT: landed in #8686

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language
Projects
None yet
Development

Successfully merging this pull request may close these issues.

__multi3 optimization
3 participants