cranelift: Optimize __multi3-style multiplications #8653

jameysharp · 2024-05-19T21:17:55Z

LLVM's __multi3 function works by splitting a wide multiplication into several narrower ones. This optimization recognizes the algebraic identities involved and merges them back into the original wide multiply.

This is not yet done but illustrates how part of the optimization can work, at least.

Currently, the lower half of the result is optimized into a single imul instruction, but most of the intermediate values that are optimized away there are still used in computing the upper half, so elaboration brings them back later.

Fixes #4077

LLVM's `__multi3` function works by splitting a wide multiplication into several narrower ones. This optimization recognizes the algebraic identities involved and merges them back into the original wide multiply. This is not yet done but illustrates how part of the optimization can work, at least. Currently, the lower half of the result is optimized into a single `imul` instruction, but most of the intermediate values that are optimized away there are still used in computing the upper half, so elaboration brings them back later. Fixes bytecodealliance#4077

github-actions · 2024-05-19T23:44:33Z

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

scottmcm · 2024-05-22T04:14:45Z

I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger muls back to smaller muls and mulhis. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.

jameysharp · 2024-05-22T06:36:32Z

I don't know if it'd help at all, but cc https://github.com/bytecodealliance/wasmtime/pull/7719/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01 which will convert larger muls back to smaller muls and mulhis. So I think if you can recognize the open-coded multiplication and turn it into a larger cranelift multiplication, the rest might just work.

Yeah, this definitely felt similar to that, so I placed my new rules next to those mulhi rules. I don't see a way to simplify these rules in terms of those, though. The hard part is recognizing that four multiplies, two adds with overflow checks, and some other instructions, all together are equivalent to a single mulhi. I suspect that by the time we can recognize all those pieces, it's easier to just emit the umulhi instruction than it is to emit a couple uextends, an imul, and an ireduce. But if you can see a way to do it more simply I'd love that!

Thinking about this comment, though, led me to realize that once we can do that, ISLE is capable enough today to transform a sequence like this:

i = umulhi a_lo, b_lo
j = imul a_lo, b_hi
k = imul a_hi, b_lo
l = imul a_hi, b_hi
iadd i, j, k, l

into something like this:

a = iconcat a_hi, a_lo
b = iconcat b_hi, b_lo
x = imul a, b
y = ushr x, width/2
ireduce y

If our mid-end rules supported producing instructions with multiple results, the ushr/ireduce pair would be better as isplit, since isplit and iconcat produce no instructions for 128-bit values.

On x86 at least, an imul.i128 currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.

afonso360 · 2024-05-22T08:21:08Z

On x86 at least, an imul.i128 currently lowers to roughly the first sequence above so I'm not sure this is an improvement, but I hadn't previously been sure it was even possible.

It might not be an improvement by itself, but if we recognize some other i128 ops we can probably start applying some optimization rules at the i128 level which would be really neat. For example, we could now recognize (imul.i128 x (iconst 2)) and turn it into a (iadd.i128 x x) which would be hard to do with the other sequences.

jameysharp · 2024-05-22T15:32:39Z

Good point! Except we currently can't deal with 128-bit constants, so rules like imul x, 2 won't actually fire. We should figure out how to fix that too 😅

The other thing we can't currently do is notice that this sequence has a redundant multiply:

a = iconcat a_hi, a_lo
b = iconcat b_hi, b_lo
x = imul a, b
y = imul a_lo, b_lo

scottmcm · 2024-05-22T15:35:03Z

Since there's no 128-bit iconst it's imul.i128 x (uextend i128 (iconst 2)), but with https://github.com/bytecodealliance/wasmtime/pull/7693/files#diff-2041f67049d5ac3d8f62ea91d3cb45cdb8608d5f5cdab988731ae2addf90ef01R92 and https://github.com/bytecodealliance/wasmtime/pull/7670/files#diff-bf1f1ea991da38588a1785fc7012072e7a685411ea6a262c1fab18379411ef50R94 hopefully it actually already works 🤞

jameysharp · 2024-05-22T18:00:27Z

Since there's no 128-bit iconst it's imul.i128 x (uextend i128 (iconst 2)), but with ... hopefully it'll already work 🤞

Ooh, I hadn't looked at those carefully. That's awesome that iconst_u special cases i128! However I see that it doesn't do that for the extractor, only the constructor. Right now it would need an external extractor to match an optional uextend, which is yet more motivation for something like or-patterns (#6128 or #8599).

scottmcm · 2024-05-22T18:11:13Z

For extend we don't need to wait on or patterns, because it could use

wasmtime/cranelift/codegen/src/prelude_opt.isle

Lines 113 to 116 in f1fe2af

    
           (decl uextend_maybe (Type Value) Value) 
        
           (extractor (uextend_maybe ty val) (uextend_maybe_etor ty val)) 
        
           (rule 0 (uextend_maybe ty val) (uextend ty val)) 
        
           (rule 1 (uextend_maybe ty val@(value_type ty)) val)

which is a special extractor added in #7710 for exactly the "there might be an extend" problem :)

(Though I'd love to have or patterns to stop needing custom extractors for things like this!)

EDIT: landed in #8686

github-actions bot added cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language labels May 19, 2024

scottmcm mentioned this pull request May 23, 2024

Use [su]extend_maybe in the iconst_[su] extractors #8686

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cranelift: Optimize __multi3-style multiplications #8653

cranelift: Optimize __multi3-style multiplications #8653

jameysharp commented May 19, 2024

github-actions bot commented May 19, 2024

scottmcm commented May 22, 2024

jameysharp commented May 22, 2024

afonso360 commented May 22, 2024

jameysharp commented May 22, 2024

scottmcm commented May 22, 2024 •

edited

Loading

jameysharp commented May 22, 2024

scottmcm commented May 22, 2024 •

edited

Loading

cranelift: Optimize __multi3-style multiplications #8653

Are you sure you want to change the base?

cranelift: Optimize __multi3-style multiplications #8653

Conversation

jameysharp commented May 19, 2024

github-actions bot commented May 19, 2024

Subscribe to Label Action

scottmcm commented May 22, 2024

jameysharp commented May 22, 2024

afonso360 commented May 22, 2024

jameysharp commented May 22, 2024

scottmcm commented May 22, 2024 • edited Loading

jameysharp commented May 22, 2024

scottmcm commented May 22, 2024 • edited Loading

scottmcm commented May 22, 2024 •

edited

Loading

scottmcm commented May 22, 2024 •

edited

Loading