__multi3 optimization #4077
Hi @shamatar -- thanks for raising this issue! I agree that the lack of "architecture-specific acceleration" for wide arithmetic lowered into Wasm bytecode is suboptimal and annoying. On the other hand, somehow recognizing that a Wasm module's internal function happens to be the `__multi3` helper would be fragile.

I think that there is a way around this though: we could pattern-match the actual operations and use the faster architecture-specific lowering when possible. In this specific case, a pattern could match whatever subgraphs correspond to the fast x64 instructions.

The place to put such lowerings would be in the ISLE DSL (see e.g. cranelift/codegen/src/isa/x64/lower.isle); we'd be happy to help if you want to try to implement something here.
Hey @cfallin, thank you for the detailed response. I was only mentioning `__multi3` as an example. I'm not sure that the file you mentioned is the right place: ideally one would recognize `__multi3` and emit just `imul x y` for it, plus inline that into the callsite. Maybe it also requires more than one pass, but the first step is to get to just `imul x y`.
That's true; one way this could work in the future is to recognize the whole function and replace it with an optimized body.

Maybe the right way to start is to draw the dataflow graph of the bytecode you listed above, and find subgraphs that correspond to what the fast x86-64 instructions compute.

Eventually we will also have an inliner (it's really an orthogonal improvement), and when we do, it could combine with good-enough pattern matching on the function body here to produce an inlined sequence of instructions. In other words, getting to a tight sequence of instructions, and inlining that at the callsite, are two separate problems, so let's solve each separately.

cc @abrown and @jlb6740 for thoughts on the x86-64 lowering here...
I'll make a dataflow graph, just need some time.
More precisely, there is a rule that would match a `u64 * u64` multiplication, but I'm also not sure whether that rule is about getting the low bits only (half-width mul) or all of them (full width), as it doesn't reflect anything about the return type.
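For what it's worth, here is a minimal Rust sketch of the distinction in question (just the arithmetic a rule would have to describe, not an actual ISLE rule): a half-width multiply keeps only the low 64 bits of the product, while a full-width one also needs the high 64 bits.

```rust
/// Half-width: only the low 64 bits of the product survive
/// (this is what a same-width integer multiply computes).
fn mul_half(x: u64, y: u64) -> u64 {
    x.wrapping_mul(y)
}

/// Full-width: the whole 128-bit product; the second element is the
/// high half that a half-width rule would silently discard.
fn mul_full(x: u64, y: u64) -> (u64, u64) {
    let p = (x as u128) * (y as u128);
    (p as u64, (p >> 64) as u64)
}
```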
(Catching up on this thread...) For context, the code behind the WAT that you posted is probably coming from somewhere like this in LLVM: https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/multi3.c. There is quite a bit of shifting and masking there which, as we can see in the WAT, gets translated directly to WebAssembly. I'd be interested to understand what library is being used and how you're using it (e.g., how come `__multi3` isn't inlined?).

To optimize this in Wasmtime, I suspect the bottom-up approach (i.e., attempting to get rid of the extra shifting and masking) is going to be "good enough" relative to the code sequence currently emitted by Clang in native code (@cfallin has that) and will be less brittle than trying to match the entire function or the called name. But another option might be to attempt to optimize this at the LLVM layer: perhaps it does not need to be as literal about how it translates the C code I linked to into the WAT @shamatar posted.
I've made a minimal example of the typical multiply-add-carry routine; both the commented code and the version using more explicit operations produce the same assembly. This is a typical way to do big-integer math on a CPU. SIMD optimizations are possible, but they are a separate art due to non-carry SIMD, non-widening SIMD, and CPU frequency quirks when SIMD is used, e.g. on Intel. So my example is an "average case", and solving exactly this part would be the most beneficial. (I don't actually need a speedup here for any production code, but it's a nice free-time task.)
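A minimal Rust sketch of such a multiply-add-carry routine (the name and exact shape are assumptions, not the original example from this comment): it computes a*b + c + carry and returns the high and low 64-bit halves; the intermediate sum always fits in a u128, so it cannot overflow.

```rust
/// Multiply-add-carry: a*b + c + carry -> (high, low).
/// u64::MAX * u64::MAX + 2 * u64::MAX == u128::MAX, so the sum never overflows.
#[inline]
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128) + (c as u128) + (carry as u128);
    ((wide >> 64) as u64, wide as u64)
}
```

Compiled natively this becomes a handful of instructions, while on wasm32 the 128-bit multiply typically turns into a call to `__multi3`, which is exactly the barrier discussed in this thread.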
Note: it's actually only possible to replace the way `__multi3` itself is implemented.
I should add that even though it may be possible to fine-tune LLVM too (even further from my world), it's still not possible to generate Wasm code that would be better than the internals of `__multi3`.
It's potentially both! We're still in the midst of translating all of the basic lowering into ISLE, so that is most of what we have. But the design goal is absolutely to allow arbitrarily complex patterns, and we hope to grow the library of patterns we match and optimize over time. I took a look at the gcc output on x86-64 for a 128-bit multiply, so it should be possible to do a lot better here, and without recent extensions like mulx/adx.
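For comparison, here is a sketch of the kind of source one might feed to a native compiler to see that output (an assumption about the setup, not taken from this comment):

```rust
/// A full 128x128 -> 128 multiply. On x86-64, gcc/LLVM lower this to roughly
/// one widening `mul` plus two `imul`s and a couple of `add`s -- no mulx/adx needed.
pub fn mul_u128(a: u128, b: u128) -> u128 {
    a.wrapping_mul(b)
}
```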
Yes, it's kind of possible due to the two carry chains in my example, so my example's "optimal" code would interleave the multiplications with two independent carry chains.
Since Cranelift supports i128, maybe we could perform CLIF-to-CLIF transforms to generate an i128 mul (if a backend wants that). That way we can avoid each backend having to add complicated matching patterns. Plus, at least for aarch64, there's already a lot of existing support for i128.
I'm still trying to understand pattern matching and get it (even if fragile) to capture the full body of `__multi3`.
I've tried to insert an optimization into Cranelift directly. I should be able to match the full body of the generated `__multi3`.
Made an initial functional PR; it folds the long shift-and-mask multiplication sequence into a wide multiply. A very naive benchmark gives around a 30% speedup.
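As a rough illustration only (a hypothetical harness, not the benchmark actually behind the 30% figure), a "very naive" measurement could look like this:

```rust
use std::time::Instant;

fn main() {
    let mut acc: u128 = 0;
    let mut x: u64 = 0x9e37_79b9_7f4a_7c15;
    let start = Instant::now();
    for _ in 0..10_000_000u64 {
        // 64x64 -> 128 multiply: the kind of operation that goes through __multi3 on wasm32.
        acc = acc.wrapping_add((x as u128) * (x as u128));
        // Cheap LCG step so the multiplies aren't constant-folded away.
        x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    }
    println!("acc = {acc}, elapsed = {:?}", start.elapsed());
}
```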
A few comments: is there a pass that removes instructions whose results become unused? And is there any way to inline the optimized function into its callers?
Yes, the DCE pass. Note that lowering to the backend-specific IR already does this implicitly. Also, I think you shouldn't replace the instructions with nops: it results in invalid CLIF IR (as nop doesn't return any values), and it is incorrect if any of the instructions' return values are used elsewhere, for example because you matched a function that looks somewhat like __multi3 but not exactly.
No, there isn't. It is also non-trivial to implement, as currently every function is independently lowered to CLIF IR, optimized, and codegened to native code. Wasmtime even compiles multiple functions in parallel.
Thanks @bjorn3. I've analyzed my NOPing approach and it is indeed overzealous. As for inlining -- are there any plans to add it? It would require some form of "synchronization point" after CLIF generation; inlining itself can then also be parallelized, followed again by parallel native codegen.
I've been investigating what we can do to improve the performance of `__multi3`.

Part of the solution proposed in the earlier discussion here is function inlining. @elliottt and I discussed this some yesterday, and what I'd like to see is a separate tool that supports an inlining transformation on core WebAssembly. I'd then recommend that people use that to eliminate the optimization barriers around calls to `__multi3`. One difficult problem is that this function returns a two-member struct.

Now, what is this function actually computing, and what patterns can we recognize and transform in the mid-end? Raw CLIF for __multi3, slightly edited
This function implements a 128-bit multiply where the operands and result are stored as pairs of 64-bit integers. It builds up the result, in part, using a series of 32x32->64 multiplies. Viewed as a sequence of 32-bit "digits" (a3 a2 a1 a0 times b3 b2 b1 b0), long multiplication forms each output digit from a column of single-digit products, although each single-digit product may have a carry into the column to its left.
However only a1/a0 and b1/b0 are actually treated this way, to compute the lower half of the result along with part of the upper half.
In Cranelift, we'd want to implement this with a pair of instructions: `imul` for the low 64 bits and `umulhi` for the high 64 bits of that product. The remaining part of the upper half is just the two cross products of the 64-bit halves, added in with ordinary truncating 64-bit multiplies.
There's actually nothing we can improve in this part, as shown in @cfallin's gcc-generated assembly listing, which has two `imul`s for the cross products.

So back to the lower-half 64x64->128 multiply, performed 32 bits at a time. Our current optimizer produces a sequence for the low 64 bits which is equivalent to a single wrapping `imul`.
Then I think this part (which reuses some of the intermediate values above) is equivalent to either a `umulhi` or some other way of computing the high half of the product.
If we can get egraph rules to simplify these two sequences to the upper and lower halves of a 64-bit multiply, then once we also solve #5623 we'll get the same sequence of instructions out that gcc produces for 128-bit multiplies. Alternatively, if we could match on multiple results at once, then we could write a rule that matches this combination of three multiplies all at once.
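To make the intended equivalence concrete, here is a Rust reference model (a sketch under the assumption that it mirrors the CLIF sequences above, not the actual egraph rules): the 32-bit-limb computation must produce exactly the low and high halves of the 64x64->128 product.

```rust
/// The __multi3-style computation: one 64x64 multiply done with 32-bit limbs.
fn mul_64x64_limbs(a: u64, b: u64) -> (u64, u64) {
    let (a0, a1) = (a & 0xffff_ffff, a >> 32);
    let (b0, b1) = (b & 0xffff_ffff, b >> 32);
    let p00 = a0 * b0; // each 32x32 product fits in 64 bits, no overflow
    let p01 = a0 * b1;
    let p10 = a1 * b0;
    let p11 = a1 * b1;
    let mid = (p00 >> 32) + (p01 & 0xffff_ffff) + (p10 & 0xffff_ffff);
    let lo = (p00 & 0xffff_ffff) | (mid << 32);              // what a single wrapping `imul` gives
    let hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);  // what `umulhi` gives
    (lo, hi)
}

/// The target form: one widening multiply, split into its two 64-bit halves.
fn mul_64x64_wide(a: u64, b: u64) -> (u64, u64) {
    let p = (a as u128) * (b as u128);
    (p as u64, (p >> 64) as u64)
}
```

A quick sanity check such as `assert_eq!(mul_64x64_limbs(a, b), mul_64x64_wide(a, b))` over a few values is handy when experimenting with rewrite rules of this shape.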
LLVM's `__multi3` function works by splitting a wide multiplication into several narrower ones. This optimization recognizes the algebraic identities involved and merges them back into the original wide multiply. This is not yet done but illustrates how part of the optimization can work, at least. Currently, the lower half of the result is optimized into a single `imul` instruction, but most of the intermediate values that are optimized away there are still used in computing the upper half, so elaboration brings them back later. Fixes bytecodealliance#4077
Feature

Implement an optimization pass that would eliminate the `__multi3` function from the WASM binary during JIT by replacing it with ISA-specific sequences (mainly for `x86_64` and `arm64`), and then inline such sequences into callsites, which would allow further optimizations.

Benefit

A lot of code dealing with cryptography would benefit from faster full-width `u64` multiplications where such `__multi3` calls arise.

Implementation

If someone would give a few hints about where to start, I'd try to implement it myself.

Alternatives

Not that I'm aware of. Patching in a call to some native library function is a huge overhead for modern CPUs (4 cycles on `x86_64` for e.g. `mulx` or `mul`), and while it would most likely be faster, it's still far from the optimal case on a hot path.

As an example, a simple multiply-add-carry function like `a*b + c + carry -> (high, low)` that accumulates into `u128` without overflows compiles down to the listing below, and it can be a good test subject (transformed from `wasm` into `wat`, maybe not the most readable).