Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

128-bit arithmetic #1522

Open
jameysharp opened this issue Jul 22, 2024 · 26 comments
Open

128-bit arithmetic #1522

jameysharp opened this issue Jul 22, 2024 · 26 comments
Labels

Comments

@jameysharp
Copy link

Summary

Provide limited instructions for 128-bit integer arithmetic.

Motivation

Two use cases of 128-bit arithmetic known are:

  • Arbitrary-precision integer arithmetic with 64-bit limbs. Used among other places in the implementation of public-key cryptography such as RSA.
  • Efficiently checking for overflow on 64-bit arithmetic.

Some benchmarks are below but currently as-compiled to wasm today these use cases are 2-7x slower than native code today. In native code 128-bit operations use tight control over the native flags register to achieve much better performance than wasm is able to express today.

The major goal of this proposal is to close the peformance gap between native and wasm in these use cases where 128-bit arithmetic is in a hot loop.

Proposal

Add the following new instructions with a type signature of [i64 i64 i64 i64] -> [i64 i64]. Each instruction represents 128-bit integers as a pair of i64, but otherwise is identical to the corresponding narrower operation.

  • i64.add128, extended from i64.add
  • i64.sub128, extended from i64.sub
  • i64.mul128, extended from i64.mul

Example

This function does a 64-bit integer add and returns both the 64-bit result and the unsigned carry, roughly equivalent to Rust's u64::overflowing_add:

(module
  (func (param i64 i64) (result i64 i64)
        local.get 0
        i64.const 0
        local.get 1
        i64.const 0
        i64.add128
    )
)

Rationale and Alternatives

We evaluated several alternatives and so far this one appears to be the simplest to specify and implement, while also giving the best real-world performance in our initial benchmarks.

We first considered adding an "overflow flag" output to a 64-bit arithmetic instructions, such as i64.{add,sub,mul}_overflow_{u,s} of type [i64 i64] -> [i64 i32], where the i32 result is a 0/1 boolean indicating whether the addition overflowed. This style of instruction is what native platforms often have for implementing these 128-bit operations so this seemed like a possibility of exposing directly to wasm.

We also considered a "carry flag" on both input and output, such as i64.{add,sub}_with_carry_{u,s} of type [i64 i64 i32] -> [i64 i32]. This is, for example, the adc instruction on x64 and how 128-bit addition is implemented.

We found that it was nontrivial-to-difficult to generate good code from these instructions, as shown in the benchmark results below.

  • A naive implementation of the instructions repeatedly moved the carry flag between a general-purpose register and the flags register. Avoiding this movement would require relatively advanced instruction fusing. Even then, loops typically require instructions that clobber flags, so such optimizations would only apply to straightline code, which we suspect is not the common case for these operations. Our experiments were on x86 but we suspect this observation applies to most current CPU instruction sets.
  • Another difficulty is that these instructions produce multiple results, which we have observed are difficult to integrate into many optimization frameworks. The benchmarks show below that these sorts of optimizations are likely to be crucial for performance, and if not implemented the new instructions are detrimental to performance.

In contrast, with 128-bit operations the need to fuse with other instructions is much lower and it seems that "simply having support" is a pretty good sweet-spot both performance and complexity-wise.

In addition, we considered a "wide multiply" i64.mul_wide_{u,s} with type [i64 i64] -> [i64 i64] that takes two i64 inputs and produces both the low and high halves of the 128-bit result. This is the minimal operation needed for arbitrary-precision multiplication. It is straightforward to build a full 128-bit multiply from such an instruction, and the naive implementation of this instruction performed better than the i64.mul128 instruction benchmarks below. This maps closely to the x64 mul instruction as well as equivalent instructions on AArch64 for example. While this "primitive" worked for multiplication the corresponding primitive for addition/subtraction did not work out (the *_overflow_* instruction above), so we opted to add i64.mul128 instead of this instruction.

Another possible alternative is to add a native i128 type to WebAssembly. Given the relatively niche nature of 128-bit integers, however, coupled with the pretty good code compilers already emit for 128-bit operations in source languages, we see this alternative as far too heavyweight for the benefits that it could bring. By just adding a few specialized instructions for the slow use cases it's possible to get much closer to native performance than wasm is at today.

Benchmark Results

This proposal has been prototyped in LLVM, Wasmtime/Cranelift, and Rust (all just living in forks for now) with the intention of providing guidance of the direction this proposal should go in. To that effect many of the above alternatives, as well as this proposal itself, were all implemented and evaluated.

Each benchmark below frobs various custom flags in LLVM to ensure that only a certain alternative of this proposal is implemented. Output was hand-verified to contain the desired instructions and profiling confirmed that Cranelift was generating the expected instruction sequences.

Benchmarks were collected on Intel(R) Core(TM) i9-14900K on a Linux machine. Other platforms will be tested later if this proposal advances through the CG stages.

Benchmark: addition/subtraction

For this benchmark the Rust num-bigint crate was used. In the Rust source for this benchmark, each pair of limbs is first cast from u64 to u128, then added, and finally split into the low and high u64 halves. In my experience this is a common pattern for arbitrary-precision arithmetic implementations in both C and Rust.

We compiled the benchmark to native x86-64 (to use as a baseline), to current WebAssembly, and to each of three variations on the proposed extension. We compared runtime for each variant of WebAssembly against the native code baseline and report how much slower than native each variant is (lower is better):

  • current wasm - 125%
  • i64.add_overflow_u - 88%
  • i64.add_with_carry_u - 155%
  • i64.add128 (this proposal) - 18%

Here current WebAssembly starts at 125% slower than native and the 128-bit ops in this proposal were able to close that gap to 18% slower than native. Other shapes of this proposal, as naively implemented in LLVM/Cranelift, would require more advanced optimizations to get closer to native performance specifically around management of the flags register.

Benchmark: multiplication

For this benchmark the blind-sig benchmark in the Sightglass repository was selected to test. This benchmarks is bottlenecked on the __multi3 helper in compiler-rt in LLVM which is a library implementation of 128-bit multiplication compiled to WebAssembly.

Benchmarking is performed as above where numbers here are reported as slowdowns relative to native code where lower is better:

  • current wasm - 637%
  • i64.mul_wide_u - 100%
  • i64.mul128 (this proposal) - 125%

Here the 637% slower-than-native gap was closed to ~100%. The alternative instruction i64.mul_wide_u was the best performing but this is because Wasmtime is missing some easy-to-implement optimizations for full 128-bit multiplication (e.g. here the upper halves of the 128-bit numbers are always zero). We're confident with a minor amount of elbow-grease the proposed i64.mul128 instruction can be as good as i64.mul_wide_u.

Open Questions

Are there other use cases where instructions like these would be useful?

Should we introduce an i128 type and require that all integer operations be implemented for it, or stick with the most useful subset and represent values as pairs of i64, as proposed above?

Are there other motivating use cases for *_overflow_* or *_with_carry_* instructions? Ones that perhaps won't require "advanced flag optimizations" to showcase their benefit?

Meta

This proposal is a join effort of @alexcrichton (Fermyon) and @jameysharp (Fastly). We intend to start discussion with this issue and then bring a more complete proposal to a future CG meeting.

This issue is also related to historical discussions such as:

@lpereira
Copy link

lpereira commented Jul 22, 2024

From the get-go, I know of two non-cryptographic uses that might benefit from this proposal:

  1. https://lemire.me/blog/2019/03/19/the-fastest-conventional-random-number-generator-that-can-pass-big-crush/
  2. A 64-bit adaptation of https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/

For both cases above, having a pair of i64s is better than a single i128, as the right-shift-by-64 is the same as "drop the lower part of the result".

@alexcrichton
Copy link

A benchmark of (1) shows that baseline wasm today is 360% slower and with the instructions in this proposal it's only 10% slower than native. For (2) I wasn't able to understand at-a-glance how to get a 128-bit multiply, but given the results seen here if it uses 128-bit multiplies I suspect it'll see good improvements too though.

@lpereira
Copy link

For (2) I wasn't able to understand at-a-glance how to get a 128-bit multiply (...)

Would need to adapt it to reduce a 64-bit value instead. Something along the lines of:

uint64_t reduce64(uint64_t x, uint64_t N) {
  return ((uint128_t) x * (uint128_t) N) >> 64;
}

@alexcrichton
Copy link

Ah I see makes sense! In that case I think it's a similar kernel as to (1) so sounds good 👍

@tlively
Copy link
Member

tlively commented Jul 23, 2024

  • Another difficulty is that these instructions produce multiple results, which we have observed are difficult to integrate into many optimization frameworks

It is a known deficiency of Binaryen that it does not handle multivalue well. I would like to get this fixed, but anything more than incremental improvements would require a fundamental change to Binaryen IR, so it's not a quick fix. Are there other frameworks you have in mind that similarly trip over multivalue results? Also, the proposed instructions also have multiple results; were you observing that they weren't as problematic for some reason?

You mentioned the possibility of adding an i128 type, but I agree that this would be more heavyweight than we want. A similar alternative would be to use the existing v128 type to represent the 128-bit integers. Having the type of the instructions be [v128 v128] -> [v128] seems more elegant to me than [i64 i64 i64 i64] -> [i64 i64]. Have you considered that possibility? If so, what are its downsides?

@jameysharp
Copy link
Author

The reason I didn't spend much time thinking about v128 for this is that a straightforward implementation of that type would always keep its values in vector registers, and on current targets we'd need to move values between there and integer registers to do arithmetic on them. We should add that to the alternatives under consideration but it seemed like a troublesome option to me at a quick glance.

Our mid-end optimization framework in Cranelift currently does not optimize multi-result instructions at all; Alex observed that LLVM has a nice table-driven code generator that only works for single-result instructions and requires writing a bunch of C++ by hand if you want to define optimization rules for multi-result instructions; and I think Alex had mentioned other compilers with similar restrictions too.

Our benchmarks showed that we could get nearly native performance with our proposed form of these instructions despite not implementing any optimizations, which is why we think they're a good choice even though they still have multiple results.

@alexcrichton
Copy link

(Jamey and I raced so consider this more background, but not adding anything new, over what he already mentioned)


Are there other frameworks you have in mind that similarly trip over multivalue results?

I was additionally considering Cranelift and LLVM here. I can clarify and say that supporting multivalue isn't necessarily the hard part, most frameworks I've seen support it one way or another. The trickiness is specifically around optimizing instructions that produce multiple values.

For LLVM I'm not deeply familiar with backends and I have no doubt that LLVM optimizes very well, but my impression is that multi-value instructions are somewhat non-standard in the sense that writing lowerings for multi-value instructions requires writing C++ instead of using *.td files:

Overall, there is no way to define or match SelectionDAG nodes that define multiple values (e.g. SMUL_LOHI, LOAD, CALL, etc). This is the biggest reason that you currently still have to write custom C++ code for your instruction selector

For Cranelift it has s-expression-based lowering backends such as this rule on x64 which avoids materializing flags into a register and then testing them for a conditional jump (contrasted with the fully general rule below which does just that). This instruction matching though only works for single-result instructions, so I was finding it difficult to figure out how to pattern-match this in rules to figure out how to optimize well. Cranelift, like LLVM, supports multi-value well enough but unlike LLVM doesn't have much support for optimizing the results.

Also, the proposed instructions also have multiple results; were you observing that they weren't as problematic for some reason?

The main issue was optimizing instructions with multiple results. Initial findings have shown that the *_overflow_* and *_add_with_carry_* variants of instructions require optimizations to avoid materializing the flags register. These optimizations were difficult to implement in Cranelift and a hypothesis is that they'll be somewhat nontrivial to implement in other compilers. The other problem was that performance was worse than it was today already if the optimizations weren't implemented, so wasm compilers will largely be required to implement these (what's predicted to be) nontrivial optimizations.

Contrasted with the proposed 128-bit instructions optimizations around multiple results weren't necessary. The "naive" lowerings of 128-bit operations were sufficient to get drastically better performance over baseline wasm today. While optimizations are still possible and compilers may also wish to implement, it doesn't seem to be as critical for good performance.

@dtig
Copy link
Member

dtig commented Jul 24, 2024

This would be a useful addition to the spec, and one we've had requests for but haven't prioritized in the past.

Do you have a sense for what the performance would look like for a naive implementation of these operations on a 32-bit platform?

@alexcrichton
Copy link

Not at this time, the current runtime used to benchmark, Wasmtime, doesn't have support for any 32-bit platform so at this point it's just hoped that implementing 128-bit ops in the host wouldn't be slower than in-wasm. I think that's definitely something to track and follow-up on if this proposal goes through the phases.

@jameysharp
Copy link
Author

As Alex said that's more difficult for us to prototype since Wasmtime/Cranelift does not currently support any 32-bit platforms, but we can look at what native code a toolchain like Rust/LLVM produces for equivalent operations today. I expect even unoptimized implementations of this proposal to be able to emit the same instruction sequences.

With rustc -O --target i686-unknown-linux-musl, addition on u128 is an add followed by three adc instructions. By comparison, on x86-64 it's an add followed by one adc instruction. So it is twice as many instructions which operate on half the width, which is the best we could hope for.

Multiplication is more complicated, naturally: on x86-64 that toolchain is producing 3 multiplications and 2 additions, while on i686 it's 10 multiplications and 14 additions. I believe both cases are optimal for general long multiplication.

What about optimizations for special cases? For example, multiplying two operands where the upper 64-bits of both are known to be zero, as would be handled by the more specialized i64.mul_wide_u alternative to this proposal. x86-64 can implement that case with a single multiply instruction, eliminating 80% of the instructions; i686 needs 4 multiplies and 6 additions, eliminating about 60% of the instructions. Implementing this optimization for the instruction we're proposing should be no more difficult on 32-bit targets than on 64-bit targets, though.

I think the naive implementation that we benchmarked for i64.add_overflow_u and i64.add_with_carry_u would probably do better on i686 than our benchmarks showed them doing on x86-64, because I think without optimizations i64.add_with_carry_u would end up lowering to four additions, just like i64.add128 would, and i64.add_overflow_u would only need three additions. So on 32-bit targets I think we'd see the alternatives always being at least as good as this proposal, in those special cases where they're applicable. However, I think our proposal is still a better option, both to avoid hurting 64-bit targets and because our proposal is more generally applicable.

This answer is hand-wavy but I hope it helps with building some intuition for how the performance of this proposal might turn out on 32-bit targets.

@dtig
Copy link
Member

dtig commented Jul 24, 2024

Thanks that helps. The thing I don't have an intuition for is in the context of the benchmark shared, on platforms where the 64-bit values will have to be lowered to 32-bit registers, and for multiple values - how much of a factor is register pressure and how does it contribute to the performance? I do agree with you though that having the abstraction is better than not.

@dschuff
Copy link
Member

dschuff commented Jul 24, 2024

Thanks for working on this, I am happy to see this addition as well!
Another use case for add-with-overflow is the one that motivates the checked arithmetic builtins in C compilers: namely the ability to guard against overflow for application integrity or security without invoking undefined behavior.
Today you can of course implement this overflow checking for 32-bit types e.g. by extending to 64-bit, doing the operation, and then checking the result; but IIUC it requires more than one branch for the signed case, implementing it for 64 bit types requires still more branches.

It seems this use case might actually be a better match for the more direct "overflow flag" approach, since in most cases you really do just immediately want the value of the flag to branch on rather than to feed it into another arithmetic operation.
Having the 128-bit math form would be an improvement for the 64-bit case, making it more or less equivalent to the 32-bit case today. But since the overflow flag is directly available under the hood, I suspect it wouldn't be optimal.

@jameysharp
Copy link
Author

@dtig, I think it'll be difficult to compare register pressure between 32-bit and 64-bit targets. If the source program needs a 128-bit multiply, that's going to require a certain minimum number of registers on a given machine, no matter which instructions we offer and regardless of what optimizations a runtime provides. Register pressure is only relevant if we choose an alternative that makes it harder to reach that minimum number of registers, such as if we offer an instruction that is a poor fit for what programmers are trying to do.

Currently, for 128-bit multiplies LLVM produces a function call into compiler-rt (so caller-saved registers have to be spilled to memory) that does several 32-bit multiplies and several 64-bit multiplies, and passes results back through linear memory on the shadow stack. Compared to that, any alternative that keeps all values in the minimum number of registers is going to perform significantly better, on every platform whether it's 32-bit or not.

Ideally we'd compare the code from a reasonable wasm runtime against what the native toolchain produces for the same architecture. Alex and I aren't prepared to do that for 32-bit targets right now, but we'd love help with that evaluation and we'll dig into this more as the proposal develops.

@dschuff, I've been fascinated by what code LLVM's wasm backend currently generates for getting an overflow flag for both i32 and i64 addition. It doesn't widen, and it doesn't branch either, assuming you can implement i64.lt_{s,u} without branching on your target platform, as x86 can. For unsigned addition overflow, it just checks if the result is less than one of the operands. For signed addition, it checks that as well as whether the other operand was negative, and xors the two checks.

If we add the "overflow flag" alternative then we can reduce both patterns to a single instruction each, which might be good for wasm code density if these operations are used frequently. That said, I currently suspect that optimizing the native code generated for these patterns is already not too difficult without adding new instructions, since there's data-flow connecting all the relevant parts.

So I'm currently uncertain whether there's a strong justification for extending the instruction set for those particular operations. But we haven't yet found a benchmark where this pattern is performance-critical and we'd appreciate any suggestions!

@dschuff
Copy link
Member

dschuff commented Jul 25, 2024

Hm, good point about the comparing the result to the operands; I was thinking of what checked-math libraries typically do in C, where that's exactly what you can't do because of UB.

I would think that basically any code compiled with UBSan would exercise this pretty heavily; or if you want to limit to just integer overflow, compile with -ftrapv (which I have also seen in used production/release code).

-ftrapv code is full of sequences like this:

start:
  %2 = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %0, i32 %1)
  %3 = extractvalue { i32, i1 } %2, 1
  br i1 %3, label %trap, label %cont
trap:
  ; call a sanitizer runtime function
  unreachable
cont: 
  %4 = extractvalue { i32, i1 } %2, 0
  ret i32 %4 ; or otherwise continue
}

On wasm today we have something like

# %bb.0:                                # %start
	block
	i32.const	$push0=, 0
	i32.lt_s	$push1=, $1, $pop0
	i32.add 	$push5=, $0, $1
	local.tee	$push4=, $1=, $pop5
	i32.lt_s	$push2=, $pop4, $0
	i32.eq  	$push3=, $pop1, $pop2
	br_if   	0, $pop3                        # 0: down to label0
# %bb.1:                                # %trap
	unreachable
.LBB1_2:                                # %cont
	end_block                               # label0:
	local.copy	$push6=, $1
                                        # fallthrough-return

The 64-bit version has a similar sequence for add. It uses a compiler-rt call for multiply, and I think it could use the proposed 128-bit multiply in its current form similarly.

On x86, those sequences reduce to an add or mul, followed by a jo, which would still be a runtime improvement over what I'm guessing wasm engines will produce today for the version with explicit compares. It is possible to pattern-match this "user-space" overflow-checking code back into explicit-overflow bits (I believe clang may actually do that), but of course that's one of those questions about whether that kind of analysis should go into the toolchain vs the engine, etc.

I wonder if it would be worth looking through various requests we've had for these features in the past, and see if there were other concrete use cases mentioned.

@ppenzin
Copy link

ppenzin commented Jul 26, 2024

  • Arbitrary-precision integer arithmetic with 64-bit limbs. Used among other places in the implementation of public-key cryptography such as RSA.

There has been a proposal to standardize cryptographic operations, but it didn't progress as browsers have native crypto which doesn't need wasm support, I really suspect it would be the same here.

From the get-go, I know of two non-cryptographic uses that might benefit from this proposal:

1. https://lemire.me/blog/2019/03/19/the-fastest-conventional-random-number-generator-that-can-pass-big-crush/

2. A 64-bit adaptation of https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/

How often are these particular algorithms used? NumPy is using SFMT if I am not mistaken.

@alexcrichton
Copy link

alexcrichton commented Jul 26, 2024

@dschuff for some comparisons (hah!) these are some godbolt links showing some various steps of overflowing arithmetic and comparisons (using Rust as I'm more comfortable with that than C but it's the same LLVM IR)

This is also just comparing Wasmtime's native output to LLVM's, not other wasm compilers'. For the unsigned case the code is "pretty good" but uses lea + cmp instead of just an add to check for overflow. For the signed case the Wasmtime generated code is far behind the single add LLVM emits.

Given this I suspect that the best way to evaluate the performance of this would be to find a benchmark in wasm bottlenecked on signed integer overflow. The two related issues I found in this repo were #1021 and #1472. I suspect it's correct that overflow flags and saturating flags and such are much easier with *_overflow_* instructions than it is with pattern matching today (although @jameysharp is a bit more optimistic than I as well about being able to optimize this well in Wasmtime).

Thanks for the tip about -ftrapv (reminds me as well about -Coverflow-checks in Rust). I suspect porting ubsan to wasm would be a bit more involved than just making this fast but being able to turn on these might lend itself well to "find an arithmetic-heavy benchmark and turn on overflow flags".


@ppenzin while I agree that the motivation for browsers to assist with improving wasm-based crypotography is low due to the existence of JS APIs both @jameysharp and I work on Wasmtime which is an out-of-browser WebAssembly runtime which doesn't have the ability to use such APIs as a fallback.

@alexcrichton
Copy link

alexcrichton commented Jul 26, 2024

I did a small benchmark today. I took the wasm-tools repository which has a few benchmarks for parsing/validating wasm files. I compiled the benchmarks to native and WebAssembly with and without -Coverflow-checks (the Rust equivalent of -ftrapv). I then compared the wasm versions to their native counterpart, and then compared the wasm versions together.

Overall on this particular benchmark, with Wasmtime, wasm sees about a 50-60% slowdown relative to native. With -Coverflow-checks enabled the slowdown is 5-10% higher. For example when validating the integemm-simd module Wasmtime is 62% slower than native when both native and wasm do not have overflow checks. When both wasm and native have overflow checks enabled Wasmtime is 68% slower. (this is all with baseline wasm today)

That's perhaps a gut-check that overflow checks and -ftrapv may not be likely to impact most runtimes by default and a better selection of a benchmark to more greatly show the effects of overflow overhead is required to better evaluate the instructions.

@lpereira
Copy link

lpereira commented Jul 31, 2024

  1. https://lemire.me/blog/2019/03/19/the-fastest-conventional-random-number-generator-that-can-pass-big-crush/
  2. A 64-bit adaptation of https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/

How often are these particular algorithms used? NumPy is using SFMT if I am not mistaken.

The Lwan web server uses (2) in a perfect hash table generator, and (1) in its PRNG, FWIW.

@ppenzin
Copy link

ppenzin commented Aug 13, 2024

The Lwan web server uses (2) in a perfect hash table generator, and (1) in its PRNG, FWIW.

I've looked into Lwan in the past, it is pretty cool. What is the relationship to wasm here though - is it expected to run in wasm?

One more comment in general here - wasmtime is just one standalone engine, @alexcrichton @jameysharp do you know if the same perf gap would persist in other engines? For example engines that use LLVM (WAMR & Wasmer with that backend turned on), or even something like wasm2c+clang?

@alexcrichton
Copy link

A new repository for this now-phase-1 proposal has been created at https://github.com/WebAssembly/128-bit-arithmetic (thanks Thomas!)

I'm in the process of moving information over to there. I'm going to try to capture open questions and routes-to-investigate in GitHub issues and the current state of the proposal in Overview.md. Right now it's a bunch of "TODO" comments but I hope to fill it in over this week.

Interested parties from this issue are encouraged to browse the repository, open issues, etc. I'll come back to this issue once all the various content has been moved over and this can be closed.

@alexcrichton
Copy link

Ok the repository has been filled out with an overview in WebAssembly/wide-arithmetic#8, issues are opened at https://github.com/WebAssembly/128-bit-arithmetic, and there's some more discussion of overflow flags at WebAssembly/wide-arithmetic#10.

@jameysharp if you're ok with it I think it'd be ok to close this issue now in favor of the proposal repository.

@waterlens
Copy link

waterlens commented Aug 23, 2024

  • A naive implementation of the instructions repeatedly moved the carry flag between a general-purpose register and the flags register. Avoiding this movement would require relatively advanced instruction fusing. Even then, loops typically require instructions that clobber flags, so such optimizations would only apply to straightline code, which we suspect is not the common case for these operations. Our experiments were on x86 but we suspect this observation applies to most current CPU instruction sets.
  • Another difficulty is that these instructions produce multiple results, which we have observed are difficult to integrate into many optimization frameworks. The benchmarks show below that these sorts of optimizations are likely to be crucial for performance, and if not implemented the new instructions are detrimental to performance.

In contrast, with 128-bit operations the need to fuse with other instructions is much lower and it seems that "simply having support" is a pretty good sweet-spot both performance and complexity-wise.

In addition, we considered a "wide multiply" i64.mul_wide_{u,s} with type [i64 i64] -> [i64 i64] that takes two i64 inputs and produces both the low and high halves of the 128-bit result. This is the minimal operation needed for arbitrary-precision multiplication. It is straightforward to build a full 128-bit multiply from such an instruction, and the naive implementation of this instruction performed better than the i64.mul128 instruction benchmarks below. This maps closely to the x64 mul instruction as well as equivalent instructions on AArch64 for example. While this "primitive" worked for multiplication the corresponding primitive for addition/subtraction did not work out (the *_overflow_* instruction above), so we opted to add i64.mul128 instead of this instruction.

Just for your information, being consistent with your experiments, existing compilers usually do badly in keeping carry flags in different iterations of a loop. And that's why GNU MP Bignum Library still maintains a lot of assembly code in their repo(amd64, arm64) for simple big number addition. Actually, the costs are avoidable because it's possible to use instructions that don't destroy the flags to continue the loop. It would be a great victory if someday we could add the missing optimizations for keeping the carry flags between loops, especially in a compiler backend for WebAssembly.

The 128-bit arithmetic instructions are good alternatives to these low-level primitives (*_overflow_*, *_with_carry_*, *mul_wide_*). But it may also obscure the produced code if the user code does want to use these low-level primitives (It's quite similar to the case of risc-v which requires extra shiftings, compared with x86 and aarch64). It eventually would require more sophisticated pattern matching to optimize. That's why I think providing low-level primitives are more preferable approach (of course, it also requires extra optimization to work, but it's the only correct approach to make it comparable in performance on par with native code in GMP).

@rossberg
Copy link
Member

I agree with @waterlens. Bignums are a feature of various languages these days, and libraries like GMP or LibTomMath are used in many other cases as well. It would be a bummer to invent overly specialised (non-CPU) primitives that happen to work for 128 bit but are useless for implementing larger number types.

@waterlens
Copy link

I agree with @waterlens. Bignums are a feature of various languages these days, and libraries like GMP or LibTomMath are used in many other cases as well. It would be a bummer to invent overly specialised (non-CPU) primitives that happen to work for 128 bit but are useless for implementing larger number types.

Thanks for supporting my opinion! But I want to clarify that i128 operations are certainly helpful in improving performance of big number operations. However, it's not that good from a semantics perspective. A high-performance compiler must identify that "Oh, this program used i128 operations, but because some of the inputs are constant 0 or 1, I now know it actually wants some other operations for 64-bit integers.", which sounds weird.

@alexcrichton
Copy link

FWIW I've opened WebAssembly/wide-arithmetic#6 for further discussion on flags-vs-not and the overview now has a section explaining in more detail the rationale for not choosing overflowing operations at this time.

One perhaps interesting point is that if GMP, even today, maintains handwritten assembly despite high-quality compilers like LLVM/GCC that would imply to me that the space for representing the desired operations in low-level primitives has been relatively well explored and the conclusion was that handwritten assembly is still the best option. Given that while GMP could certainly continue to write handwritten instructions it would be extra important that wasm engines compile to the exact desired native instructions which, given my investigation so far, will likely be significantly difficult.

@mratsim
Copy link

mratsim commented Jan 6, 2025

One perhaps interesting point is that if GMP, even today, maintains handwritten assembly despite high-quality compilers like LLVM/GCC that would imply to me that the space for representing the desired operations in low-level primitives has been relatively well explored and the conclusion was that handwritten assembly is still the best option.

GCC is not a high-quality compiler for big integer / cryptography and has consistently been lagging behind for years behind Clang on carry flag handling (see mratsim/constantine#357). Big integer code compiled with GCC can easily be 25% slower than Clang.

This may change with C23 and the internal refactoring needed to efficiently support _BitInt for builtin C wide-integer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests