
Rework of 10x26 field mul/sqr #815

Open · wants to merge 3 commits into master
Conversation

peterdettman
Copy link
Contributor

  • avoid overly-wide multiplications
  • save a few multiplies, masks and shifts
  • final residual left in r[9] instead of r[2]

@gmaxwell It looks faster to me, but if you could collect some results for real 32-bit hardware that would be very helpful.

@real-or-random
Copy link
Contributor

@peterdettman That's great! I didn't spot the widemul optimizations even though I was staring at this code for some time.

I get the general idea, but can you describe the approach more systematically? E.g., what procedure did you follow to arrive at these changes? I think this helps reviewers, and then we don't need to rediscover your thoughts.

@peterdettman
Copy link
Contributor Author

The overall algorithm is "schoolbook" multiplication to calculate 19 products p00..p18, with the high part reduced modulo the field prime P using multiplication by (2^256 - P), and a final carry propagation to ensure a magnitude 1 output.

Operationally, we run two accumulators, 10 limbs (260 bits of "weight") apart, and interleave the modular reduction: the least-significant limb of the upper accumulator can be accumulated into the lower accumulator after multiplying by [ R1, R0 ], where (R1 << 26) + R0 == (2^256 - P) << 4 (the << 4 is because 10 limbs is 260 bits, not 256).

The accumulation begins with p07, p17. d accumulates p07..p16, while c accumulates p17..p18 as the upper accumulator then p00..p06 as the lower accumulator. There are a couple of final carry propagation steps for c. The choice of starting accumulator positions affects where the final carry will land. There is an ulterior motive for finishing the carry propagation at r[9]: it will support an increased magnitude limit for field elements of 16 (currently 8). a[9], b[9] also have their input bounds changed to 27 bits (from 26) in anticipation of this.

Squaring is the same, modified only so that duplicate limb multiplications are consolidated when calculating the products p00..p18.

@peterdettman
Copy link
Contributor Author

I'm intending to push arbitrary-degree Karatsuba changes in this PR shortly, but I'd like to get some baseline performance numbers in place first.

@peterdettman
Copy link
Contributor Author

I'm intending to push arbitrary-degree Karatsuba changes in this PR shortly

The latest commit adds this (only relevant to mul, not sqr). It saves 45 limb multiplies, but adds the equivalent of 52 limb additions (counting 64-bit adds as 2). However that may not be the full story; from the paper:

However an inspection of the generated code also confirmed our suspicion that the ADK code generated more register-register move instructions and memory accesses. On some processors this could offset the ADK advantage.

On my 64-bit machine this is slower (as expected), but on real 32-bit hardware there's a good chance this is faster. I'd need someone to help with measuring that though. From what I can see on godbolt the choice of compiler leads to large differences in instruction count.

@peterdettman
Copy link
Contributor Author

@gmaxwell I don't want to impose on you, but I had hoped you might collect some benchmarks for this PR.

@gmaxwell
Copy link
Contributor

Whoops! Will do. I'm excited about this work, but I've been redoing all the power at home, have had systems in various states of off and disconnected, couldn't immediately try it, and then forgot about it. Give me a day or two.

@peterdettman
Copy link
Contributor Author

peterdettman commented May 7, 2021

@gmaxwell Still hoping for some benchmarks here (separately for the first "rework" commit and the latter "Karatsuba" one).

If/when this is accepted I am hoping someone can update the corresponding asm to leave its final carry in the top limb (or in case Karatsuba proves fast, either abandon the asm or reimplement it along those lines).

At that point field implementations could be modified to support a maximum magnitude of 16. Some easy (but small) performance improvements then follow, and I am expecting it will give a little more room for per-group-operation magnitude limits (on input and output field elements) to be profitable.

@gmaxwell
Copy link
Contributor

gmaxwell commented May 7, 2021

@peterdettman Oh crap. I am terribly sorry for completely forgetting your request.

I've tested on two different 32-bit ARM systems: a faster superscalar Cortex-A9, and a slower in-order Cortex-A8 (which can dual-issue, but only if the instructions are paired right).

Common hardware wallets are usually Cortex-M4, which like the A8 is an in-order core with dual issue, but due to pipeline differences I'm not sure how similar the results would be. Some use the M3, where a 32x32->64 multiply also takes 3-5 cycles "depending on values" (uh oh). The slower M0 has a 32-cycle multiply. I'm not aware of anything using this library on those cores, though I suppose Karatsuba would be a huge win there. (I'd kinda like to see those numbers!)

Results seem a little paradoxical. On the A8, the rework seems to slow things down and Karatsuba speeds the mul back up. On the A9, the rework is a slight speedup, but Karatsuba is a slight slowdown (though still faster than master).

With the older GCC the rework is a huge speedup.

As the ASM (which is IIRC the same algorithm as master, just with hand scheduling and register allocation) and the older GCC results show, the performance is really sensitive to compiler decisions. This might make it hard to draw conclusions about the algorithms overall. I wouldn't be surprised if the small disadvantage for Karatsuba on the A9 went away with somewhat more careful scheduling.

May be of interest to @laanwj too.


i.MX53 (Cortex-a8)
gcc version 9.3.0 (GCC)

Master w/ arm asm:
field_sqr: min 0.598us / avg 0.598us / max 0.599us
field_mul: min 0.810us / avg 0.811us / max 0.812us
ecdsa_verify: min 1556us / avg 1556us / max 1557us

Master:
field_sqr: min 0.666us / avg 0.666us / max 0.668us
field_mul: min 1.10us / avg 1.10us / max 1.11us
ecdsa_verify: min 1895us / avg 1896us / max 1896us

Rework:
field_sqr: min 0.716us / avg 0.718us / max 0.728us
field_mul: min 1.18us / avg 1.18us / max 1.19us
ecdsa_verify: min 2025us / avg 2025us / max 2027us

Karatsuba:
field_sqr: min 0.715us / avg 0.715us / max 0.717us
field_mul: min 0.983us / avg 0.984us / max 0.986us
ecdsa_verify: min 1829us / avg 1829us / max 1829us


i.MX6 (Cortex-a9)
gcc version 9.3.0 (GCC)

Master w/ arm asm:
field_sqr: min 0.292us / avg 0.292us / max 0.292us
field_mul: min 0.361us / avg 0.361us / max 0.362us
ecdsa_verify: min 740us / avg 740us / max 740us

Master:
field_sqr: min 0.342us / avg 0.343us / max 0.343us
field_mul: min 0.496us / avg 0.496us / max 0.497us
ecdsa_verify: min 919us / avg 919us / max 920us

Rework:
field_sqr: min 0.329us / avg 0.330us / max 0.330us
field_mul: min 0.495us / avg 0.495us / max 0.496us
ecdsa_verify: min 906us / avg 906us / max 907us

Karatsuba:
field_sqr: min 0.331us / avg 0.331us / max 0.331us
field_mul: min 0.500us / avg 0.501us / max 0.501us
ecdsa_verify: min 910us / avg 910us / max 911us


i.MX6 (Cortex-a9)
gcc version 4.9.2 (Debian 4.9.2-10+deb8u2)

Master w/ arm asm:
field_sqr: min 0.292us / avg 0.292us / max 0.292us
field_mul: min 0.361us / avg 0.361us / max 0.361us
ecdsa_verify: min 739us / avg 739us / max 740us

Master:
field_sqr: min 0.815us / avg 0.815us / max 0.816us
field_mul: min 0.940us / avg 0.941us / max 0.942us
ecdsa_verify: min 1755us / avg 1756us / max 1757us

Rework:
field_sqr: min 0.554us / avg 0.555us / max 0.556us
field_mul: min 0.597us / avg 0.598us / max 0.599us
ecdsa_verify: min 1164us / avg 1165us / max 1171us

Karatsuba:
field_sqr: min 0.555us / avg 0.556us / max 0.557us
field_mul: min 0.662us / avg 0.665us / max 0.668us
ecdsa_verify: min 1228us / avg 1229us / max 1229us

@gmaxwell
Copy link
Contributor

gmaxwell commented May 9, 2021

Given that it was so compiler sensitive I thought it would be useful to upgrade to a newer GCC and try clang too:


Cortex-a8 gcc version 10.2.0 (GCC)
Master:
field_sqr: min 0.658us / avg 0.658us / max 0.668us
field_mul: min 1.13us / avg 1.13us / max 1.14us

Rework:
field_sqr: min 0.666us / avg 0.668us / max 0.671us
field_mul: min 1.03us / avg 1.03us / max 1.04us

Karatsuba:
field_sqr: min 0.667us / avg 0.667us / max 0.675us
field_mul: min 1.07us / avg 1.07us / max 1.08us


Cortex-a8 clang version 11.1.0
Master:
field_sqr: min 0.614us / avg 0.620us / max 0.643us
field_mul: min 1.33us / avg 1.34us / max 1.35us

Rework:
field_sqr: min 0.747us / avg 0.750us / max 0.760us
field_mul: min 1.35us / avg 1.36us / max 1.37us

Karatsuba:
field_sqr: min 0.746us / avg 0.752us / max 0.786us
field_mul: min 1.01us / avg 1.02us / max 1.03us

...chaos

@peterdettman
Copy link
Contributor Author

Thanks, @gmaxwell. Somewhat irksome results though.

In the rework I tried loading a0..a9 up front so that the results can be written out to r immediately, but maybe it's worth testing a variation that doesn't attempt that change (no a0..a9; instead restore t0..t6 and write to r only after all reads from a are done).

Karatsuba might do better with manual scheduling as you say, so I hope someone can try an asm version of it (well of both versions ideally).

@sipa
Copy link
Contributor

sipa commented May 9, 2023

If this isn't faster with recent compilers, it's unclear whether we should spend time on this. @peterdettman, is there a smaller set of changes that we could consider for benchmarking?

With work progressing on fiat-crypto (#1261), changes like this may become more appealing if they can be integrated into its optimization framework, as that simplifies the job of convincing ourselves they're correct.
