Faster const-time normalization #1028

Open · wants to merge 1 commit into base: master

Conversation

peterdettman (Contributor):

Use a "borrowing" trick in _fe_normalize to simplify the handling of values in [P, 2^256). This is significantly faster too (according to bench_internal), though performance of _fe_normalize isn't that important.

Applying a similar idea to _fe_normalizes_to_zero also results in a performance boost, and in this case it appears to boost signing speed by 2% or more (also ECDH), due to three calls in _gej_add_ge.
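
To illustrate the trick in isolation, here is a toy one-word analogue (illustrative only, not the code in this PR, and the helper name is made up): reduce hi*2^64 + lo modulo a pseudo-Mersenne p = 2^64 - C with hi small. Folding hi + 1 copies of C instead of hi adds one extra copy of 2^64 - p; subtracting C back only when the fold does not carry then keeps the result out of [p, 2^64), because dropping a carry bit is itself equivalent to subtracting 2^64 - p.

#include <stdint.h>

/* Toy analogue of the "borrowing" normalization, NOT the PR's code.
 * p = 2^64 - C; C reuses the secp256k1 fold constant purely for illustration.
 * Requires hi <= 2^16 so the fold carries out of the low word at most once.
 * Uses the GCC/Clang unsigned __int128 extension for brevity. */
#define C 0x1000003D1ULL

static uint64_t normalize_toy(uint64_t hi, uint64_t lo) {
    /* First pass: fold hi + 1 copies of C, i.e. add one extra 2^64 - p. */
    unsigned __int128 t = (unsigned __int128)lo + (unsigned __int128)(hi + 1) * C;
    uint64_t low   = (uint64_t)t;
    uint64_t carry = (uint64_t)(t >> 64);   /* 0 or 1 given hi <= 2^16 */
    /* Second pass: if there was no carry, subtract the extra C back; if there
     * was, dropping the carry already cancelled it (mod p) and the remainder
     * is far below p. Either way the result lands in [0, p). */
    return low - (1 - carry) * C;
}

In the multi-limb field code the same shape appears limb by limb, with the second-pass subtraction done via borrow propagation instead of a widening multiply.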

@robot-dreams (Contributor) left a comment:

This is a really nice idea! I just had one inline question.


The updated versions of fe_normalize and fe_normalizes_to_zero indeed look faster, and a local benchmark run seems to agree:

Benchmarks before

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
field_normalize               ,     0.00968   ,     0.00973   ,     0.00983
group_add_affine              ,     0.189     ,     0.196     ,     0.205
ecdsa_sign                    ,    22.3       ,    22.7       ,    23.9

Benchmarks after

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
field_normalize               ,     0.00752   ,     0.00756   ,     0.00763
group_add_affine              ,     0.189     ,     0.189     ,     0.191
ecdsa_sign                    ,    22.3       ,    22.4       ,    22.5

I also wrote down some notes as I was trying to understand the change, and I'll post them here in case they're useful for future reference.

Why it's not possible for the result to be in [P, 2^256)

If there's no final carry bit, then the second pass will subtract 2^256 - P from a value that's at most 2^256 - 1, resulting in a value that's at most P - 1.

If there was a carry, the largest possible remaining value (after dropping the carry bit) is 65536 * 0x1000003D1 - 1 for 5x52 and 1024 * 0x1000003D1 - 1 for 10x26, both of which are well below P (at most x * 0x1000003D1 was folded in, and dropping the carry removes a full 2^256).

(If there was a carry bit, clearing it is equivalent (mod P) to decrementing by 2^256 - P, which cancels out the increment before the first pass.)
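
As a quick sanity check on that bound (my own arithmetic, not from the PR): x = (t4 >> 48) + 1 is at most 65536 in the 5x52 case, so what survives after a carry is at most 65536 * 0x1000003D1 - 1.

#include <assert.h>
#include <stdint.h>

/* Illustrative bound check only: 65536 * 0x1000003D1 = 2^48 + 65536*977,
 * comfortably below 2^49 and therefore nowhere near P. */
int main(void) {
    uint64_t bound = 65536ULL * 0x1000003D1ULL;
    assert(bound < (1ULL << 49));
    return 0;
}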

Where the 63-bit shift (in the 5x52 version) came from

The top (64th) bit of the word is nonzero if and only if the subtraction underflowed (2^256 - P is only a 33-bit value, so the subtraction cannot underflow so far that the top bit ends up cleared again).

Thus subtracting t >> 63 is the same logic as "if subtraction on the previous limb caused an underflow then subtract 1, otherwise subtract 0".

If we store a 52-bit value in a 64-bit word, underflowing it and then keeping the bottom 52 bits is equivalent to underflowing the same value stored in a 52-bit word.

(An analogous explanation works for where the 31-bit shift came from in the 10x26 case.)
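
A minimal standalone sketch of that borrow pattern (illustrative only, not the PR's exact second pass, and the function name is made up): subtract the 33-bit constant from the lowest limb, then for each higher limb subtract the previous word's bit 63 as the borrow and mask back down to 52 bits.

#include <stdint.h>

#define M52 0xFFFFFFFFFFFFFULL   /* 52-bit limb mask */

/* Borrow chain over three 52-bit limbs held in 64-bit words (each limb
 * < 2^52 on entry). After a subtraction, bit 63 of the word is set exactly
 * when the 52-bit limb underflowed, so (t >> 63) is the borrow to pass into
 * the next limb. Any borrow out of the top limb is returned to the caller. */
static uint64_t sub_33bit_const(uint64_t t[3], uint64_t c33) {
    uint64_t borrow;
    t[0] -= c33;
    t[1] -= t[0] >> 63;  t[0] &= M52;
    t[2] -= t[1] >> 63;  t[1] &= M52;
    borrow = t[2] >> 63; t[2] &= M52;
    return borrow;
}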

/* Reduce t4 at the start so there will be at most a single carry from the first pass.
* x is incremented before the first pass and then decremented before the second pass
* to ensure that the result doesn't fall into the range [P, 2^256). */
uint64_t x = (t4 >> 48) + 1; t4 &= 0x0FFFFFFFFFFFFULL;

/* The first pass ensures the magnitude is 1, ... */
t0 += x * 0x1000003D1ULL;
@robot-dreams (Contributor):

Given that x is bigger than before, how do we know that t0 += x * 0x1000003D1ULL won't overflow the uint64_t?

peterdettman (Contributor, Author):

Most, if not all, callers use field elements with a maximum magnitude of 8, so there's plenty of room. The practical limit is actually around 31 (because field_10x26 can only represent magnitude 32 and several methods assume there's a free bit available for carries on entry). I think it would be a good idea to actually enforce some exact magnitude limit on input to field methods though (probably 16).
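
For concreteness, a rough headroom estimate (the limb bounds below are my reading of the verify checks, so treat them as assumptions rather than something stated here): magnitude m in 5x52 means t0 <= 2*m*(2^52 - 1) and t4 <= 2*m*(2^48 - 1), hence x = (t4 >> 48) + 1 <= 2*m, and even at the representation ceiling m = 32 the first-pass addition stays far below 2^64.

#include <assert.h>
#include <stdint.h>

/* Illustrative overflow-headroom check at the assumed 5x52 ceiling m = 32. */
int main(void) {
    const uint64_t m      = 32;
    const uint64_t t0_max = 2 * m * 0xFFFFFFFFFFFFFULL;     /* 2*m*(2^52 - 1) < 2^58 */
    const uint64_t x_max  = 2 * m;                          /* from t4 <= 2*m*(2^48 - 1) */
    assert(t0_max <= UINT64_MAX - x_max * 0x1000003D1ULL);  /* t0 += x*C cannot wrap */
    return 0;
}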

@robot-dreams (Contributor) left a comment:

ACK 4bbb661

My only concern from before was magnitude / overflow. I agree it'd be a good idea to enforce some exact magnitude limit and/or more carefully audit maximum possible magnitudes. But I don't think it should block this PR—especially since I also tried running tests with VERIFY_CHECK(r->magnitude <= 16) at the top of fe_normalize and it passed.
