Faster const-time normalization #1028

Open · wants to merge 1 commit into base: master

Conversation

peterdettman (Contributor):

Use a "borrowing" trick in _fe_normalize to simplify the handling of values in [P, 2^256). This is significantly faster too (according to bench_internal), though performance of _fe_normalize isn't that important.

Applying a similar idea to _fe_normalizes_to_zero also results in a performance boost, and in this case it appears to boost signing speed by 2% or more (also ECDH), due to three calls in _gej_add_ge.
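
To illustrate the trick in isolation, here is a toy one-word analogue (illustrative only, not the code in this PR, and the helper name is made up): reduce hi*2^64 + lo modulo a pseudo-Mersenne p = 2^64 - C with hi small. Folding hi + 1 copies of C instead of hi adds one extra copy of 2^64 - p; subtracting C back only when the fold does not carry then keeps the result out of [p, 2^64), because dropping a carry bit is itself equivalent to subtracting 2^64 - p.

#include <stdint.h>

/* Toy analogue of the "borrowing" normalization, NOT the PR's code.
 * p = 2^64 - C; C reuses the secp256k1 fold constant purely for illustration.
 * Requires hi <= 2^16 so the fold carries out of the low word at most once.
 * Uses the GCC/Clang unsigned __int128 extension for brevity. */
#define C 0x1000003D1ULL

static uint64_t normalize_toy(uint64_t hi, uint64_t lo) {
    /* First pass: fold hi + 1 copies of C, i.e. add one extra 2^64 - p. */
    unsigned __int128 t = (unsigned __int128)lo + (unsigned __int128)(hi + 1) * C;
    uint64_t low   = (uint64_t)t;
    uint64_t carry = (uint64_t)(t >> 64);   /* 0 or 1 given hi <= 2^16 */
    /* Second pass: if there was no carry, subtract the extra C back; if there
     * was, dropping the carry already cancelled it (mod p) and the remainder
     * is far below p. Either way the result lands in [0, p). */
    return low - (1 - carry) * C;
}

In the multi-limb field code the same shape appears limb by limb, with the second-pass subtraction done via borrow propagation instead of a widening multiply.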

@robot-dreams (Contributor) left a comment:

This is a really nice idea! I just had one inline question.


The updated versions of fe_normalize and fe_normalizes_to_zero indeed look faster, and a local benchmark run seems to agree:

Benchmarks before

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
field_normalize               ,     0.00968   ,     0.00973   ,     0.00983
group_add_affine              ,     0.189     ,     0.196     ,     0.205
ecdsa_sign                    ,    22.3       ,    22.7       ,    23.9

Benchmarks after

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
field_normalize               ,     0.00752   ,     0.00756   ,     0.00763
group_add_affine              ,     0.189     ,     0.189     ,     0.191
ecdsa_sign                    ,    22.3       ,    22.4       ,    22.5

I also wrote down some notes as I was trying to understand the change, and I'll post them here in case they're useful for future reference.

Why it's not possible for the result to be in [P, 2^256)

If there's no final carry bit, then the second pass will subtract 2^256 - P from a value that's at most 2^256 - 1, resulting in a value that's at most P - 1.

If there was a carry, the largest possible remaining value (after dropping the carry bit) is 65536 * 0x1000003D1 - 1 for 5x52 and 1024 * 0x1000003D1 - 1 for 10x26, both of which are well below P (at most x * 0x1000003D1 was folded in, and dropping the carry removes a full 2^256).

(If there was a carry bit, clearing it is equivalent (mod P) to decrementing by 2^256 - P, which cancels out the increment before the first pass.)
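
As a quick sanity check on that bound (my own arithmetic, not from the PR): x = (t4 >> 48) + 1 is at most 65536 in the 5x52 case, so what survives after a carry is at most 65536 * 0x1000003D1 - 1.

#include <assert.h>
#include <stdint.h>

/* Illustrative bound check only: 65536 * 0x1000003D1 = 2^48 + 65536*977,
 * comfortably below 2^49 and therefore nowhere near P. */
int main(void) {
    uint64_t bound = 65536ULL * 0x1000003D1ULL;
    assert(bound < (1ULL << 49));
    return 0;
}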

Where the 63-bit shift (in the 5x52 version) came from

The top (64th) bit of the word is nonzero if and only if the subtraction underflowed (2^256 - P is only a 33-bit value, so the subtraction cannot underflow so far that the top bit ends up cleared again).

Thus subtracting t >> 63 is the same logic as "if subtraction on the previous limb caused an underflow then subtract 1, otherwise subtract 0".

If we store a 52-bit value in a 64-bit word, underflowing it and then keeping the bottom 52 bits is equivalent to underflowing the same value stored in a 52-bit word.

(An analogous explanation works for where the 31-bit shift came from in the 10x26 case.)
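
A minimal standalone sketch of that borrow pattern (illustrative only, not the PR's exact second pass, and the function name is made up): subtract the 33-bit constant from the lowest limb, then for each higher limb subtract the previous word's bit 63 as the borrow and mask back down to 52 bits.

#include <stdint.h>

#define M52 0xFFFFFFFFFFFFFULL   /* 52-bit limb mask */

/* Borrow chain over three 52-bit limbs held in 64-bit words (each limb
 * < 2^52 on entry). After a subtraction, bit 63 of the word is set exactly
 * when the 52-bit limb underflowed, so (t >> 63) is the borrow to pass into
 * the next limb. Any borrow out of the top limb is returned to the caller. */
static uint64_t sub_33bit_const(uint64_t t[3], uint64_t c33) {
    uint64_t borrow;
    t[0] -= c33;
    t[1] -= t[0] >> 63;  t[0] &= M52;
    t[2] -= t[1] >> 63;  t[1] &= M52;
    borrow = t[2] >> 63; t[2] &= M52;
    return borrow;
}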

/* Reduce t4 at the start so there will be at most a single carry from the first pass.
* x is incremented before the first pass and then decremented before the second pass
* to ensure that the result doesn't fall into the range [P, 2^256). */
uint64_t x = (t4 >> 48) + 1; t4 &= 0x0FFFFFFFFFFFFULL;

/* The first pass ensures the magnitude is 1, ... */
t0 += x * 0x1000003D1ULL;
@robot-dreams (Contributor):

Given that x is bigger than before, how do we know that t0 += x * 0x1000003D1ULL won't overflow the uint64_t?

peterdettman (Contributor, Author):

Most, if not all, callers use field elements with a maximum magnitude of 8, so there's plenty of room. The practical limit is actually around 31 (because field_10x26 can only represent magnitude 32 and several methods assume there's a free bit available for carries on entry). I think it would be a good idea to actually enforce some exact magnitude limit on input to field methods though (probably 16).
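
For concreteness, a rough headroom estimate (the limb bounds below are my reading of the verify checks, so treat them as assumptions rather than something stated here): magnitude m in 5x52 means t0 <= 2*m*(2^52 - 1) and t4 <= 2*m*(2^48 - 1), hence x = (t4 >> 48) + 1 <= 2*m, and even at the representation ceiling m = 32 the first-pass addition stays far below 2^64.

#include <assert.h>
#include <stdint.h>

/* Illustrative overflow-headroom check at the assumed 5x52 ceiling m = 32. */
int main(void) {
    const uint64_t m      = 32;
    const uint64_t t0_max = 2 * m * 0xFFFFFFFFFFFFFULL;     /* 2*m*(2^52 - 1) < 2^58 */
    const uint64_t x_max  = 2 * m;                          /* from t4 <= 2*m*(2^48 - 1) */
    assert(t0_max <= UINT64_MAX - x_max * 0x1000003D1ULL);  /* t0 += x*C cannot wrap */
    return 0;
}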

@robot-dreams (Contributor) left a comment:

ACK 4bbb661

My only concern from before was magnitude / overflow. I agree it'd be a good idea to enforce some exact magnitude limit and/or more carefully audit maximum possible magnitudes. But I don't think it should block this PR—especially since I also tried running tests with VERIFY_CHECK(r->magnitude <= 16) at the top of fe_normalize and it passed.
