Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary GCD #755

Open
wants to merge 76 commits into
base: master
Choose a base branch
from
Open

Binary GCD #755

wants to merge 76 commits into from

Conversation

erik-3milabs
Copy link
Contributor

This PR introduces an implementation of Optimized Binary GCD. Ref: Pornin, Algorithm 2.

Upsides to this technique:

  • it is up to 27x faster than the gcd algorithm currently implemented in crypto_bigint (see below) (really, it just has a different complexity bound).
  • does not need UNSAT_LIMBS
  • it is actually constant time (unlike the current implementation, which is sneakily vartime in the maximum of the bitsizes of the two operands).

Benchmark results

Word = u64

limbs gcd (vt) gcd (ct) new_gcd (ct)
2 10.687 µs 20.619 µs 3.6090 µs
4 29.121 µs 56.433 µs 7.1124 µs
8 99.819 µs 195.02 µs 16.184 µs
16 359.39 µs 710.26 µs 44.294 µs
32 1.6804 ms 3.3097 ms 136.49 µs
64 6.9717 ms 13.028 ms 494.16 µs
128 29.099 ms 57.325 ms 2.3335 ms
256 143.22 ms 244.89 ms 8.7722 ms

Copy link

@dolevmu dolevmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall. I find the implementation constant time. Offered a few annotations and documentations to clarify the code after going over it, and suggested some changes to boost performance. I think it is worth reading and considering, but I approve this.

Great work thanks @erik-3milabs !

Comment on lines 30 to 31
// Todo: tweak this threshold
if LIMBS < 8 {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, but looks like you answered it, your problem is that the answer may depend on the machine. Suppose I gave you an answer, do you know what to do in the code? If I say "this is optimal for 64-bit architecture and that for 32-bit".. If so, let's check on another machine.

b = Uint::select(&b, &b.wrapping_sub(&a), b_odd);
matrix.conditional_subtract_top_row_from_bottom(b_odd);

// Div b by two and double the top row of the matrix when a, b ≠ 0.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to Algorithm 2, you multiply by 2 always, but I suppose you know what you are doing and he is wrong on this edge case.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it multiplies the second row and not the first one (f1,g1) is multiplied by two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This is a modification I made to the algorithm. It prevents integer overflows in an upcoming PR. I've updated the docstring of this function to indicate this.
  2. yes, but this code has a and b swapped, so I should be doubling the first row ;-)

Note: that doubling problem was fixed because I reverted the (a, b) swap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out, this is wrong: this leads to buggy behavior for specific pairs of numbers. I'm refactoring some code to bypass this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in #761

@@ -25,6 +25,12 @@ impl<const LIMBS: usize> Uint<LIMBS> {
Uint { limbs }
}

/// Swap `a` and `b` if `c` is truthy, otherwise, do nothing.
#[inline]
pub(crate) const fn conditional_swap(a: &mut Self, b: &mut Self, c: ConstChoice) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of my previous comment, it is worth to consider if we can efficiently select between two vectors or matrices, and similarly swap columns/rows efficiently (which you did implement).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarcieri do you have any ideas on improving the performance of this and other swapping operations?

Uint::conditional_swap(&mut a, &mut b, do_swap);

// subtract b from a when a is odd
a = a.wrapping_sub(&Uint::select(&Uint::ZERO, &b, a_odd));
Copy link
Contributor Author

@erik-3milabs erik-3milabs Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarcieri what do you think of this line? Previously, it was like this:

a = Uint::select(&a, &a.wrapping_sub(&b), a_odd);

The current code is 25-10% faster for Uints with few limbs (1, 2, 3, etc.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised there's that much of a difference. Are you sure it's always faster or is it faster depending on a?

Copy link
Contributor Author

@erik-3milabs erik-3milabs Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty confident it is always faster.

The reason I think this is faster, is that we are now selecting between a constant and a variable, instead of two variables. Given that select is const and loves to be inlined, the compiler can now optimize the select operation.

Recall, Uint::select calls Limb::select, which in turn calls

impl ConstChoice {
    /// Return `b` if `self` is truthy, otherwise return `a`.
    #[inline]
    pub(crate) const fn select_word(&self, a: Word, b: Word) -> Word {
        a ^ (self.0 & (a ^ b))
    }
}

When a is the constant ZERO, this can be optimized as:

        self.0 & b

saving two XOR operations, or 2/3's of this operation.

Returning to the gcd subroutine, this select is in the hot loop of this algorithm. In total, the loop executes:

  • Uint::is_odd (1 op)
  • Uint::lt (2 ops/word),
  • ConstChoice::and (1 op),
  • Uint::wrapping_sub (4 ops/word),
  • Uint::select (3 ops/word -> 1 ops/word)
  • Uint::shr (3 ops/word)

So, there is a reduction from 12 to 10 ops/word, or a 17% improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a is the constant ZERO

Aah, ok

@erik-3milabs erik-3milabs mentioned this pull request Feb 14, 2025
@tarcieri
Copy link
Member

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead

@erik-3milabs
Copy link
Contributor Author

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead

@tarcieri While I could still modify the Gcd trait to use this algorithm, this PR does not yet introduce the tools necessary to replace InvMod.

Aside from that, what else would be required to see this PR merged?

@tarcieri
Copy link
Member

tarcieri commented Mar 11, 2025

Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.

It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with const_mut_refs stable it should be a lot easier to share an implementation).

Also seems it needs a rebase due to upstream changes.

@erik-3milabs
Copy link
Contributor Author

Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.

You're right. This PR only introduces the gcd algorithm; PR #761 extends the algorithm into xgcd. Stripping some things from the xgcd algorithm gives invmod. Given that I don't need invmod myself, I am not too keen on implementing it 🙈

It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with const_mut_refs stable it should be a lot easier to share an implementation).

I agree that having two algorithms is overkill. Let me see about implementing this for Boxed<X> as well.

Also seems it needs a rebase due to upstream changes.

Yeah, you're right. Let me address that right away.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants