Add native num implementation; modular inverse and Jacobi symbol without GMP #290
Conversation
@@ -117,7 +117,7 @@ AC_ARG_ENABLE(module_schnorr,
 AC_ARG_WITH([field], [AS_HELP_STRING([--with-field=64bit|32bit|auto],
 [Specify Field Implementation. Default is auto])],[req_field=$withval], [req_field=auto])

-AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|no|auto],
+AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|64bit|auto],
Seems there is a 32bit as well?
Thanks, fixed.
Force-pushed from 5726a17 to 3caa2a1
Needs rebase.
Needs rebase again.
Rebased.
How is the inversion performance looking? I ask because I've implemented an inversion based on the GCD in http://perso.ens-lyon.fr/damien.stehle/downloads/recbinary.pdf (except non-recursive). Currently it does 1 million inversions in ~2.3s, which is at least 10% faster than GMP according to benchmark_internal. There's room for improvement because this is currently in Java, using 32-bit multiplies. If it is proving difficult to improve on GMP's performance using the current approach, perhaps this would be more fruitful. Due to time constraints, I think I would need help with the C conversion and secp256k1 integration though.
@peterdettman I'm surprised that you were able to beat my code (let alone GMP's!!) with binary GCD. Nobody in this space has published benchmarks anywhere, but the papers I've read have consistently implied that Euclid + Lehmer would be a bit faster because it involves fewer iterations. I did some back-of-envelope calculations that were consistent with this. And GMP doesn't use pure Euclid+Lehmer, but a recursive "half GCD" algorithm invented by Niels Möller. It is sorta described here: https://www.jstor.org/stable/40234522 but the actual GMP code is heavily optimized beyond that for their data structures.
@apoelstra I was a bit surprised myself and I suppose it may yet turn out that I've mis-measured something, but I don't believe so. The focus in most GCD papers is on improving the big-O bounds for very large numbers; at small sizes like 256 bits the performance limit of an algorithm often depends on serendipity - whether things work out "cleanly". Ultimately the proof is in the pudding I guess. Still, if you're interested, the best place to start is perhaps to just get you a copy of the Java code and have you confirm the performance comparison?
BTW, it's strictly speaking not binary GCD, although similar. In particular it calculates exact "quotients" from the least significant bits only. Best to read the paper!
Rebased (correctly this time, I think).

@peterdettman Sure, I'm interested. You can email me at apoelstra@wpsoftware.net if this is not public code. Note that I'm also time-constrained, plus I'm much less comfortable in Java than in C. I'll read your paper as soon as I have some cycles to work on this again. @sipa had some ideas about using Jacobi symbols for fast batch inversion, so the priority of this patch might be bumped up again.

For historic curiosity: if I remember correctly, I actually did implement a binary GCD to do some measurements. Because proper bin-GCD requires lots of left-shifting, and I was working with fixed-size buffers, and I didn't really believe in this approach enough to dedicate lots of time to it, I basically did a modular reduction after every left shift (given the functions I had access to already, and C, this was actually easier than using a variable-sized buffer :)). This of course was hundreds, if not thousands, of times slower than a properly written bin-GCD, so I could not benchmark it directly. Instead I added a bunch of trace code and simply printed out some iteration counts to get a feel for "how much work a proper bin-GCD would do". If I remember right, the result was 4-600 iterations for a random 256-bit number, vs 20-30 for Euclid-Lehmer. A bin-GCD iteration is a left-shift-then-subtract (so on 5-word bignums, roughly 10 word operations) while a Euclid-Lehmer iteration is multiplication of a bignum 2-vector by a 2x2 word matrix (roughly 60 operations I think). So my intuition was that bin-GCD was going to be slower pretty much no matter what I did.

Obviously these numbers are full of SWAGs and don't take into account compiler or CPU optimizations. :) And since your paper isn't really binary GCD maybe they're totally inapplicable. But there they are.
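For readers unfamiliar with the iteration being counted above, here is a minimal word-sized sketch of a textbook binary extended GCD inverse (the standard right-shifting form, not the left-shift variant described, and not code from this PR or from the paper); the function name and toy modulus are purely illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* Textbook binary extended GCD inverse on machine words: returns a^{-1} mod m
 * for odd m with gcd(a, m) = 1 and 0 < a < m. Each pass of the outer loop is
 * the kind of shift-then-subtract step whose count is being estimated above;
 * on 5-word bignums every such step costs several word operations. */
static uint32_t inv_mod_binary(uint32_t a, uint32_t m) {
    uint64_t u = a, v = m;
    uint64_t x1 = 1, x2 = 0;      /* invariants: a*x1 == u and a*x2 == v (mod m) */
    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {    /* halve u, keeping the invariant */
            u >>= 1;
            x1 = (x1 & 1) ? (x1 + m) >> 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {    /* halve v, keeping the invariant */
            v >>= 1;
            x2 = (x2 & 1) ? (x2 + m) >> 1 : x2 >> 1;
        }
        if (u >= v) {             /* subtract the smaller from the larger */
            u -= v;
            x1 = (x1 >= x2) ? x1 - x2 : x1 + m - x2;
        } else {
            v -= u;
            x2 = (x2 >= x1) ? x2 - x1 : x2 + m - x1;
        }
    }
    return (uint32_t)(u == 1 ? x1 : x2);
}

int main(void) {
    uint32_t m = 65537, a = 12345;
    uint32_t inv = inv_mod_binary(a, m);
    printf("a * a^-1 mod m = %u\n", (uint32_t)(((uint64_t)a * inv) % m)); /* expect 1 */
    return 0;
}
```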
Yeah, your intuition was right about that class of algorithms (Laszlo Hars has a good paper rounding up the variations: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.384.8124). About 18 months ago I tried implementing a couple of them and could never get better than very roughly half as fast as GMP at 256 bits, maybe worse.

This algorithm is really more comparable to Euclid-Lehmer. It works on the least significant double-words, calculating exact "quotient" matrices (of the Generalised Binary Division, a sort of dual to plain

The current 32-bit implementation almost always handles a 256 bit inverse in 10 iterations (frustratingly close to being 9), and a 64-bit oriented one would comfortably finish in 5 iterations (fewer full-length updates, but the amount of LSB-processing doesn't change much). The raw output (of the modular inversion) can be larger than 256 bits (there's some analysis in the paper of how large the quotients can get, but 512 bits is plenty), and an adjustment is needed at the end to account for extra powers of 2 that are collected in the result (total coda cost is around 2 field multiplies on average, using a tiny precomputed table per prime).

Anyway, I'll tidy things up a little and get a copy to you in the next few days, just for benchmarking in the first instance.
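To illustrate what such an end-of-algorithm adjustment could look like, here is a hedged word-sized toy (this is not @peterdettman's code; the prime, table size, and function names are made up). It only shows the "multiply by a precomputed inverse power of 2" idea: if the main loop leaves r = x^-1 * 2^k (mod p), one multiply by a table entry removes the extra factor.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_SHIFT 16   /* toy bound on how many extra powers of 2 can accumulate */

/* (a * b) mod p on 32-bit values, using a 64-bit intermediate. */
static uint32_t mulmod(uint32_t a, uint32_t b, uint32_t p) {
    return (uint32_t)(((uint64_t)a * b) % p);
}

/* Precompute inv2[k] = (2^k)^{-1} mod p for odd p, using (p+1)/2 = 2^{-1} mod p. */
static void build_inv2_table(uint32_t inv2[MAX_SHIFT + 1], uint32_t p) {
    uint32_t half = (p + 1) / 2;   /* inverse of 2 modulo an odd p */
    inv2[0] = 1;
    for (int k = 1; k <= MAX_SHIFT; k++) inv2[k] = mulmod(inv2[k - 1], half, p);
}

/* Coda: given r = x^{-1} * 2^k mod p, return x^{-1} mod p. */
static uint32_t strip_powers_of_two(uint32_t r, int k, const uint32_t inv2[], uint32_t p) {
    return mulmod(r, inv2[k], p);
}

int main(void) {
    uint32_t p = 65537, inv2[MAX_SHIFT + 1];
    build_inv2_table(inv2, p);
    /* Suppose some inversion routine produced r = 3^{-1} * 2^5 mod p. */
    uint32_t inv3 = 21846;                        /* 3 * 21846 = 65538 = 1 mod 65537 */
    uint32_t r = mulmod(inv3, 32, p);
    printf("%u\n", mulmod(3, strip_powers_of_two(r, 5, inv2, p), p)); /* expect 1 */
    return 0;
}
```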
(Note that the build failures are due to timeouts on the unit tests, which are due to the very bad perf of the current Jacobi symbol implementation vs GMP's.)
How close? -1/F(-x) = 1/F(x). So if having a smaller magnitude input helps, you can shave off a bit that way.
I collected some better stats on the rounds, and it turns out it's about 0.5% taking 9 rounds, 98% taking 10, and 1.5% taking 11. After nine rounds there's on average about 14 bits left, so each round averages just under 27 bits. Saying it's close to 9 was an exaggeration that probably stems from earlier in the process, before I had implemented the "stop heuristics" correctly. Still, a few extra bits in the average case would skew things more towards 9 and away from 11.

@gmaxwell In this context the smaller magnitude doesn't help, although a subtraction can get you a low-order zero, which could - but the quotient calculation is already built around generating low-order zeroes and it's cheaper to just let it do its thing instead of manipulating the full values (apart from the single update per round).
int shift_w, shift_b;
secp256k1_num_word rem_high; /* This slides as rem decreases */
secp256k1_num_word div_high; /* This one stays constant */
secp256k1_num div;
Declaration of `div` shadows a global declaration.
Fails inexplicably, but needs a trivial rebase.
@sipa the inexplicable failures are that my Jacobi implementation is so slow compared to GMP (like 50x; they do some magic I don't yet understand to make Jacobi way faster than gcd, though the code is superficially similar) that it times out. But sure, I'll rebase this.
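For reference, a minimal machine-word sketch of the textbook Jacobi symbol algorithm, just to show the gcd-like shift/swap/reduce structure being referred to; this is not the PR's bignum implementation and not GMP's, and the function name and example values are illustrative only:

```c
#include <stdint.h>
#include <stdio.h>

/* Textbook Jacobi symbol (a/n) for odd n > 0, on machine words. */
static int jacobi(uint64_t a, uint64_t n) {
    int t = 1;
    a %= n;
    while (a != 0) {
        while ((a & 1) == 0) {                           /* pull out factors of 2 */
            a >>= 1;
            if ((n & 7) == 3 || (n & 7) == 5) t = -t;    /* (2/n) = -1 iff n = 3,5 mod 8 */
        }
        { uint64_t tmp = a; a = n; n = tmp; }            /* quadratic reciprocity swap */
        if ((a & 3) == 3 && (n & 3) == 3) t = -t;
        a %= n;
    }
    return n == 1 ? t : 0;                               /* 0 means gcd(a, n) != 1 */
}

int main(void) {
    printf("(3/7) = %d\n", jacobi(3, 7));   /* expect -1 */
    printf("(2/7) = %d\n", jacobi(2, 7));   /* expect  1 */
    printf("(5/9) = %d\n", jacobi(5, 9));   /* expect  1 */
    return 0;
}
```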
I'd like to see this in, even if inefficient. Both the Jacobi construct and the ability to have a num interface without GMP.
Is it OK if I add an

Also I'll need to re-run kcov on this; I think I'm missing coverage for some error cases and I don't want to hurt @gmaxwell's numbers :)
How about just dropping the iteration count by a factor of 100 unconditionally? :)

And yes, not hurting the coverage numbers would be nice!
This function isn't used anywhere and will cause test failures if we implement the full num.h API for a fixed-width 256-bit numeric type. We lose the unit test for secp256k1_scalar_mul_shift_var; we compensate by improving the unit test for secp256k1_scalar_split_lambda (which is the only user of this function) to test that the algebraic relation `N = s_lam * lambda + s_1` actually holds for the lambda decomposition.
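For illustration only, the strengthened check could look roughly like the sketch below. This is a hedged sketch, not the code from this commit: it assumes the library's internal scalar API (secp256k1_scalar_split_lambda, secp256k1_scalar_mul, secp256k1_scalar_add, secp256k1_scalar_eq, the tests' CHECK macro, and a lambda constant here called secp256k1_const_lambda), whose exact names differ between versions, and it is not standalone-compilable outside the test harness.

```c
/* Hedged sketch: after splitting s into (s1, s2), verify the algebraic
 * relation s == s1 + lambda * s2 (mod n) actually holds. Assumes the
 * internal scalar API; names such as secp256k1_const_lambda may differ. */
static void check_lambda_split_relation(const secp256k1_scalar *s) {
    secp256k1_scalar s1, s2, t;
    secp256k1_scalar_split_lambda(&s1, &s2, s);     /* s = s1 + lambda*s2 (mod n) */
    secp256k1_scalar_mul(&t, &secp256k1_const_lambda, &s2);
    secp256k1_scalar_add(&t, &t, &s1);
    CHECK(secp256k1_scalar_eq(&t, s));              /* the relation must hold */
}
```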
This num.h implementation works using fixed-size arrays large enough to hold a 256-bit number (plus one word for slop). It includes a modular inversion. Typical perf numbers on my 64-bit system are:

scalar_inverse:
  constant time: min 13.4us / avg 13.5us / max 13.8us
  native num.h:  min 5.18us / avg 4.55us / max 5.43us
  gmp num.h:     min 2.65us / avg 2.68us / max 2.70us

field_inverse:
  constant time: min 6.02us / avg 6.09us / max 6.15us
  native num.h:  min 5.48us / avg 4.94us / max 5.68us
  gmp num.h:     min 2.96us / avg 3.02us / max 3.09us
Rebased, and also:
Also add native Jacobi symbol test (Andrew)

Rebased-by: Andrew Poelstra
…rd" case Originally we computed q by dividing the top words of the dividend and divisor, then corrected it by at most 2. However, the ratio of the top words can only ever be 1, so rather than having three cases (original correct; need to subtract 1; need to subtract 2), we actually only have two (1 is correct; 0 is correct). This simplifies the algorithm a little bit and gets us to full test coverage, which we didn't have before since the "correct by 2" case was impossible to hit.
Still way too slow for tests.
Any interest in picking this up in the near future?
Any progress on this?
Also, is there any progress on @peterdettman's approach? @peterdettman Are there newer developments? Are you still interested in this? Any code, even in Java, would be useful.
Just to prevent duplication of work, I want to mention that in light of these results #667 (comment) And potentially a faster algorithm for calculating the modulo (i.e. without doing actual division). If I'm able to understand and implement it correctly, I hope it will result in faster Jacobi symbol calculation :) (or at least more comparable to GMP)

Edit:
@apoelstra did you actually implement Schönhage's algorithm in the end? Because you mentioned it, but I can't see it in the code, only Lehmer's algorithm. (It's fine if you don't remember, it was 4 years ago lol)
Related, someone who doesn't find C++ numeric code unreadable might find this codebase interesting: https://github.com/JeanLucPons/VanitySearch/blob/master/IntMod.cpp#L110
@elichai as I recall, I got as far as searching flights to Sweden to visit Schönhage and ask him how his algorithm works, because I could not make sense of the paper :). I have some more experience now and maybe could get further if I tried again. So no, I did not implement it.
I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.
I'm all in favor of optimizing implementations but even I didn't expect #831 to come with such large CO2 savings.
Implement num.h API without GMP backend; benchmarks show modular inversion taking about 65% longer than with GMP, but much faster than the constant-time version.

I will update this PR with an implementation of the Jacobi symbol computation; for now I'm pushing it in the hopes that somebody will see some more perf improvements that will let us catch up with GMP. (In fact, since we don't require arbitrary-size numbers or allocation, we should in principle be able to beat them.) This has a modular inverse but no Jacobi symbol.
This is required for #262 because we do not want to require a GMP dependency.
Edit: I have a paper by the author of the GMP code which describes (mostly) what they are doing. With this I am able to read the GMP code; I think I see how to optimize it for our special case. So we should hold off any detailed review until I've written new gcd code.