Use Co-Z arithmetic for precomputations #41
Conversation
If this gives a fast point tripling... I seem to recall seeing some work using multi-base number systems, e.g. where a scalar is expressed as sum(x_n * 2^n) + sum(y_n * 3^n), and there was some fast encoding into that overcomplete basis. The result of doing so tended to render the numbers much more sparse (more zeros) and thus save a lot of additions.
@gmaxwell See e.g. https://eprint.iacr.org/2008/285.pdf . It might be worth trying, but it looks like the results would be marginal (vs 5-NAF or especially 6-NAF above). Tripling with the co-Z formulas above is just dblu/zaddu, which I would guess corresponds pretty closely to the direct tripling formulas (http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html), although you get 2P as an output too.
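(For a concrete illustration of the sparsity, with numbers of my own rather than the paper's: a double-base representation can need fewer nonzero terms than NAF.)

```
314 = 2^8 + 2^6 - 2^3 + 2^1      NAF: 4 nonzero digits
314 = 2^7 + 2*3^4 + 2^3*3        double-base: 3 terms
```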
With an unconditional 2-3% performance increase, I certainly want this, but I'll need some time to read up on the math :)
@sipa Let me know if you have any questions; I think I stuck fairly close to the algorithms as given in the main body of the paper, but perhaps a sign was switched here or there to reduce the number of _negate calls. It wouldn't surprise me if they could yet be improved a little; I was pretty happy with the initial results and didn't spend much time past that.
Patch updated to cover G precomputations also, and rebased. |
Some unit tests for verifying co-Z vs normal addition/doubling would be nice; especially one that covers the doubling branch inside secp256k1_coz_zaddu (which currently has no coverage in the tests).
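Something along these lines, perhaps (a rough sketch only; the secp256k1_coz_* signatures are guesses from this thread, and check_equal_gej is a hypothetical helper standing in for whatever the test suite provides):

```c
/* Sketch: check the co-Z operations against the generic Jacobian ones.
 * secp256k1_coz_dblu is assumed to take coz + gej + const gej, per the
 * discussion above; secp256k1_coz_zaddu_gej is an assumed gej-producing
 * variant of zaddu. Adapt names/signatures to the PR's actual API. */
static void test_coz_vs_gej(const secp256k1_gej_t *a) {
    secp256k1_coz_t p;
    secp256k1_gej_t d, s, ref;

    /* DBLU: d = 2*A, with p = A rewritten onto d's Z. */
    secp256k1_coz_dblu(&p, &d, a);
    secp256k1_gej_double_var(&ref, a);
    check_equal_gej(&d, &ref);

    /* ZADDU: s = d + p = 3*A; compare against the generic addition. */
    secp256k1_coz_zaddu_gej(&s, &p, &d);
    secp256k1_gej_add_var(&ref, &ref, a);
    check_equal_gej(&s, &ref);

    /* The doubling branch inside zaddu additionally needs a case where
     * both inputs represent the same point, e.g. two co-Z copies of A. */
}
```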
The code duplication between secp256k1_coz_dblu and dblu_a is painful to see. I see two solutions:
and then instead of secp256k1_coz_dblu taking coz + gej + const gej, have it take coz + coz + z + inf + const gej. But let's keep that for later.
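One reading of that suggestion, as a declaration only (hypothetical, purely for illustration):

```c
/* Return the doubling as a co-Z pair plus an explicit (z, infinity) pair,
 * rather than a full gej, so both callers can share one implementation. */
static void secp256k1_coz_dblu(secp256k1_coz_t *r, secp256k1_coz_t *r2,
                               secp256k1_fe_t *z, int *infinity,
                               const secp256k1_gej_t *a);
```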
@sipa Given the current restricted usage for the coz stuff (i.e. precomputation of P, 3P, 5P... and P, 2P, 3P... sequences with P != INF), I am thinking it might be simpler for the moment to raise errors if infinities or double-via-add crop up, since they should never happen given the large order of all valid points. Regarding the code duplication in dblu(_a), I had pretty much the same thoughts on alternatives, but I don't have a strong favourite amongst them (of course the duplication isn't great either). Can we kick that can down the road?
@peterdettman I don't like relying on relatively complex reasoning to show that a code path shouldn't be reachable, even if it just contains an assert. At least having some basic testing in place that it works even when that path is hit feels much more reassuring.
I don't mind having the duplication for now.
@sipa OK, I'll add some test coverage in the next few days.
@peterdettman The code has been changed quite a bit lately, and you'll have to rebase again. I'm still very interested in this, but can you limit the changes to just what is necessary for runtime performance? Adding complexity to speed up the precomputation (which takes only a few milliseconds here) isn't very interesting.
@sipa Yeah, there are a lot of changes, but I'll try to get it rebased. I'll keep in mind the runtime/precomputation distinction, but I've been assuming we could end up using an affine precomputation in secp256k1_ecmult at some point, hence the optimizations for secp256k1_ecmult_table_precomp_ge_var.
@peterdettman When I recently looked again at your patch I thought you actually were doing the conversion to affine for the multiply, until Pieter pointed out to me that it was precomp only... so I guess that intent was clear. Have you benchmarked it with that? I estimate from the mul/sqr counts that gej+ge should be about 30% faster than gej+gej, assuming a square takes 0.7x as long as a mul. With endomorphism, combined costs in the multiply loop would be gej+ge 10.1 * 16 * 2 + gej+gej 14.8 * 16 * 2 + double 127 * 5.8 = 1533.4, vs. 1383 using all gej+ge adds... so about 10% fewer weighted field operations in the inner loop. Given that saving the final inversion was about a 3% speedup, the affine approach could reasonably be expected to be a win, especially with the known z-ratio trick. The affine points also use only 2/3rds the memory, so potentially this allows for a larger window.
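For reference, those per-add weights match the documented operation counts of the group functions (gej+gej: 12M+4S, gej+ge: 8M+3S, double: 3M+4S) at S = 0.7M; my reconstruction of the arithmetic:

```
gej+ge:  8M + 3S =  8 + 3*0.7 = 10.1
gej+gej: 12M + 4S = 12 + 4*0.7 = 14.8
double:   3M + 4S =  3 + 4*0.7 =  5.8

mixed tables: 10.1*(16*2) + 14.8*(16*2) + 5.8*127 = 323.2 + 473.6 + 736.6 = 1533.4
all affine:   10.1*(16*4)               + 5.8*127 = 646.4         + 736.6 = 1383.0
```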
@gmaxwell Yeah, I keep a branch around that uses affine-A in _ecmult(). I measure about a 1.5% improvement for bench_verify, but:
Of course, for any sort of batching it becomes an easy win, even with the exponentiation inversion. Note that the optimal window size is 6 for co-Z alone, but 5 with the inversion. Using affine precomp lowers the per-add cost by 4M+S, but raises the per-precomp cost by the same amount (and it was worse by 2M before I noticed the z-ratios trick). I calculate the theoretical break-even cost of inversion as ~131M (and our multiplication got faster recently :( ).

I tried an alternative scheme where instead of scaling the precomp x,y values by zInv (which costs 3M+S), we just record the zInv. Then a modified gej_add_ge can be used that needs only 1 extra M (i.e. 9M+3S), since we can just scale the gej.z by zInv (for purposes of scaling the "ge" x, y). That actually does work, but it still comes out slower (the math has it costing just 9M more in total, optimal at w=6).

I also tried caching the z2, z3 values for the precomp points (when using gej_add, of course), since each point is used 2-3 times, but again this was slower in practice (partly due to copying the entire struct when negating the precomp entry, but even after I hacked around that, it seems the larger memory footprint was an issue). It might bear a second attempt in case I fluffed it somehow.
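A sketch of that modified addition as I understand the description (my reconstruction, not the PR's code; only the idea of scaling a's Z by the stored zInv is taken from the comment above):

```c
/* Each precomp entry keeps its Jacobian (bx, by) plus bzi = 1/bz. Since
 *   (rx,ry,rz) = (ax,ay,az) + (bx,by,1/bzi),
 * the curve isomorphism lets us scale both Z coordinates by bzi:
 *   (rx,ry,rz*bzi) = (ax,ay,az*bzi) + (bx,by,1),
 * i.e. run a plain mixed addition with a's Z replaced by az*bzi everywhere
 * except when forming rz, where keeping the original az cancels the bzi.
 * Cost: 9M+3S, only 1M over gej+ge. Degenerate cases (h == 0), infinity
 * handling, aliasing (r must not alias a) and magnitude/normalization
 * bookkeeping are all elided for brevity. */
static void gej_add_zinv_var(secp256k1_gej_t *r, const secp256k1_gej_t *a,
                             const secp256k1_fe_t *bx, const secp256k1_fe_t *by,
                             const secp256k1_fe_t *bzi) {
    secp256k1_fe_t az, z12, u2, s2, h, i, h2, h3, t, tmp;
    secp256k1_fe_mul(&az, &a->z, bzi);      /* az = a.z/bz: the 1 extra M */
    secp256k1_fe_sqr(&z12, &az);            /* z12 = az^2 */
    secp256k1_fe_mul(&u2, bx, &z12);        /* u2 = bx*az^2 */
    secp256k1_fe_mul(&s2, by, &z12);
    secp256k1_fe_mul(&s2, &s2, &az);        /* s2 = by*az^3 */
    secp256k1_fe_negate(&h, &a->x, 1);
    secp256k1_fe_add(&h, &u2);              /* h = u2 - u1 */
    secp256k1_fe_negate(&i, &a->y, 1);
    secp256k1_fe_add(&i, &s2);              /* i = s2 - s1 */
    secp256k1_fe_mul(&r->z, &a->z, &h);     /* rz = a.z*h: the ORIGINAL z */
    secp256k1_fe_sqr(&h2, &h);
    secp256k1_fe_mul(&h3, &h2, &h);
    secp256k1_fe_mul(&t, &a->x, &h2);       /* t = u1*h^2 */
    secp256k1_fe_sqr(&r->x, &i);
    secp256k1_fe_negate(&tmp, &h3, 1);
    secp256k1_fe_add(&r->x, &tmp);
    secp256k1_fe_negate(&tmp, &t, 1);
    secp256k1_fe_add(&r->x, &tmp);
    secp256k1_fe_add(&r->x, &tmp);          /* rx = i^2 - h^3 - 2*t */
    secp256k1_fe_negate(&r->y, &r->x, 1);
    secp256k1_fe_add(&r->y, &t);            /* t - rx */
    secp256k1_fe_mul(&r->y, &r->y, &i);
    secp256k1_fe_mul(&tmp, &h3, &a->y);
    secp256k1_fe_negate(&tmp, &tmp, 1);
    secp256k1_fe_add(&r->y, &tmp);          /* ry = i*(t - rx) - s1*h^3 */
}
```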
I benchmarked this after rebasing on top of #154 (and using its new normalization functions in the coz functions): an extra 3.5%-4% (with GMP enabled).
Can you add unit tests for this? In |
@sipa Just to be clear, I also measure around +3.5%-4% for this PR as is, and then a further ~1.5% if I additionally enable the affine precomp and change (back) to WINDOW_A=5. I'll get onto the tests!
- Selected Co-Z formulas from "Scalar Multiplication on Weierstraß Elliptic Curves from Co-Z Arithmetic" (Goundar, Joye, et al.) added as group methods with new type secp256k1_coz_t.
- Co-Z methods used for A and G point precomputations.
- WINDOW_A size increased to 6, since the precomputation is much faster per point.
- DBLU cost: 3M+4S; ZADDU cost: 5M+2S.
- Take advantage of z-ratios from Co-Z to speed up table inversion.
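For reference, the ZADDU operation named above runs along these lines: a sketch from the published co-Z formulas, not this PR's exact code (the coz_pt container is hypothetical, and magnitude/normalization bookkeeping is elided):

```c
/* Hypothetical container for the sketch; the PR uses secp256k1_coz_t. */
typedef struct { secp256k1_fe_t x, y; } coz_pt;

/* ZADDU: r = p + q, with p rewritten to share r's Z; both inputs must
 * already share a Z coordinate. zr receives the z-ratio (x1 - x2), i.e.
 * new Z = old Z * zr -- this ratio is what the table-inversion speedup
 * uses. 4M+2S here, +1M if the caller tracks Z explicitly (5M+2S total).
 * r must not alias p or q. */
static void coz_zaddu(coz_pt *r, coz_pt *p, const coz_pt *q, secp256k1_fe_t *zr) {
    secp256k1_fe_t c, w1, w2, a1, dy, t;
    secp256k1_fe_negate(zr, &q->x, 1);
    secp256k1_fe_add(zr, &p->x);             /* zr = x1 - x2 */
    secp256k1_fe_sqr(&c, zr);                /* c = (x1-x2)^2         1S */
    secp256k1_fe_mul(&w1, &p->x, &c);        /* w1 = x1*c             1M */
    secp256k1_fe_mul(&w2, &q->x, &c);        /* w2 = x2*c             1M */
    secp256k1_fe_negate(&dy, &q->y, 1);
    secp256k1_fe_add(&dy, &p->y);            /* dy = y1 - y2 */
    secp256k1_fe_negate(&t, &w2, 1);
    secp256k1_fe_add(&t, &w1);               /* w1 - w2 */
    secp256k1_fe_mul(&a1, &p->y, &t);        /* a1 = y1*(w1-w2)       1M */
    secp256k1_fe_sqr(&r->x, &dy);            /* dy^2                  1S */
    secp256k1_fe_negate(&t, &w1, 1);
    secp256k1_fe_add(&r->x, &t);
    secp256k1_fe_negate(&t, &w2, 1);
    secp256k1_fe_add(&r->x, &t);             /* x3 = dy^2 - w1 - w2 */
    secp256k1_fe_negate(&t, &r->x, 1);
    secp256k1_fe_add(&t, &w1);               /* w1 - x3 */
    secp256k1_fe_mul(&r->y, &dy, &t);        /* dy*(w1-x3)            1M */
    secp256k1_fe_negate(&t, &a1, 1);
    secp256k1_fe_add(&r->y, &t);             /* y3 = dy*(w1-x3) - a1 */
    p->x = w1;                               /* P' = (x1*l^2, y1*l^3), */
    p->y = a1;                               /* l = x1-x2, so same Z as r */
}
```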
I have found a (~7-year-old) paper that actually describes this precomputation technique (including the z-ratios): https://eprint.iacr.org/2008/051 . They develop it slightly further ("Scheme 2") to save 1S per precomp point, which I've now replicated, but there's enough stacked on this PR for the moment, so I'll leave it for now.
Closing in favor of #210.
(UPDATE: Technique explicitly described in https://eprint.iacr.org/2008/051)
It's probably best to have a read of the paper (http://joye.site88.net/papers/GJMRV11regpm.pdf) if not familiar with Co-Z arithmetic, but roughly it's based on the observations that two points sharing the same Z coordinate can be added much more cheaply (ZADDU: 5M+2S, versus 12M+4S for a general Jacobian addition), and that the same addition also yields, essentially for free, a representation of one input rewritten to share the Z coordinate of the output.
This works out well for generating a table of 1P, (2P), 3P, 5P, etc., as 2P can be created with the same Z as 1P, then added to 1P, 3P, etc., with 2P in each case being updated to have the same Z as the previous output, which is also the next input. The total cost for a table of 8 values is then 3M+4S + 7(5M+2S) = 38M+18S, compared to 3M+4S + 7(12M+4S) = 87M+32S. Alternatively, as this PR does, a table of 16 values can be built instead for 3M+4S + 15(5M+2S) = 78M+34S, which is still cheaper than the current table generation; then ~6 point additions are saved during the scalar multiplication (72M+24S), for a total saving of 72M+24S + 87M+32S - (78M+34S) = 81M+22S, or ~100M.
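In code, the table build then looks roughly like this (a sketch reusing the hypothetical coz_pt/coz_zaddu from the earlier comment; coz_dblu and the z-ratio bookkeeping are likewise stand-ins for the PR's actual API):

```c
/* Build 1P, 3P, 5P, ..., (2n-1)P. Cost for n entries:
 * (3M+4S) + (n-1)*(5M+2S); the recorded z-ratios zr[] let the whole
 * table be brought to affine with a single field inversion afterwards. */
static void build_odd_table(coz_pt *tab, secp256k1_fe_t *zr, int n,
                            const secp256k1_gej_t *p) {
    coz_pt d;   /* running 2P, kept on the same Z as the newest entry */
    int i;
    /* Hypothetical DBLU wrapper: d = 2P and tab[0] = P on a common Z;
     * zr[0] receives the ratio new_Z/old_Z. */
    coz_dblu(&tab[0], &d, &zr[0], p);
    for (i = 1; i < n; i++) {
        /* tab[i] = tab[i-1] + d = (2i+1)P; d is rewritten onto the new Z,
         * ready for the next iteration; zr[i] records the z-ratio. */
        coz_zaddu(&tab[i], &d, &tab[i - 1], &zr[i]);
    }
}
```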
Timing results for 'bench' across a variety of configurations (bignum=gmp throughout):