
Use Co-Z arithmetic for precomputations #41

Closed

Conversation

peterdettman
Contributor

  • Selected Co-Z formulas from "Scalar Multiplication on Weierstraß Elliptic Curves from Co-Z Arithmetic" (Goundar, Joye, et al.) added as group methods with the new type secp256k1_coz_t.
  • Co-Z methods used for point precomputations in secp256k1_ecmult.
  • WINDOW_A size increased to 6 since the precomputation is much faster per-point.
  • DBLU cost: 3M+4S, ZADDU cost: 5M+2S.
  • From 2.4% to 3.8% faster 'bench' results, depending on configuration.

(UPDATE: Technique explicitly described in https://eprint.iacr.org/2008/051)

It's probably best to read the paper (http://joye.site88.net/papers/GJMRV11regpm.pdf) if you're not familiar with Co-Z arithmetic, but roughly it is based on two observations:

  1. points with equal Z coordinate (Co-Z) can be added very cheaply.
  2. an input to a Co-Z double or add can be updated to have the same Z coordinate as the result, almost for free, allowing "chaining" of operations.
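Both observations can be made concrete with a small sketch. The following Python (an illustrative reimplementation, not the PR's C code; variable names loosely follow the Goundar-Joye formulas) implements ZADDU, the 5M+2S co-Z add-with-update, over the secp256k1 field and checks it against plain affine addition:

```python
# Illustrative sketch of ZADDU (co-Z addition with update) on secp256k1
# (y^2 = x^3 + 7).  NOT the library's C code; for intuition only.
P = 2**256 - 2**32 - 977  # the secp256k1 field prime

def affine_add(p, q):
    """Plain affine addition of two distinct points, for checking."""
    (x1, y1), (x2, y2) = p, q
    lam = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P

def to_affine(pt):
    """Jacobian (X, Y, Z) -> affine (X/Z^2, Y/Z^3)."""
    x, y, z = pt
    zi = pow(z, -1, P)
    return x * zi * zi % P, y * zi * zi * zi % P

def zaddu(p1, p2):
    """Co-Z add: p1 and p2 share the same Z.  Returns (p1 + p2, p1'),
    where p1' is p1 rewritten to share the Z coordinate of the sum."""
    (x1, y1, z), (x2, y2, _z2) = p1, p2   # precondition: _z2 == z
    c = (x1 - x2) * (x1 - x2) % P          # S
    w1 = x1 * c % P                        # M
    w2 = x2 * c % P                        # M
    d = (y1 - y2) * (y1 - y2) % P          # S
    a1 = y1 * (w1 - w2) % P                # M
    x3 = (d - w1 - w2) % P
    y3 = ((y1 - y2) * (w1 - x3) - a1) % P  # M
    z3 = z * (x1 - x2) % P                 # M  -> 5M + 2S total
    return (x3, y3, z3), (w1, a1, z3)

# Demo: give G and 2G the shared Z = 1, then chain as the PR's
# precomputation does.  The well-known secp256k1 G and 2G:
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
G2 = (0xC6047F9441ED7D6D3045406E95C07CD85C778E4B8CEF3CA7ABAC09B95C709EE5,
      0x1AE168FEA63DC339A3C58419466CEAEEF7F632653266D0E1236431A950CFE52A)
g3, g_coz = zaddu((G[0], G[1], 1), (G2[0], G2[1], 1))
assert to_affine(g3) == affine_add(G, G2)   # 3G, computed co-Z
assert to_affine(g_coz) == G                # 1G, now sharing Z with 3G
```

The second assertion is observation (2) in action: the updated input shares the Z of the output at no extra cost, so the pair can be fed straight into the next ZADDU.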

This works out well for generating a table of 1P, (2P), 3P, 5P, etc.: the 2P can be created with the same Z as 1P, then added to 1P, 3P, etc., in each case being updated to have the same Z as the previous output, which is also the next input. The total cost for a table of 8 values is then 3M+4S + 7(5M+2S) = 38M+18S, compared to 3M+4S + 7(12M+4S) = 87M+32S with ordinary Jacobian additions. Alternatively, as this PR does, a table of 16 values can be built for 3M+4S + 15(5M+2S) = 78M+34S, which is still cheaper than the current table generation; the larger window then saves ~6 point additions during the scalar multiplication (72M+24S), for a total saving of 72M+24S + 87M+32S - (78M+34S) = 81M+22S, or ~100M.
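The cost arithmetic above is easy to double-check mechanically. A tiny Python sketch, with costs as (M, S) pairs and the DBLU/ZADDU/Jacobian-add counts quoted in this PR:

```python
# Double-check the table-cost arithmetic quoted above.
# All costs are (M, S) pairs of field multiplies and squarings.
DBLU = (3, 4)    # initial co-Z doubling
ZADDU = (5, 2)   # co-Z add-with-update
JADD = (12, 4)   # ordinary Jacobian addition, for comparison

def table_cost(n_adds, add):
    """One initial doubling plus n_adds chained additions."""
    return (DBLU[0] + n_adds * add[0], DBLU[1] + n_adds * add[1])

coz8 = table_cost(7, ZADDU)     # 8-entry co-Z table
old8 = table_cost(7, JADD)      # 8-entry table, plain Jacobian adds
coz16 = table_cost(15, ZADDU)   # 16-entry co-Z table

# Net saving of the 16-entry scheme: ~6 adds saved in the multiply
# (72M+24S), plus the old 8-entry table cost, minus the new table cost.
saving = (72 + old8[0] - coz16[0], 24 + old8[1] - coz16[1])
```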

Timing results for 'bench' across a variety of configurations (bignum=gmp throughout):

Field  Endo?  Before     After      %Perf
64bit  yes    1m10.623s  1m8.147s   +3.6%
64bit  no     1m35.524s  1m32.660s  +3.1%
32bit  yes    1m43.756s  1m39.973s  +3.8%
32bit  no     2m21.871s  2m17.690s  +3.0%
gmp    yes    1m41.278s  1m37.710s  +3.7%
gmp    no     2m24.717s  2m21.326s  +2.4%

@gmaxwell
Contributor

If this gives a fast point tripling, I seem to recall seeing some work using multi-base number systems, e.g. where a scalar is expressed as sum(x_n·2^n) + sum(y_n·3^n), and there was some fast encoding into that overcomplete basis. The result of doing so tended to make the numbers much more sparse (more zeros) and thus save a lot of additions.

@peterdettman
Contributor Author

@gmaxwell See e.g. https://eprint.iacr.org/2008/285.pdf . It might be worth trying, but it looks like the results would be marginal (vs 5-NAF, or especially the 6-NAF above). Tripling with the Co-Z formulas above is just DBLU/ZADDU, which I would guess corresponds pretty closely to the direct tripling formulas (http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html), although you get 2P as an output too.

@sipa
Contributor

sipa commented Jul 8, 2014

With unconditional 2-3% performance increase, I certainly want this, but I'll need some time to read up on the math :)

@peterdettman
Contributor Author

@sipa Let me know if you have any questions; I think I stuck fairly close to the algorithms as given in the main body of the paper, but perhaps a sign was switched here or there to reduce the number of _negate calls. It wouldn't surprise me if they could yet be improved a little; I was pretty happy with the initial results and didn't spend much time past that.

@peterdettman peterdettman changed the title Use Co-Z arithmetic for WINDOW_A precomputations Use Co-Z arithmetic for precomputations Jul 18, 2014
@peterdettman
Contributor Author

Patch updated to cover G precomputations also, and rebased.

@sipa
Contributor

sipa commented Aug 3, 2014

Some unit tests for verifying co-Z vs normal addition/doubling would be nice; especially one that covers the doubling branch inside secp256k1_coz_zaddu (which currently has no coverage in the tests).

@sipa
Contributor

sipa commented Aug 3, 2014

The code duplication between secp256k1_coz_dblu and dblu_a is painful to see. I see two solutions:

  • Change secp256k1_gej_t to be a secp256k1_coz_t + z coordinate + infinity flag.
  • Add casting from secp256k1_gej_t* to secp256k1_coz_t*

and then instead of secp256k1_coz_dblu taking coz + gej + const gej, have it take coz + coz + z + inf + const gej.

But let's keep that for later.

@peterdettman
Contributor Author

@sipa Given the current restricted usage for the coz stuff (i.e. precomputation of P, 3P, 5P... and P, 2P, 3P... sequences with P != INF), I am thinking it might be simpler for the moment to raise errors if infinities or double-via-add crop up, since they should never happen given the large order of all valid points.

Regarding the code duplication in dblu(_a), I had pretty much the same thoughts on alternatives, but I don't have a strong favourite amongst them (of course the duplication isn't great either). Can we kick that can down the road?

@sipa
Contributor

sipa commented Aug 20, 2014

@peterdettman I don't like relying on relatively complex reasoning to show that a code path shouldn't be reachable, even if it just contains an assert. At least having some basic testing in place showing that it works even when it is reached feels much more reassuring.

@sipa
Contributor

sipa commented Aug 20, 2014

I don't mind having the duplication for now.

@peterdettman
Contributor Author

@sipa OK, I'll add some test coverage in the next few days.

@peterdettman peterdettman force-pushed the coz-arithmetic branch 5 times, most recently from 4d569fe to 272a926 on November 5, 2014
@peterdettman peterdettman force-pushed the coz-arithmetic branch 4 times, most recently from 943bd94 to e84b4d8 on November 15, 2014
@peterdettman peterdettman force-pushed the coz-arithmetic branch 2 times, most recently from 01a56dd to 7bbba4a on November 19, 2014
@sipa
Contributor

sipa commented Dec 4, 2014

@peterdettman The code has been changed quite a bit lately, and you'll have to rebase again. I'm still very interested in this, but can you limit the changes to just what is necessary for runtime performance? Adding complexity to speed up the precomputation (which takes only a few milliseconds here) isn't very interesting.

@peterdettman
Contributor Author

@sipa Yeah, there are a lot of changes, but I'll try to get it rebased. I'll keep in mind the runtime/precomputation distinction, but I've been assuming we could end up using an affine precomputation in secp256k1_ecmult at some point, hence the opts for secp256k1_ecmult_table_precomp_ge_var.

@peterdettman peterdettman force-pushed the coz-arithmetic branch 2 times, most recently from d1256c6 to da142eb on December 12, 2014
@gmaxwell
Contributor

@peterdettman When I looked at your patch again recently, I thought you actually were doing the conversion to affine for the multiply, until Pieter pointed out to me that it was precomp only... so I guess that intent was clear. Have you benchmarked it with that?

I estimate from the mul/sqr counts that gej+ge should be about 30% faster than gej+gej, assuming a square takes 0.7x as long as a mul. With the endomorphism, combined costs in the multiply loop would be gej+ge 10.1 * 16 * 2 + gej+gej 14.8 * 16 * 2 + double 5.8 * 127 = 1533.4, vs 1383.0 using all gej+ge adds, so about 10% fewer weighted field operations in the inner loop. Given that saving the final inversion was about a 3% speedup, the affine conversion could reasonably be expected to be a win, especially with the known z-ratio trick. The affine points also use only two-thirds of the memory, so this potentially allows a larger window.
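The weighted-operation estimate above can be rechecked with a few lines of Python (all figures are the ones quoted in the comment: per-operation costs in multiply-equivalents with S = 0.7M, 127 doublings, and two groups of 16 adds of each kind):

```python
# Recheck the weighted field-op estimate in the comment above.
# Per-operation costs in "multiply units", quoted from the comment.
GEJ_GE, GEJ_GEJ, DOUBLE = 10.1, 14.8, 5.8

# Current split: 16*2 mixed (gej+ge) adds plus 16*2 Jacobian (gej+gej)
# adds, plus 127 doublings.
mixed_split = GEJ_GE * 16 * 2 + GEJ_GEJ * 16 * 2 + DOUBLE * 127

# All 64 adds as mixed gej+ge adds instead.
all_mixed = GEJ_GE * 16 * 4 + DOUBLE * 127

# The claimed ~10% reduction in weighted inner-loop field operations.
reduction = 1 - all_mixed / mixed_split
```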

@peterdettman
Contributor Author

@gmaxwell Yeah, I keep a branch around that uses affine-A in _ecmult(). I measure about a 1.5% improvement for bench_verify, but:

  • That's with the GMP inversion (unless I missed some developments); it will be hard to match its performance.
  • _gej_add_ge_var is still a little suboptimal: the weak-normalisation PR improves it relatively, and I have some code that eliminates two _negates. It's not clear exactly how it will balance out.

Of course, for any sort of batching, it becomes an easy win, even with the exponentiation inversion.

Note that the optimal window size is 6 for co-z alone, but 5 with the inversion. Using affine precomp lowers the per-add cost by 4M+S, but raises the per-precomp cost by the same amount (and it was worse by 2M before I noticed the z-ratios trick). I calculate the theoretical break-even cost of inversion as ~131M (and our multiplication got faster recently :( ).

I tried an alternative scheme where instead of scaling the precomp x,y values by zInv (which costs 3M+S), we just record the zInv. Then a modified gej_add_ge can be used that only needs 1 extra M (i.e. 9M+3S), since we can just scale the gej.z by zInv (for purposes of scaling the "ge" x, y). Yeah, that actually does work, but it still works out slower (the math has it costing just 9M more in total, optimal at w=6).

I also tried caching the z2, z3 values for the precomp points (when using gej_add, of course), since each point is used 2-3 times, but again this was slower in practice (partly due to copying the entire struct when negating the precomp entry, but even after I hacked around that, it seems the larger memory footprint was an issue), although it might bear a second attempt in case I fluffed it somehow.

@sipa
Contributor

sipa commented Dec 14, 2014

I benchmarked this after rebasing on top of #154 (and using its new normalization functions in the coz functions): an extra 3.5%-4% (with GMP enabled).

@sipa
Contributor

sipa commented Dec 14, 2014

Can you add unit tests for this? In test_ge there are already many additions between random and non-random points; adding comparisons with equivalent additions using the co-z functions would be nice.

@peterdettman
Contributor Author

@sipa Just to be clear, I also measure around +3.5%-4% for this PR as is, and then a further ~1.5% if I additionally enable the affine precomp and change (back) to WINDOW_A=5. I'll get onto the tests!

- Selected Co-Z formulas from "Scalar Multiplication on Weierstraß Elliptic Curves from Co-Z Arithmetic" (Goundar, Joye, et al.) added as group methods with the new type secp256k1_coz_t.
- Co-Z methods used for A and G point precomputations.
- WINDOW_A size increased to 6 since the precomputation is much faster per-point.
- DBLU cost: 3M+4S, ZADDU cost: 5M+2S.
- Take advantage of z-ratios from Co-Z to speed up table inversion.
@peterdettman
Contributor Author

I have found a (~7-year-old) paper that actually describes this precomputation technique (including the z-ratios): https://eprint.iacr.org/2008/051 . They develop it slightly further ("Scheme 2") to save 1S per precomputed point, which I've now replicated, but there's enough stacked on this PR for the moment, so I'll leave it for now.

@sipa
Contributor

sipa commented Feb 11, 2015

Closing in favor of #210.
