Use Co-Z arithmetic for precomputations #41
Conversation
If this gives a fast point tripling... I seem to recall seeing some work using multi-base number systems, e.g. where a scalar is expressed as sum(x_n * 2^n) + sum(y_n * 3^n), and there was some fast encoding into that overcomplete basis. The result of doing so tended to render the numbers much more sparse (more zeros) and thus save a lot of additions.
@gmaxwell See e.g. https://eprint.iacr.org/2008/285.pdf . It might be worth trying, but it looks like the results would be marginal (vs 5-NAF or especially 6-NAF above). Tripling with the co-Z formulas above is just dblu/zaddu, which I would guess corresponds pretty closely to the direct tripling formulas (http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html), although you get 2P as an output too.
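(For a concrete illustration of the sparsity, with numbers of my own rather than the paper's: a double-base representation can need fewer nonzero terms than NAF.)

```
314 = 2^8 + 2^6 - 2^3 + 2^1      NAF: 4 nonzero digits
314 = 2^7 + 2*3^4 + 2^3*3        double-base: 3 terms
```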
With an unconditional 2-3% performance increase, I certainly want this, but I'll need some time to read up on the math :)
@sipa Let me know if you have any questions; I think I stuck fairly close to the algorithms as given in the main body of the paper, but perhaps a sign was switched here or there to reduce the number of _negate calls. It wouldn't surprise me if they could yet be improved a little; I was pretty happy with the initial results and didn't spend much time past that.
Patch updated to cover G precomputations also, and rebased. |
Some unit tests for verifying co-Z vs normal addition/doubling would be nice; especially one that covers the doubling branch inside secp256k1_coz_zaddu (which currently has no coverage in the tests).
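Something along these lines, perhaps (a rough sketch only; the secp256k1_coz_* signatures are guesses from this thread, and check_equal_gej is a hypothetical helper standing in for whatever the test suite provides):

```c
/* Sketch: check the co-Z operations against the generic Jacobian ones.
 * secp256k1_coz_dblu is assumed to take coz + gej + const gej, per the
 * discussion above; secp256k1_coz_zaddu_gej is an assumed gej-producing
 * variant of zaddu. Adapt names/signatures to the PR's actual API. */
static void test_coz_vs_gej(const secp256k1_gej_t *a) {
    secp256k1_coz_t p;
    secp256k1_gej_t d, s, ref;

    /* DBLU: d = 2*A, with p = A rewritten onto d's Z. */
    secp256k1_coz_dblu(&p, &d, a);
    secp256k1_gej_double_var(&ref, a);
    check_equal_gej(&d, &ref);

    /* ZADDU: s = d + p = 3*A; compare against the generic addition. */
    secp256k1_coz_zaddu_gej(&s, &p, &d);
    secp256k1_gej_add_var(&ref, &ref, a);
    check_equal_gej(&s, &ref);

    /* The doubling branch inside zaddu additionally needs a case where
     * both inputs represent the same point, e.g. two co-Z copies of A. */
}
```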
The code duplication between secp256k1_coz_dblu and dblu_a is painful to see. I see two solutions:
and then instead of secp256k1_coz_dblu taking coz + gej + const gej, have it take coz + coz + z + inf + const gej. But let's keep that for later.
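One reading of that suggestion, as a declaration only (hypothetical, purely for illustration):

```c
/* Return the doubling as a co-Z pair plus an explicit (z, infinity) pair,
 * rather than a full gej, so both callers can share one implementation. */
static void secp256k1_coz_dblu(secp256k1_coz_t *r, secp256k1_coz_t *r2,
                               secp256k1_fe_t *z, int *infinity,
                               const secp256k1_gej_t *a);
```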
@sipa Given the current restricted usage for the coz stuff (i.e. precomputation of P, 3P, 5P... and P, 2P, 3P... sequences with P != INF), I am thinking it might be simpler for the moment to raise errors if infinities or double-via-add crop up, since they should never happen given the large order of all valid points. Regarding the code duplication in dblu(_a), I had pretty much the same thoughts on alternatives, but I don't have a strong favourite amongst them (of course the duplication isn't great either). Can we kick that can down the road?
@peterdettman I don't like relying on relatively complex reasoning to show that a code path shouldn't be reachable, even if it just contains an assert. At least having some basic testing in place that it works even when that path is hit feels much more reassuring.
I don't mind having the duplication for now.
@sipa OK, I'll add some test coverage in the next few days.
@peterdettman The code has been changed quite a bit lately, and you'll have to rebase again. I'm still very interested in this, but can you limit the changes to just what is necessary for runtime performance? Adding complexity to speed up the precomputation (which takes only a few milliseconds here) isn't very interesting.
@sipa Yeah, there are a lot of changes, but I'll try to get it rebased. I'll keep in mind the runtime/precomputation distinction, but I've been assuming we could end up using an affine precomputation in secp256k1_ecmult at some point, hence the optimizations for secp256k1_ecmult_table_precomp_ge_var.
@peterdettman When I recently looked again at your patch I thought you actually were doing the conversion to affine for the multiply, until Pieter pointed out to me that it was precomp only... so I guess that intent was clear. Have you benchmarked it with that? I estimate from the mul/sqr counts that gej+ge should be about 30% faster than gej+gej, assuming a square takes 0.7x as long as a mul. With endomorphism, combined costs in the multiply loop would be gej+ge 10.1 * 16 * 2 + gej+gej 14.8 * 16 * 2 + double 127 * 5.8 = 1533.4, vs. 1383 using all gej+ge adds... so about 10% fewer weighted field operations in the inner loop. Given that saving the final inversion was about a 3% speedup, the affine approach could reasonably be expected to be a win, especially with the known z-ratio trick. The affine points also use only 2/3rds the memory, so potentially this allows for a larger window.
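For reference, those per-add weights match the documented operation counts of the group functions (gej+gej: 12M+4S, gej+ge: 8M+3S, double: 3M+4S) at S = 0.7M; my reconstruction of the arithmetic:

```
gej+ge:  8M + 3S =  8 + 3*0.7 = 10.1
gej+gej: 12M + 4S = 12 + 4*0.7 = 14.8
double:   3M + 4S =  3 + 4*0.7 =  5.8

mixed tables: 10.1*(16*2) + 14.8*(16*2) + 5.8*127 = 323.2 + 473.6 + 736.6 = 1533.4
all affine:   10.1*(16*4)               + 5.8*127 = 646.4         + 736.6 = 1383.0
```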
@gmaxwell Yeah, I keep a branch around that uses affine-A in _ecmult(). I measure about a 1.5% improvement for bench_verify, but:
Of course, for any sort of batching it becomes an easy win, even with the exponentiation inversion. Note that the optimal window size is 6 for co-Z alone, but 5 with the inversion. Using affine precomp lowers the per-add cost by 4M+S, but raises the per-precomp cost by the same amount (and it was worse by 2M before I noticed the z-ratios trick). I calculate the theoretical break-even cost of inversion as ~131M (and our multiplication got faster recently :( ).

I tried an alternative scheme where instead of scaling the precomp x,y values by zInv (which costs 3M+S), we just record the zInv. Then a modified gej_add_ge can be used that needs only 1 extra M (i.e. 9M+3S), since we can just scale the gej.z by zInv (for purposes of scaling the "ge" x, y). That actually does work, but it still comes out slower (the math has it costing just 9M more in total, optimal at w=6).

I also tried caching the z2, z3 values for the precomp points (when using gej_add, of course), since each point is used 2-3 times, but again this was slower in practice (partly due to copying the entire struct when negating the precomp entry, but even after I hacked around that, it seems the larger memory footprint was an issue). It might bear a second attempt in case I fluffed it somehow.
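A sketch of that modified addition as I understand the description (my reconstruction, not the PR's code; only the idea of scaling a's Z by the stored zInv is taken from the comment above):

```c
/* Each precomp entry keeps its Jacobian (bx, by) plus bzi = 1/bz. Since
 *   (rx,ry,rz) = (ax,ay,az) + (bx,by,1/bzi),
 * the curve isomorphism lets us scale both Z coordinates by bzi:
 *   (rx,ry,rz*bzi) = (ax,ay,az*bzi) + (bx,by,1),
 * i.e. run a plain mixed addition with a's Z replaced by az*bzi everywhere
 * except when forming rz, where keeping the original az cancels the bzi.
 * Cost: 9M+3S, only 1M over gej+ge. Degenerate cases (h == 0), infinity
 * handling, aliasing (r must not alias a) and magnitude/normalization
 * bookkeeping are all elided for brevity. */
static void gej_add_zinv_var(secp256k1_gej_t *r, const secp256k1_gej_t *a,
                             const secp256k1_fe_t *bx, const secp256k1_fe_t *by,
                             const secp256k1_fe_t *bzi) {
    secp256k1_fe_t az, z12, u2, s2, h, i, h2, h3, t, tmp;
    secp256k1_fe_mul(&az, &a->z, bzi);      /* az = a.z/bz: the 1 extra M */
    secp256k1_fe_sqr(&z12, &az);            /* z12 = az^2 */
    secp256k1_fe_mul(&u2, bx, &z12);        /* u2 = bx*az^2 */
    secp256k1_fe_mul(&s2, by, &z12);
    secp256k1_fe_mul(&s2, &s2, &az);        /* s2 = by*az^3 */
    secp256k1_fe_negate(&h, &a->x, 1);
    secp256k1_fe_add(&h, &u2);              /* h = u2 - u1 */
    secp256k1_fe_negate(&i, &a->y, 1);
    secp256k1_fe_add(&i, &s2);              /* i = s2 - s1 */
    secp256k1_fe_mul(&r->z, &a->z, &h);     /* rz = a.z*h: the ORIGINAL z */
    secp256k1_fe_sqr(&h2, &h);
    secp256k1_fe_mul(&h3, &h2, &h);
    secp256k1_fe_mul(&t, &a->x, &h2);       /* t = u1*h^2 */
    secp256k1_fe_sqr(&r->x, &i);
    secp256k1_fe_negate(&tmp, &h3, 1);
    secp256k1_fe_add(&r->x, &tmp);
    secp256k1_fe_negate(&tmp, &t, 1);
    secp256k1_fe_add(&r->x, &tmp);
    secp256k1_fe_add(&r->x, &tmp);          /* rx = i^2 - h^3 - 2*t */
    secp256k1_fe_negate(&r->y, &r->x, 1);
    secp256k1_fe_add(&r->y, &t);            /* t - rx */
    secp256k1_fe_mul(&r->y, &r->y, &i);
    secp256k1_fe_mul(&tmp, &h3, &a->y);
    secp256k1_fe_negate(&tmp, &tmp, 1);
    secp256k1_fe_add(&r->y, &tmp);          /* ry = i*(t - rx) - s1*h^3 */
}
```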
I benchmarked this after rebasing on top of #154 (and using its new normalization functions in the coz functions): an extra 3.5%-4% (with GMP enabled).
Can you add unit tests for this? In |
@sipa Just to be clear, I also measure around +3.5%-4% for this PR as is, and then a further ~1.5% if I additionally enable the affine precomp and change (back) to WINDOW_A=5. I'll get onto the tests!
- Selected Co-Z formulas from "Scalar Multiplication on Weierstraß Elliptic Curves from Co-Z Arithmetic" (Goundar, Joye, et al.) added as group methods with new type secp256k1_coz_t.
- Co-Z methods used for A and G point precomputations.
- WINDOW_A size increased to 6, since the precomputation is much faster per point.
- DBLU cost: 3M+4S; ZADDU cost: 5M+2S.
- Take advantage of z-ratios from Co-Z to speed up table inversion.
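For reference, the ZADDU operation named above runs along these lines: a sketch from the published co-Z formulas, not this PR's exact code (the coz_pt container is hypothetical, and magnitude/normalization bookkeeping is elided):

```c
/* Hypothetical container for the sketch; the PR uses secp256k1_coz_t. */
typedef struct { secp256k1_fe_t x, y; } coz_pt;

/* ZADDU: r = p + q, with p rewritten to share r's Z; both inputs must
 * already share a Z coordinate. zr receives the z-ratio (x1 - x2), i.e.
 * new Z = old Z * zr -- this ratio is what the table-inversion speedup
 * uses. 4M+2S here, +1M if the caller tracks Z explicitly (5M+2S total).
 * r must not alias p or q. */
static void coz_zaddu(coz_pt *r, coz_pt *p, const coz_pt *q, secp256k1_fe_t *zr) {
    secp256k1_fe_t c, w1, w2, a1, dy, t;
    secp256k1_fe_negate(zr, &q->x, 1);
    secp256k1_fe_add(zr, &p->x);             /* zr = x1 - x2 */
    secp256k1_fe_sqr(&c, zr);                /* c = (x1-x2)^2         1S */
    secp256k1_fe_mul(&w1, &p->x, &c);        /* w1 = x1*c             1M */
    secp256k1_fe_mul(&w2, &q->x, &c);        /* w2 = x2*c             1M */
    secp256k1_fe_negate(&dy, &q->y, 1);
    secp256k1_fe_add(&dy, &p->y);            /* dy = y1 - y2 */
    secp256k1_fe_negate(&t, &w2, 1);
    secp256k1_fe_add(&t, &w1);               /* w1 - w2 */
    secp256k1_fe_mul(&a1, &p->y, &t);        /* a1 = y1*(w1-w2)       1M */
    secp256k1_fe_sqr(&r->x, &dy);            /* dy^2                  1S */
    secp256k1_fe_negate(&t, &w1, 1);
    secp256k1_fe_add(&r->x, &t);
    secp256k1_fe_negate(&t, &w2, 1);
    secp256k1_fe_add(&r->x, &t);             /* x3 = dy^2 - w1 - w2 */
    secp256k1_fe_negate(&t, &r->x, 1);
    secp256k1_fe_add(&t, &w1);               /* w1 - x3 */
    secp256k1_fe_mul(&r->y, &dy, &t);        /* dy*(w1-x3)            1M */
    secp256k1_fe_negate(&t, &a1, 1);
    secp256k1_fe_add(&r->y, &t);             /* y3 = dy*(w1-x3) - a1 */
    p->x = w1;                               /* P' = (x1*l^2, y1*l^3), */
    p->y = a1;                               /* l = x1-x2, so same Z as r */
}
```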
I have found a (~7-year-old) paper that actually describes this precomputation technique (including the z-ratios): https://eprint.iacr.org/2008/051 . They develop it slightly further ("Scheme 2") to save 1S per precomp point, which I've now replicated, but there's enough stacked on this PR for the moment, so I'll leave it for now.
Closing in favor of #210.
(UPDATE: Technique explicitly described in https://eprint.iacr.org/2008/051)
It's probably best to have a read of the paper (http://joye.site88.net/papers/GJMRV11regpm.pdf) if not familiar with Co-Z arithmetic, but roughly it's based on the observations that two points sharing the same Z coordinate can be added much more cheaply (ZADDU: 5M+2S, versus 12M+4S for a general Jacobian addition), and that the same addition also yields, essentially for free, a representation of one input rewritten to share the Z coordinate of the output.
This works out well for generating a table of 1P, (2P), 3P, 5P, etc., as 2P can be created with the same Z as 1P, then added to 1P, 3P, etc., with 2P in each case being updated to have the same Z as the previous output, which is also the next input. The total cost for a table of 8 values is then 3M+4S + 7(5M+2S) = 38M+18S, compared to 3M+4S + 7(12M+4S) = 87M+32S. Alternatively, as this PR does, a table of 16 values can be built instead for 3M+4S + 15(5M+2S) = 78M+34S, which is still cheaper than the current table generation; then ~6 point additions are saved during the scalar multiplication (72M+24S), for a total saving of 72M+24S + 87M+32S - (78M+34S) = 81M+22S, or ~100M.
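In code, the table build then looks roughly like this (a sketch reusing the hypothetical coz_pt/coz_zaddu from the earlier comment; coz_dblu and the z-ratio bookkeeping are likewise stand-ins for the PR's actual API):

```c
/* Build 1P, 3P, 5P, ..., (2n-1)P. Cost for n entries:
 * (3M+4S) + (n-1)*(5M+2S); the recorded z-ratios zr[] let the whole
 * table be brought to affine with a single field inversion afterwards. */
static void build_odd_table(coz_pt *tab, secp256k1_fe_t *zr, int n,
                            const secp256k1_gej_t *p) {
    coz_pt d;   /* running 2P, kept on the same Z as the newest entry */
    int i;
    /* Hypothetical DBLU wrapper: d = 2P and tab[0] = P on a common Z;
     * zr[0] receives the ratio new_Z/old_Z. */
    coz_dblu(&tab[0], &d, &zr[0], p);
    for (i = 1; i < n; i++) {
        /* tab[i] = tab[i-1] + d = (2i+1)P; d is rewritten onto the new Z,
         * ready for the next iteration; zr[i] records the z-ratio. */
        coz_zaddu(&tab[i], &d, &tab[i - 1], &zr[i]);
    }
}
```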
Timing results for 'bench' across a variety of configurations (bignum=gmp throughout):