Batch affine msm + generic in msm #261
Conversation
Just FYI, this is the part in our paper where we essentially said "we don't know how to build an efficient + compact scheduler", as the 5% slowdown for small MSMs defeats batch affine (in G1, on Intel -- I'm glad you guys found wider applications!). The issue is that keeping track of which buckets are used costs 1 memory write for each iteration of the scheduler, and that's surprisingly non-trivial (you're storing a bool; we also tried with a uint to avoid the reset, but got similar results). For what it's worth, since you can only have at most 100 bools set to true, another approach is to just keep the ids in the queue and compare against those... in hardware we're doing that, because you can run all the compares in parallel. In software, again, we weren't able to do it fast enough (faster than this current method, or faster than letting the queue grow without memory writes). Anyway, I just wanted to remark, either for you or for anyone else lurking, that there might be an interesting trick here to get an "easy" 5% gain :)
@gbotrel This looks great!

Regarding the TODO on `double()` and `unsafeFromJacExtended()` when `p=0`: normally the point at infinity lies in `Z=0`. In Jacobian coordinates it is any `(t^2:t^3:0)`, so the convention is to take `(1:1:0)` (as we do in gnark-crypto); in projective coordinates it is any `(0:t:0)`, and the convention is to take `(0:1:0)`. In cryptography, since we don't need this point to be on the curve (because we don't use it in formulas), we just check that `Z=0`. Some references take `(0:1:0)` for both coordinate systems.

Because of this, the formulas in `double()` always output `Z=0` when fed a point at infinity. Same for `unsafeFromJacExtended()` (we can even rename it `FromJacExtended()` since the result is always `(0:0:0)`).
Derived from #249 (from @0x0ece). Strategy is similar, but implementation details are a bit different.

This PR:
- defines the buckets as concrete array types (e.g. `type bucketG1AffineC4 [1 << (4 - 1)]G1Affine`), and the `innerMSM` functions are parametrized with that type. This allows the buckets to be allocated on the stack, which is critical for perf.
- `partitionScalars` returns a list of digits as a `[]uint16` slice (instead of being packed into field elements).
- `fillBenchScalars` was not uniformly distributed; fixed.
- uses `fr.Bits` instead of `fr.Limbs * 64`.

TODO:
- remove `partitionScalarsOld`; will create a separate issue.
- `batchAffineAdd` methods

Some remarks on the msm-affine
Roughly speaking, the idea is, as in the previous bucket method, to process "chunks" of the scalars (think: columns) with a c-bit window size.
We can do an efficient `batchAddition` (as in: compute n point-to-point additions, NOT the sum of n points) in affine coordinates, but the n additions must be independent. In our case, since we are adding points from the input vector into a smaller set of buckets, all we need to ensure is that during a single call to `batchAddition` we don't add twice to the same bucket. The larger the batch size, the fewer (costly) inversions we do, but we potentially put more memory/cache pressure on (if it's too large) and, most importantly, increase the chance of conflicting additions (2 points going to the same bucket).
The idea is that if we consider uniformly random scalars and take, say, a batchSize of 100 with 32000 buckets, the probability of hitting the same bucket twice in a "batch window" of 100 is very low. If that happens, we append the conflicting point to a queue and try to reprocess it later.
new: with our current parameters, the queue should stay mostly empty, and if it becomes full, we are hitting an input vector that's unfriendly to the msm-affine. This can happen in a SNARK context, for example, if many of the inputs have the same value, or if we keep finding m consecutive identical values, with m roughly the same order as the batchSize. This would force us to process batch additions of very small (not full) sizes and make the algorithm perform terribly. To deal with that and other edge cases, when the queue is full we use another set of buckets, in extended Jacobian coordinates, to flush the queue. In practice (for uniformly distributed points), the slowdown is ~5%, but it's worth it to avoid too many code paths / edge cases.
Benchmarks

On AWS hpc6a.48xlarge. `develop` branch against `feat/msm-affine` (both generate uniformly distributed scalars).

without split logic (we only use as many cores as nbChunks)
TL;DR: from 30 to 60% speedup 😲. Need to benchmark on a low-cost device.
bls12-377
bls12-381
bn254
bw6-761
with split logic (more cores == we split the msm)
TL;DR: the advantage is good for most sizes (10% to 50% perf gain), and decreases with large MSMs, probably because we now stop at c=16. Some small sizes on G2 show a significant decrease; need to tune the batchSize / choice of c for those.
bls12-377
bls12-378
bls12-381
bn254
bw6-633
bw6-761