Implement performance and memory optimizations for halo2_proofs #548
Comments
In terms of |
I analysed the functions efficiency of |
FFT Optimization

implement 2-layer butterfly arithmetic
This allows us to go bottom-up by n / 2, but it is hard to implement recursively.

change to par_iter when n is large enough
We need to run experiments on the relation between the parallelization overhead and the degree.

parallelize the last butterfly arithmetic
It looks easy but becomes complicated.

ifft divisor
Perform the multiplication in the last round of butterfly arithmetic to avoid an extra iteration. |
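To make the "ifft divisor" item concrete, here is a minimal sketch of an iterative radix-2 transform in which the 1/n scaling of the inverse is folded into the final butterfly round instead of a separate pass. It assumes a toy prime field F_17 on plain `u64` values rather than the halo2 field types, and every name in it (`P`, `pow_mod`, `ntt`, `forward`, `inverse`) is hypothetical, not part of halo2_proofs.

```rust
// Toy prime field F_17 (assumption for illustration only).
const P: u64 = 17;

/// Square-and-multiply exponentiation modulo P.
fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Reorders the slice into bit-reversed index order (length must be >= 2).
fn bit_reverse_permute(a: &mut [u64]) {
    let n = a.len();
    let log_n = n.trailing_zeros();
    for i in 0..n {
        let j = i.reverse_bits() >> (usize::BITS - log_n);
        if i < j {
            a.swap(i, j);
        }
    }
}

/// In-place iterative radix-2 transform. `omega` must be a primitive n-th
/// root of unity. `last_round_scale` is multiplied into every output during
/// the final butterfly round, so no extra pass over the data is needed.
fn ntt(a: &mut [u64], omega: u64, last_round_scale: u64) {
    let n = a.len();
    assert!(n.is_power_of_two() && n >= 2);
    bit_reverse_permute(a);
    let mut len = 2;
    while len <= n {
        let w_len = pow_mod(omega, (n / len) as u64);
        // A tuned version would specialize the last round instead of
        // multiplying by 1 in the earlier ones.
        let scale = if len == n { last_round_scale } else { 1 };
        for chunk in a.chunks_mut(len) {
            let (lo, hi) = chunk.split_at_mut(len / 2);
            let mut w = 1;
            for (x, y) in lo.iter_mut().zip(hi.iter_mut()) {
                let t = *y * w % P;
                let u = *x;
                *x = (u + t) % P * scale % P;
                *y = (u + P - t) % P * scale % P;
                w = w * w_len % P;
            }
        }
        len *= 2;
    }
}

/// Forward transform: no scaling.
fn forward(a: &mut [u64], omega: u64) {
    ntt(a, omega, 1)
}

/// Inverse transform: uses omega^-1 and folds n^-1 into the last round.
fn inverse(a: &mut [u64], omega: u64) {
    let n_inv = pow_mod(a.len() as u64, P - 2); // Fermat inverse
    let omega_inv = pow_mod(omega, P - 2);
    ntt(a, omega_inv, n_inv);
}
```

With n = 8 and omega = 2 (which has order 8 modulo 17), calling `forward` and then `inverse` returns the original coefficients.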
I think the following is interesting. Do you have an idea how to replace with? |
Yep, I was thinking about this myself yesterday. The inner product function we use in Halo 2 is precisely the "sum of products" I implemented there, and so I'm currently working to expose this via the |
I do not want to use raw assembly in any of our field implementations if I can avoid it. Instead, we should look at that assembly and figure out why Rust / LLVM is not generating it, and rework the Rust code to get closer. I also want to implement some proper field arithmetic backends in the |
Hi @str4d
This is amazing.
Copy that. |
The interleaved reductions likely won't have any benefit for additions, because the main speedup is due to reduced register pressure by not maintaining double-width field elements, which can be produced by multiplication but not addition. Also, additions don't use Montgomery reduction. The sharing of reductions between accumulated |
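For readers unfamiliar with the "sum of products" being discussed, the sketch below shows only the basic idea of sharing one reduction across accumulated products. It is a simplification under stated assumptions: a toy 31-bit modulus and a double-width `u128` accumulator. It is not the interleaved Montgomery technique from the thread, whose point is precisely to avoid holding double-width values. The function names are hypothetical.

```rust
// Toy modulus 2^31 - 1 (assumption for illustration; inputs must be < P).
const P: u64 = 2_147_483_647;

/// Reduces after every multiplication and after every addition.
fn sum_of_products_naive(a: &[u64], b: &[u64]) -> u64 {
    a.iter()
        .zip(b)
        .fold(0, |acc, (&x, &y)| (acc + x * y % P) % P)
}

/// Accumulates the unreduced double-width products and reduces once.
fn sum_of_products_shared(a: &[u64], b: &[u64]) -> u64 {
    let wide: u128 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| (x as u128) * (y as u128))
        .sum();
    (wide % P as u128) as u64
}
```

Both functions return the same value; the second performs a single reduction regardless of the number of terms. Additions alone would gain nothing from this, which matches the point made above.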
Nice.
Thank you. |
memo: par_iter vs normal |
Parallelize
Find proper task size and methods for core, thread number and memory.

/// multiplication = 3 points, addition = 1 point
enum TaskSize {
    // 3 points or fewer
    Small,
    // 4 to 8 points
    Middle,
    // 9 points or more
    Large,
}

fn task_chunk(task_size: TaskSize) -> u32 {
    let chunk = match task_size {
        TaskSize::Small => 8,
        TaskSize::Middle => 4,
        TaskSize::Large => 2,
    };
    // log_threads() is not defined in this snippet; it is assumed to return
    // log2 of the thread count as a u32, which is 0 on a single thread, so
    // guard against division by zero.
    chunk / log_threads().max(1)
}

Task
Find the most efficient task size, thread count and rayon method, so that we can implement a conditional branch that decides whether to parallelize or not.
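The conditional branch described in the task could look roughly like the sketch below. It uses `std::thread::scope` from the standard library in place of rayon so that it stays self-contained. The cutoff `PARALLEL_THRESHOLD` is a made-up placeholder, since the real turning degree has to come from the benchmarks, and `square_all` is a hypothetical stand-in for a field-arithmetic loop.

```rust
use std::thread;

/// Hypothetical cutoff; the real "turning degree" must be measured.
const PARALLEL_THRESHOLD: usize = 1 << 12;

/// Squares every element modulo `modulus` (values are assumed < 2^32 so the
/// product fits in a u64). Runs sequentially for small inputs and splits the
/// slice across `threads` scoped threads for large ones.
fn square_all(values: &mut [u64], modulus: u64, threads: usize) {
    let n = values.len();
    if n < PARALLEL_THRESHOLD || threads <= 1 {
        // Small task: thread start-up overhead would dominate.
        for v in values.iter_mut() {
            *v = *v * *v % modulus;
        }
        return;
    }
    // Large task: one contiguous chunk per thread.
    let chunk_len = (n + threads - 1) / threads;
    thread::scope(|s| {
        for chunk in values.chunks_mut(chunk_len) {
            s.spawn(move || {
                for v in chunk.iter_mut() {
                    *v = *v * *v % modulus;
                }
            });
        }
    });
}
```

The same shape carries over to rayon by replacing the scoped threads with a parallel iterator; only the threshold check matters for the question asked here.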
Bench
normal vs par_iter vs parallel
one time field arithmetic iteration
Mul
Three Mul
Three Add
Task Chunk
Mac Add
GitHub Actions
Summary
Words
Fact
Variables
The turning degree can be computed approximately
Uncertainty
I think the thread number has something to do with the turning degree, and I would like to test on a machine with more cores.
Environment
Reference |
Hi there. Bench |
The improvement list and status.
|
Consensys/gnark-crypto#249 is a new implementation of batch affine MSM that claims a 10-20% speedup for gnark-crypto's MSM (which was already a good implementation). |
I think we can optimize curve scalar multiplication with NAF at almost zero cost. We convert the field element to a bit array of length 256. This is the And scalar arithmetic implementation. The extra cost is that |
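As a rough illustration of the NAF idea, the sketch below recodes a scalar into non-adjacent form and runs double-and-add over the signed digits. It assumes a `u64` scalar rather than a 256-bit field element, and it uses `i128` addition as a stand-in for the curve group, so `to_naf` and `naf_mul` are hypothetical helpers rather than anything from the codebase.

```rust
/// Returns the non-adjacent form of `scalar`: digits in {-1, 0, 1},
/// least significant first, with no two adjacent non-zero digits.
fn to_naf(scalar: u64) -> Vec<i8> {
    // Widen so that the `k += 1` step below cannot overflow.
    let mut k = scalar as u128;
    let mut digits = Vec::new();
    while k > 0 {
        if k & 1 == 1 {
            // k mod 4 == 1 gives digit 1; k mod 4 == 3 gives digit -1.
            let d: i8 = 2 - (k & 3) as i8;
            if d == 1 {
                k -= 1;
            } else {
                k += 1;
            }
            digits.push(d);
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits
}

/// Double-and-add over NAF digits. Integer addition stands in for the
/// group operation: "double" is `2 * acc`, "negate" is subtraction.
fn naf_mul(point: i128, scalar: u64) -> i128 {
    let mut acc = 0i128;
    for &d in to_naf(scalar).iter().rev() {
        acc *= 2;
        match d {
            1 => acc += point,
            -1 => acc -= point,
            _ => {}
        }
    }
    acc
}
```

On a curve, the subtraction step costs about the same as an addition because negating a point is cheap, which is where the saving from having fewer non-zero digits comes from.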
I think Karatsuba can be used for field arithmetic.
In case of |
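To show what Karatsuba means at the limb level, here is a minimal sketch that multiplies two `u64` values from 32-bit halves with three multiplications instead of four. It is an assumption-laden toy: a real field element has more limbs, and whether the trick pays off there would need benchmarking. `karatsuba_mul` is a hypothetical name.

```rust
/// 64 x 64 -> 128-bit multiplication using one level of Karatsuba on
/// 32-bit halves: three multiplications instead of four.
fn karatsuba_mul(a: u64, b: u64) -> u128 {
    const MASK: u64 = 0xffff_ffff;
    let (a0, a1) = (a & MASK, a >> 32);
    let (b0, b1) = (b & MASK, b >> 32);

    // Low and high partial products; each fits in 64 bits.
    let z0 = (a0 * b0) as u128;
    let z2 = (a1 * b1) as u128;

    // (a0 + a1) and (b0 + b1) can be 33 bits wide, so widen before multiplying.
    let mid = (a0 + a1) as u128 * (b0 + b1) as u128;

    // Cross term a0*b1 + a1*b0, recovered without a fourth multiplication.
    let z1 = mid - z0 - z2;

    (z2 << 64) + (z1 << 32) + z0
}
```

The result matches the native widening product `a as u128 * b as u128` for all inputs.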
We could improve wNAF if we implemented point quadrupling for the curve and multiplication by 4 for the field. |
Precomputing inv * mod may also save 4 multiplications in Montgomery reduction. |
One thing I am curious about is a one-time modular reduction. |
poly::Evaluator
poly::Evaluator #642
poly::Evaluator #644
[...] VerifyingKey into the transcript when batch-verifying proofs (debug formatting takes up a non-trivial amount of time). Cache values in VerifyingKey that can be computed on construction #607
[...] VerifyingKey, and use it in the permutation argument. Cache values in VerifyingKey that can be computed on construction #607
[...] parallelize within the verifier to make it entirely single-threaded, and check the resulting single-proof times. MSM optimizations #608
[...] BatchVerifier object (not the strategy) that pushes single-proof verification with the BatchVerifier strategy onto the global threadpool, then accumulates their resulting MSMs. Rework halo2_proofs::plonk::BatchVerifier #610