fft gpu optimization #79
Sorry @ed255, I assigned us both but the PR is Will mark it as draft and let's wait until it is ready for review to check it.
halo2_proofs/Cargo.toml
@@ -62,11 +78,12 @@ rand_core = { version = "0.6", default-features = false, features = ["getrandom"
 getrandom = { version = "0.2", features = ["js"] }

 [features]
-default = ["shplonk"]
+default = ["shplonk", "gpu"]
I'm not sure about adding this as default now.
We will need some time to update the CI infra to support GPU usage.
cc: @AronisAt79 @ntampakas @barryWhiteHat
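One way to keep GPU support opt-in is to leave `gpu` out of `default` and gate the GPU code path on the feature. The following is only a minimal sketch of that pattern, not the PR's code: `fft_backend` is a hypothetical helper, and the only name taken from the diff is the `gpu` feature itself.

```rust
/// Hypothetical helper: reports which FFT backend was compiled in.
/// Built only when the crate is compiled with `--features gpu`.
#[cfg(feature = "gpu")]
pub fn fft_backend() -> &'static str {
    "gpu"
}

/// Fallback used by any build that does not enable the `gpu` feature,
/// which would include a CI run on machines without a GPU.
#[cfg(not(feature = "gpu"))]
pub fn fft_backend() -> &'static str {
    "cpu"
}
```

With `default = ["shplonk"]`, a plain `cargo test` would exercise only the CPU path, and GPU machines could opt in with `cargo test --features gpu`.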
yeah, thanks very much.
> I'm not sure about adding this as default now. We will need some time to update the CI infra to support GPU usage. cc: @AronisAt79 @ntampakas @barryWhiteHat
Hi @CPerezz.
May I ask how long it will take for the CI infra to be ready?
In addition to updating the FFT data, last week we ported the multiexp operation, based on filecoin ec-gpu, as a patch compatible with the pairing library. The performance data is as follows:
running 1 test
Testing Multiexp for 1024 elements...
GPU took 10ms.
CPU took 6ms.
Speedup: x0.6
============================
Testing Multiexp for 2048 elements...
GPU took 8ms.
CPU took 5ms.
Speedup: x0.625
============================
Testing Multiexp for 4096 elements...
GPU took 8ms.
CPU took 6ms.
Speedup: x0.75
============================
Testing Multiexp for 8192 elements...
GPU took 10ms.
CPU took 12ms.
Speedup: x1.2
============================
Testing Multiexp for 16384 elements...
GPU took 12ms.
CPU took 22ms.
Speedup: x1.8333334
============================
Testing Multiexp for 32768 elements...
GPU took 16ms.
CPU took 40ms.
Speedup: x2.5
============================
Testing Multiexp for 65536 elements...
GPU took 25ms.
CPU took 83ms.
Speedup: x3.32
============================
Testing Multiexp for 131072 elements...
GPU took 34ms.
CPU took 169ms.
Speedup: x4.970588
============================
Testing Multiexp for 262144 elements...
GPU took 59ms.
CPU took 287ms.
Speedup: x4.8644066
============================
Testing Multiexp for 524288 elements...
GPU took 91ms.
CPU took 469ms.
Speedup: x5.1538463
============================
Testing Multiexp for 1048576 elements...
GPU took 152ms.
CPU took 864ms.
Speedup: x5.6842103
============================
Testing Multiexp for 2097152 elements...
GPU took 246ms.
CPU took 1707ms.
Speedup: x6.9390244
============================
Testing Multiexp for 4194304 elements...
GPU took 373ms.
CPU took 3431ms.
Speedup: x9.198391
============================
Testing Multiexp for 8388608 elements...
GPU took 574ms.
CPU took 6788ms.
Speedup: x11.825784
============================
Testing Multiexp for 16777216 elements...
GPU took 928ms.
CPU took 14723ms.
Speedup: x15.865302
============================
Testing Multiexp for 33554432 elements...
GPU took 1541ms.
test multiexp::tests::gpu_multiexp_consistency has been running for over 60 seconds
CPU took 32038ms.
Speedup: x20.790396
============================
Testing Multiexp for 67108864 elements...
GPU took 2997ms.
CPU took 65527ms.
Speedup: x21.864197
============================
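To read the log above, it can help to recompute the speedup column and find where the GPU starts to win. The sketch below is illustrative only: it copies the first six rows of the log, and the names `TIMINGS`, `speedup` and `crossover` are hypothetical, not part of the test that produced the output.

```rust
/// (elements, GPU ms, CPU ms), copied from the first six rows of the log.
const TIMINGS: [(u64, u64, u64); 6] = [
    (1024, 10, 6),
    (2048, 8, 5),
    (4096, 8, 6),
    (8192, 10, 12),
    (16384, 12, 22),
    (32768, 16, 40),
];

/// Speedup as printed in the log: CPU time divided by GPU time.
fn speedup(gpu_ms: u64, cpu_ms: u64) -> f64 {
    cpu_ms as f64 / gpu_ms as f64
}

/// First input size in `rows` at which the GPU is faster than the CPU.
fn crossover(rows: &[(u64, u64, u64)]) -> Option<u64> {
    rows.iter()
        .find(|&&(_, gpu, cpu)| speedup(gpu, cpu) > 1.0)
        .map(|&(n, _, _)| n)
}
```

On these rows the crossover is at 8192 elements. Below that size the GPU run is slower than the CPU one.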
Hi @bchyl, some of these GPU times for FFT/MSM look great! For the MSM, however, it seems you'll have to test with a larger number of elements: the current sizes are very small, so you don't yet see the ~linear increase in time you'd expect (for small multiexps other overheads dominate, which makes sense but doesn't really matter).
I do have a question about the CPU times for MSM and FFT. They seem to be extremely slow for some reason, much slower than on a normal desktop CPU, and you're running them on a very powerful machine, so that doesn't make sense. On a standard machine (8 CPU cores) you can do an FFT of 2^20 in roughly 0.25s and an MSM of 2^20 in less than 10 seconds (the exact numbers of course depend on the specific implementation). The numbers make it look like you're running on only a single core or something. Am I misinterpreting the data, or do you think there may be something wrong with the CPU performance numbers?
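A quick way to check for the ~linear scaling mentioned above is to look at how the time changes each time the input doubles. This is a rough sketch using the GPU multiexp timings for 2^20 to 2^26 elements as they appear in the log in this thread; `GPU_MS` and `doubling_ratios` are hypothetical names.

```rust
/// GPU multiexp times in ms for 2^20 ..= 2^26 elements, copied from the log.
const GPU_MS: [f64; 7] = [152.0, 246.0, 373.0, 574.0, 928.0, 1541.0, 2997.0];

/// Ratio of each timing to the previous one. A ratio near 2.0 means the
/// time doubles when the input doubles, i.e. roughly linear scaling.
fn doubling_ratios(times: &[f64]) -> Vec<f64> {
    times.windows(2).map(|w| w[1] / w[0]).collect()
}
```

On these numbers the ratio stays around 1.5 to 1.7 up to 2^25 and only approaches 2 at the last step, which suggests fixed overheads still matter at the smaller sizes.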
Hi @Brechtpd, thank you very much for your 8-core reference performance data.
We have double-checked our CPU data (the test machine has 80 cores) and updated it as mentioned above.
In general, the MSM result is consistent with yours: 2^20 takes less than 1s (864ms). For FFT our result was 0.263s, about the same as your figure of roughly 0.25s, so it looks like the CPU speedup is not significant beyond a certain number of cores.
Great, thanks for the updated numbers! I think the limited scalability beyond a certain number of CPU cores is at least partially caused by the multi-threading approach used by the current CPU FFT/MSM implementations. I'm not sure how big that impact is on a CPU with that many cores, though.
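Amdahl's law gives a rough feel for why extra cores stop helping once part of the work is serial. The sketch below is a generic illustration, not a model of the halo2 implementation: the parallel fraction `p` is an assumed input, not something measured in this thread.

```rust
/// Amdahl's law: ideal speedup on `cores` cores when a fraction `p`
/// (between 0.0 and 1.0) of the work can run in parallel.
fn amdahl_speedup(p: f64, cores: u32) -> f64 {
    1.0 / ((1.0 - p) + p / cores as f64)
}
```

For example, with an assumed `p = 0.9`, 8 cores give a speedup of about 4.7 and 80 cores only about 9.0, so ten times the cores would buy less than twice the speed.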
Why was this closed, @bchyl? I thought we should review it.
Hi @CPerezz, today we are refactoring the code and running many more tests. We expect to reopen the PR in about 2 days, hopefully in time for the new release. Thank you very much.
@@ -31,14 +31,29 @@ harness = false
[dependencies]
backtrace = { version = "0.3", optional = true }
rayon = "1.5.1"
ff = "0.11"
NOTE: this will be restored once Fr/Fq in the pairing repo implement the char method of the prime field trait in ff/group.
Merge path: zkcrypto/ff and group -> pse/pairing -> starslabhq/ff-cl-gen -> pse/gpu fft module -> pse halo2 fft
Using the params coeffs as the measurement input, the latest results are as follows:
For the much better-performing version based on filecoin ec-gpu:
Hi @Brechtpd, we have just enabled the asm version of the basic field operations in pairing_bn256. According to the following performance results, this gives about a 20%-30% performance improvement for both CPU and GPU:
msm:
I think at some point we need to start investigating GPU acceleration, and this thread already contains some useful measurements. Perhaps we can keep it until we really have a replacement?
The GPU acceleration efforts can now be targeted through ZAL (see #277)
We reuse Bellperson and change its curve from bls12-381 to bn254.
Below is the FFT benchmark data, measured on a machine with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz (80 cores), 35G of memory, and 4 T4 GPUs. When the degree is above 2^19, the data show that the GPU has an increasing performance advantage.
Could someone help me check whether this optimization works? Thank you very much.
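Since the GPU only pays off above a certain size, a caller could pick the backend per FFT based on the degree. The following is a hedged sketch of that idea, not code from this PR: `FftBackend`, `GPU_MIN_LOG_N` and `choose_fft_backend` are hypothetical names, and the cutoff of 19 is taken from the observation above rather than from tuning.

```rust
/// Hypothetical backend selector for a single FFT call.
#[derive(Debug, PartialEq, Clone, Copy)]
enum FftBackend {
    Cpu,
    Gpu,
}

/// Assumed cutoff: the description reports GPU gains above degree 2^19.
const GPU_MIN_LOG_N: u32 = 19;

/// Use the GPU only when one is available and the FFT is large enough
/// (log2 of the size strictly above the cutoff); otherwise use the CPU.
fn choose_fft_backend(log_n: u32, gpu_available: bool) -> FftBackend {
    if gpu_available && log_n > GPU_MIN_LOG_N {
        FftBackend::Gpu
    } else {
        FftBackend::Cpu
    }
}
```

The right cutoff likely depends on the hardware, so in practice it would need to be measured per machine rather than hard-coded.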