fiat-crypto + CryptOpt tracking issue #1261
Hi everyone, it took me a bit, but I now have some numbers. I've used the nine machines listed below.
I've created a little project here, which takes this library as the base. Numbers for the geometric mean over nine machines (per-machine tables at the end of the post):
In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I've included implementations with the same mul/square routines. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below), I've included one particular implementation which I ran on all machines. The numbers are cycles for four scalar multiplications (two base point, two variable point). We see that on the geometric mean, the fiat_cryptopt implementation is the most performant one.
Intel Core i7-6770HQ
Intel Core i7-10710U
Intel Core i9-10900K
Intel Core i7-11700KF
Intel Core i9-13900KF
AMD Ryzen Threadripper 1900X
AMD Ryzen 5800X
AMD Ryzen 5950X
AMD Ryzen 7950X
In regards to merging this: in the [...] I've included the implementations as [...]. Side note: I wanted to include the implementations as [...]. In regards to "tell cryptopt to emit general code (i.e. no [...])": [...]
Thanks, you have a nice benchmark suite. :) Interestingly, [...] Anyway, I don't think that any of this should change the plan sketched above. Since we're not in a hurry, it would be awesome to have mit-plv/fiat-crypto#1582 solved first.
Oh okay, we assumed it was a simple change. Anyway, should we decide that we want asm, which is CPU-specific anyway, then I think it makes sense to go 100% for it and use CPUID.
SUPERCOP tries many different compilers / compiler settings (see below), whereas for the [...]
I'll keep it in mind. I'm a bit divided on this, though. CPUs since Oct '14 (Broadwell) support BMI2 / ADX, so it feels wrong not to use them. It would be interesting to see how the usage of this library is distributed across CPUs... On the other hand, I do understand the wish to be as flexible as possible. I'm using Clang 15 and GCC 12.
SUPERCOP compiler settings and compiler version output: [...]
Makes sense.
I'm not convinced that it will provide much more insight right now.
Yep, exactly my thoughts. If you really have an older CPU, the C code will serve you well.
Concept ACK on investigating fiat-crypto and (longer term) cryptopt. Would be great to have a verified implementation that's only slightly slower or even faster. Thanks @dderjoel for the benchmarks.
Owen implemented #810 in the Fiat repo (see PR mit-plv/fiat-crypto#1582). It's not yet fully merged, but I could not wait to benchmark it. I've updated secp_bench and supercop with the new C and CryptOpt-optimized versions. I've also added our 12th Gen machine back into the benchmark suite; here are the updated numbers (copied from the post above, indicating changes). I've used those ten machines.
I've created a little project here, which takes this library as the base. Numbers for the geometric mean over ten machines (per-machine tables at the end of the post):
implementation | default_asm | default_c | default_c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 16.1257 | 14.9874 | 15.8764 | 14.9714 | 14.1071 |
ecmult_const | 30.3317 | 28.2041 | 30.3634 | 28.1846 | 26.1778 |
ecmult_1p | 23.9575 | 22.0809 | 23.641 | 22.1608 | 20.5615 |
ecmult_0p_g | 16.9942 | 15.8604 | 16.9429 | 15.9223 | 14.7684 |
ecmult_1p_g | 14.0591 | 12.8636 | 13.7521 | 12.9906 | 12.0012 |
field_half | 0.0024964 | 0.0024537 | 0.00237626 | 0.0023697 | 0.00237259 |
field_normalize | 0.00737954 | 0.00744617 | 0.00736519 | 0.00732155 | 0.00733252 |
field_normalize_weak | 0.00294988 | 0.00295974 | 0.00293822 | 0.00295422 | 0.00294591 |
field_sqr | 0.014059 | 0.0126383 | 0.0137789 | 0.0123009 | 0.0119273 |
field_mul | 0.0170241 | 0.0144868 | 0.0156579 | 0.014689 | 0.0144257 |
field_inverse | 1.40048 | 1.40896 | 1.39608 | 1.38928 | 1.3986 |
field_inverse_var | 0.915942 | 0.919022 | 0.912739 | 0.907769 | 0.910547 |
field_is_square_var | 1.20313 | 1.21123 | 1.20451 | 1.19417 | 1.19385 |
field_sqrt | 3.87547 | 3.56868 | 3.82354 | 3.44445 | 3.35331 |
In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I've included implementations with the same mul/square routines. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below), I've included one particular implementation which I ran on all machines. The numbers are cycles for four scalar multiplications (two base point, two variable point). We see that on the geometric mean, fiat_cryptopt is again the most performant one (in fact everywhere except on the 1900X, where Fiat-C is the most performant one).
Implementation | Lang. | 1900X | 5800X | 5950X | 7950X | i7 6G | i7 10G | i9 10G | i7 11G | i9 12G | i9 13G | G.M. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
(default_asm) libsecp256k1 [Bitcoin Core 2021] | asm | 719k (1.11x) | 571k (1.11x) | 567k (1.10x) | 560k (1.11x) | 565k (1.07x) | 565k (1.08x) | 564k (1.08x) | 540k (1.13x) | 438k (1.09x) | 439k (1.11x) | 548k (1.09x) |
libsecp256k1+Best (Opt) | asm | 712k (1.10x) | 543k (1.06x) | 545k (1.06x) | 544k (1.08x) | 541k (1.03x) | 541k (1.03x) | 542k (1.04x) | 512k (1.07x) | 449k (1.12x) | 447k (1.13x) | 533k (1.07x) |
(default_c52) libsecp256k1 [Bitcoin Core 2021] (w/o 810) | C | 699k (1.08x) | 560k (1.09x) | 561k (1.09x) | 547k (1.09x) | 574k (1.09x) | 574k (1.10x) | 573k (1.10x) | 536k (1.12x) | 459k (1.14x) | 456k (1.15x) | 550k (1.10x) |
(default_c) libsecp256k1 [Bitcoin Core 2021] (w/ 810) | C | 671k (1.04x) | 538k (1.05x) | 537k (1.04x) | 531k (1.06x) | 567k (1.08x) | 563k (1.07x) | 562k (1.08x) | 504k (1.06x) | 440k (1.09x) | 439k (1.11x) | 531k (1.06x) |
(fiat_c) libsecp256k1 + Fiat-C | C | 648k (1.00x) | 537k (1.05x) | 536k (1.04x) | 536k (1.07x) | 549k (1.04x) | 546k (1.04x) | 548k (1.05x) | 492k (1.03x) | 418k (1.04x) | 418k (1.05x) | 519k (1.04x) |
(fiat_cryptopt) libsecp256k1+Best (Fiat) | asm | 675k (1.04x) | 514k (1.00x) | 515k (1.00x) | 503k (1.00x) | 526k (1.00x) | 524k (1.00x) | 523k (1.00x) | 477k (1.00x) | 402k (1.00x) | 397k (1.00x) | 500k (1.00x) |
Intel Core i7-6770HQ
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 21.9 | 21.6 | 22.4 | 22.2 | 20.0 |
ecmult_const | 40.7 | 40.8 | 42.7 | 41.7 | 36.4 |
ecmult_1p | 31.8 | 31.7 | 33.2 | 32.4 | 28.1 |
ecmult_0p_g | 22.7 | 22.9 | 24.0 | 23.4 | 20.2 |
ecmult_1p_g | 18.6 | 18.5 | 19.4 | 18.9 | 16.5 |
field_half | 0.00470 | 0.00441 | 0.00364 | 0.00436 | 0.00378 |
field_normalize | 0.0106 | 0.0108 | 0.0107 | 0.0107 | 0.0107 |
field_normalize_weak | 0.00420 | 0.00420 | 0.00421 | 0.00422 | 0.00421 |
field_sqr | 0.0184 | 0.0175 | 0.0180 | 0.0169 | 0.0162 |
field_mul | 0.0224 | 0.0198 | 0.0213 | 0.0221 | 0.0181 |
field_inverse | 1.99 | 2.00 | 2.01 | 1.99 | 2.00 |
field_inverse_var | 1.36 | 1.34 | 1.34 | 1.34 | 1.34 |
field_is_square_var | 1.73 | 1.74 | 1.74 | 1.72 | 1.73 |
field_sqrt | 5.04 | 4.83 | 4.90 | 4.66 | 4.44 |
Intel Core i7-10710U
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 19.8 | 19.3 | 20.1 | 19.7 | 17.8 |
ecmult_const | 35.8 | 35.7 | 37.4 | 36.4 | 31.9 |
ecmult_1p | 28.1 | 28.2 | 29.4 | 28.6 | 24.9 |
ecmult_0p_g | 20.3 | 20.6 | 21.4 | 21.0 | 18.2 |
ecmult_1p_g | 16.5 | 16.4 | 17.2 | 16.8 | 14.6 |
field_half | 0.00327 | 0.00370 | 0.00356 | 0.00280 | 0.00368 |
field_normalize | 0.00956 | 0.00956 | 0.00932 | 0.00954 | 0.00932 |
field_normalize_weak | 0.00373 | 0.00373 | 0.00364 | 0.00387 | 0.00364 |
field_sqr | 0.0167 | 0.0157 | 0.0157 | 0.0151 | 0.0144 |
field_mul | 0.0201 | 0.0177 | 0.0188 | 0.0186 | 0.0163 |
field_inverse | 1.71 | 1.72 | 1.72 | 1.71 | 1.72 |
field_inverse_var | 1.21 | 1.21 | 1.21 | 1.22 | 1.22 |
field_is_square_var | 1.60 | 1.54 | 1.56 | 1.57 | 1.56 |
field_sqrt | 4.50 | 4.22 | 4.30 | 4.09 | 3.90 |
Intel Core i9-10900K
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 14.9 | 14.8 | 15.3 | 15.1 | 13.6 |
ecmult_const | 27.9 | 27.9 | 29.5 | 28.5 | 24.9 |
ecmult_1p | 21.8 | 21.7 | 22.7 | 22.2 | 19.3 |
ecmult_0p_g | 15.0 | 15.2 | 15.8 | 15.4 | 13.3 |
ecmult_1p_g | 12.8 | 12.6 | 13.3 | 13.0 | 11.3 |
field_half | 0.00222 | 0.00218 | 0.00218 | 0.00215 | 0.00218 |
field_normalize | 0.00739 | 0.00731 | 0.00731 | 0.00733 | 0.00731 |
field_normalize_weak | 0.00285 | 0.00285 | 0.00286 | 0.00285 | 0.00285 |
field_sqr | 0.0125 | 0.0120 | 0.0123 | 0.0115 | 0.0111 |
field_mul | 0.0154 | 0.0136 | 0.0145 | 0.0142 | 0.0125 |
field_inverse | 1.31 | 1.31 | 1.31 | 1.31 | 1.31 |
field_inverse_var | 0.920 | 0.920 | 0.919 | 0.920 | 0.917 |
field_is_square_var | 1.18 | 1.19 | 1.19 | 1.17 | 1.17 |
field_sqrt | 3.44 | 3.30 | 3.35 | 3.18 | 3.04 |
Intel Core i7-11700KF
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 17.1 | 14.4 | 15.1 | 13.1 | 12.7 |
ecmult_const | 31.5 | 27.3 | 28.0 | 24.5 | 23.5 |
ecmult_1p | 23.7 | 21.2 | 20.6 | 19.5 | 19.5 |
ecmult_0p_g | 16.6 | 14.7 | 14.4 | 14.3 | 13.6 |
ecmult_1p_g | 14.0 | 12.2 | 12.1 | 11.6 | 11.4 |
field_half | 0.00239 | 0.00237 | 0.00240 | 0.00223 | 0.00227 |
field_normalize | 0.00779 | 0.00837 | 0.00805 | 0.00730 | 0.00740 |
field_normalize_weak | 0.00320 | 0.00327 | 0.00325 | 0.00296 | 0.00313 |
field_sqr | 0.0155 | 0.0136 | 0.0142 | 0.0125 | 0.0114 |
field_mul | 0.0175 | 0.0145 | 0.0150 | 0.0128 | 0.0144 |
field_inverse | 1.48 | 1.50 | 1.48 | 1.36 | 1.36 |
field_inverse_var | 0.951 | 0.969 | 0.961 | 0.877 | 0.877 |
field_is_square_var | 1.32 | 1.33 | 1.32 | 1.20 | 1.20 |
field_sqrt | 4.18 | 3.74 | 3.92 | 3.37 | 3.15 |
Intel Core i9-12900KF
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 11.6 | 10.6 | 11.0 | 10.4 | 9.96 |
ecmult_const | 22.6 | 20.2 | 21.3 | 19.9 | 18.9 |
ecmult_1p | 18.1 | 16.0 | 17.0 | 15.8 | 15.0 |
ecmult_0p_g | 13.5 | 11.8 | 12.5 | 11.6 | 11.9 |
ecmult_1p_g | 10.6 | 9.37 | 9.92 | 9.23 | 8.80 |
field_half | 0.00197 | 0.00198 | 0.00196 | 0.00197 | 0.00199 |
field_normalize | 0.00638 | 0.00626 | 0.00639 | 0.00641 | 0.00637 |
field_normalize_weak | 0.00266 | 0.00261 | 0.00264 | 0.00266 | 0.00266 |
field_sqr | 0.0114 | 0.00948 | 0.0114 | 0.00946 | 0.00978 |
field_mul | 0.0152 | 0.0115 | 0.0122 | 0.0113 | 0.0122 |
field_inverse | 1.25 | 1.23 | 1.25 | 1.25 | 1.25 |
field_inverse_var | 0.807 | 0.794 | 0.814 | 0.807 | 0.814 |
field_is_square_var | 1.10 | 1.08 | 1.10 | 1.10 | 1.10 |
field_sqrt | 3.26 | 2.91 | 3.25 | 2.82 | 2.84 |
Intel Core i9-13900KF
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 10.7 | 9.23 | 9.61 | 9.10 | 8.70 |
ecmult_const | 20.1 | 17.7 | 18.7 | 17.5 | 16.6 |
ecmult_1p | 15.9 | 14.1 | 14.9 | 13.9 | 13.2 |
ecmult_0p_g | 11.4 | 9.95 | 10.5 | 9.66 | 9.18 |
ecmult_1p_g | 9.30 | 8.23 | 8.72 | 8.12 | 7.74 |
field_half | 0.00183 | 0.00183 | 0.00171 | 0.00178 | 0.00183 |
field_normalize | 0.00562 | 0.00602 | 0.00571 | 0.00568 | 0.00595 |
field_normalize_weak | 0.00235 | 0.00247 | 0.00233 | 0.00233 | 0.00247 |
field_sqr | 0.00978 | 0.00932 | 0.00986 | 0.00837 | 0.00905 |
field_mul | 0.0132 | 0.0113 | 0.0107 | 0.0102 | 0.0113 |
field_inverse | 1.10 | 1.16 | 1.10 | 1.10 | 1.16 |
field_inverse_var | 0.705 | 0.747 | 0.709 | 0.707 | 0.745 |
field_is_square_var | 0.970 | 1.02 | 0.967 | 0.972 | 1.02 |
field_sqrt | 2.80 | 2.71 | 2.88 | 2.44 | 2.63 |
AMD Ryzen Threadripper 1900X
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 23.6 | 21.7 | 23.9 | 22.1 | 22.5 |
ecmult_const | 46.6 | 42.1 | 47.1 | 42.7 | 43.4 |
ecmult_1p | 36.1 | 32.4 | 36.4 | 33.0 | 32.9 |
ecmult_0p_g | 25.8 | 23.4 | 26.2 | 23.6 | 23.2 |
ecmult_1p_g | 21.0 | 18.9 | 21.2 | 19.2 | 19.1 |
field_half | 0.00264 | 0.00273 | 0.00270 | 0.00270 | 0.00275 |
field_normalize | 0.00960 | 0.00941 | 0.00962 | 0.00942 | 0.00942 |
field_normalize_weak | 0.00330 | 0.00332 | 0.00333 | 0.00355 | 0.00332 |
field_sqr | 0.0196 | 0.0184 | 0.0204 | 0.0175 | 0.0172 |
field_mul | 0.0241 | 0.0214 | 0.0240 | 0.0219 | 0.0218 |
field_inverse | 1.82 | 1.86 | 1.84 | 1.84 | 1.86 |
field_inverse_var | 1.11 | 1.11 | 1.11 | 1.10 | 1.10 |
field_is_square_var | 1.36 | 1.39 | 1.40 | 1.39 | 1.41 |
field_sqrt | 5.38 | 5.08 | 5.56 | 4.82 | 4.77 |
AMD Ryzen 5800X
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 16.3 | 15.4 | 16.8 | 15.6 | 14.5 |
ecmult_const | 30.1 | 28.1 | 31.9 | 28.8 | 26.6 |
ecmult_1p | 24.7 | 22.0 | 24.8 | 22.8 | 20.8 |
ecmult_0p_g | 16.5 | 15.5 | 17.5 | 15.7 | 14.5 |
ecmult_1p_g | 14.6 | 12.9 | 14.4 | 13.2 | 12.1 |
field_half | 0.00255 | 0.00219 | 0.00221 | 0.00238 | 0.00199 |
field_normalize | 0.00661 | 0.00646 | 0.00647 | 0.00655 | 0.00647 |
field_normalize_weak | 0.00275 | 0.00270 | 0.00270 | 0.00272 | 0.00270 |
field_sqr | 0.0142 | 0.0117 | 0.0140 | 0.0124 | 0.0115 |
field_mul | 0.0161 | 0.0136 | 0.0155 | 0.0144 | 0.0142 |
field_inverse | 1.28 | 1.26 | 1.25 | 1.27 | 1.26 |
field_inverse_var | 0.833 | 0.822 | 0.815 | 0.826 | 0.816 |
field_is_square_var | 1.10 | 1.09 | 1.09 | 1.10 | 1.07 |
field_sqrt | 3.81 | 3.35 | 3.84 | 3.47 | 3.30 |
AMD Ryzen 5950X
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 16.1 | 15.5 | 16.7 | 15.3 | 14.6 |
ecmult_const | 30.6 | 28.4 | 31.6 | 28.1 | 26.7 |
ecmult_1p | 25.4 | 22.4 | 25.0 | 22.1 | 20.8 |
ecmult_0p_g | 17.7 | 16.5 | 18.0 | 16.3 | 15.3 |
ecmult_1p_g | 15.0 | 13.1 | 14.1 | 13.2 | 12.2 |
field_half | 0.00223 | 0.00221 | 0.00224 | 0.00224 | 0.00218 |
field_normalize | 0.00649 | 0.00642 | 0.00642 | 0.00643 | 0.00645 |
field_normalize_weak | 0.00270 | 0.00267 | 0.00267 | 0.00267 | 0.00267 |
field_sqr | 0.0142 | 0.0115 | 0.0139 | 0.0121 | 0.0114 |
field_mul | 0.0158 | 0.0132 | 0.0154 | 0.0140 | 0.0140 |
field_inverse | 1.26 | 1.24 | 1.24 | 1.25 | 1.25 |
field_inverse_var | 0.819 | 0.808 | 0.817 | 0.816 | 0.811 |
field_is_square_var | 1.08 | 1.08 | 1.08 | 1.08 | 1.07 |
field_sqrt | 3.75 | 3.33 | 3.81 | 3.39 | 3.27 |
AMD Ryzen 7950X
bench | asm | c | c52 | fiat_c | fiat_cryptopt |
---|---|---|---|---|---|
ecmult_gen | 14.0 | 12.7 | 13.8 | 13.1 | 12.3 |
ecmult_const | 26.4 | 23.9 | 26.7 | 24.8 | 23.0 |
ecmult_1p | 20.7 | 18.6 | 20.9 | 19.5 | 18.2 |
ecmult_0p_g | 15.1 | 13.6 | 15.4 | 14.3 | 13.2 |
ecmult_1p_g | 12.1 | 10.8 | 12.1 | 11.4 | 10.4 |
field_half | 0.00213 | 0.00196 | 0.00189 | 0.00189 | 0.00189 |
field_normalize | 0.00556 | 0.00564 | 0.00549 | 0.00561 | 0.00559 |
field_normalize_weak | 0.00228 | 0.00229 | 0.00228 | 0.00230 | 0.00229 |
field_sqr | 0.0115 | 0.0106 | 0.0112 | 0.0104 | 0.00989 |
field_mul | 0.0137 | 0.0118 | 0.0136 | 0.0123 | 0.0123 |
field_inverse | 1.09 | 1.10 | 1.07 | 1.10 | 1.10 |
field_inverse_var | 0.670 | 0.680 | 0.657 | 0.681 | 0.675 |
field_is_square_var | 0.855 | 0.889 | 0.860 | 0.881 | 0.857 |
field_sqrt | 3.34 | 2.98 | 3.17 | 2.95 | 2.82 |
Hi everyone. Also, the current commits don't check if the CPU feature flags (BMI2 / ADX) are available. I'm happy to do the bulk work, but I'd need some guidance on the following: [...]
Hi everyone, I've updated my fork in three ways: [...]
I believe the last thing needed for a PR would be to check, during the CMake build process, whether BMI2 and ADX are available. Does anyone have other ideas on what we need here?
@dderjoel Thanks for the updates, I'm very glad to see the progress here. My thinking is that we'll actually want to split this into two separate efforts, which we can discuss and consider with different timelines: replacing the existing C code with the Fiat-Crypto generated C code, and introducing the CryptOpt-optimized assembly.
The first is straightforward, I think: your benchmarks show that the Fiat-Crypto generated C code (with the #810 optimization) is competitive with the C code we already have, so I think there is little reason not to just outright replace that code. Since there is no reason to keep the old code too, we don't need any decision logic or build system support to choose one over the other. I have some comments on the implementation, but that can wait for a PR. Overall, this feels very close to usable. Introducing CryptOpt is a more complex matter: because of the use of ADCX/ADOX/MULX, it cannot be a drop-in replacement for the assembly code we already have. While I have no problem with optimizing exclusively for architectures which do have these, we can't just drop support for those who don't (but falling back to the C implementation, or another less-optimized asm implementation, on those is OK, IMO). There are a number of options: [...]
Separate comment, as it feels unrelated to the rest. In inline asm blocks, all variable templates fall into one of these categories: inputs, (normal) outputs, and early-clobber outputs.
Normal output variables may be assigned the same register as an input variable, but early-clobber output variables won't. If there is a PR, I'm happy to look over it.
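For illustration, here is a minimal sketch (not taken from any PR here) of that distinction: the temporary is written before the last input has been read, so it is declared early-clobber to keep the compiler from assigning it the same register as `b`.

```c
/* Minimal sketch: "=&r" (early-clobber) vs. "=r" (normal output).
 * The output is written before the last input is read, so without the "&"
 * the compiler could legally place tmp and b in the same register. */
static unsigned long add_shifted(unsigned long a, unsigned long b) {
    unsigned long tmp;
    __asm__ ("mov %1, %0\n\t"   /* tmp = a (clobbers tmp early)       */
             "shl $1, %0\n\t"   /* tmp <<= 1                          */
             "add %2, %0"       /* tmp += b (b must still be intact)  */
             : "=&r"(tmp)       /* early-clobber output               */
             : "r"(a), "r"(b)   /* inputs                             */
             : "cc");
    return tmp;                 /* returns (a << 1) + b               */
}
```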
I agree that we should add Fiat-Crypto first separately. I wonder what would be necessary for reviewers to be convinced that the code is correct and what exact guarantees the Fiat-Crypto proofs provide.
I think a reasonable path forward is to first add a compile-time option (maybe with a simple abort at run time, e.g., just execute an ADX instruction in the self-check), and then later add runtime autodetection. Modifying CryptOpt to emit generic x86_64 code sounds more complicated, and we'd probably leave performance on the table.
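A minimal sketch of that idea (hypothetical, not actual libsecp256k1 code): executing a single ADX instruction during the self-check makes an unsupported CPU fault up front with #UD (SIGILL), instead of computing wrong results later.

```c
/* Hypothetical sketch: fault early on CPUs without ADX. */
static void selftest_adx_probe(void) {
#if defined(USE_ASM_X86_64)
    unsigned long long x = 1;
    /* adcx raises #UD (SIGILL) if the CPU lacks ADX support */
    __asm__ __volatile__("adcx %0, %0" : "+r"(x) : : "cc");
#endif
}
```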
Indeed. I wonder if @roconnor-blockstream would be interested in having a look at that too... (Thinking out loud) would it be possible to use Klee to prove that the C code that comes out is always correct, independently of the Fiat-Crypto proofs? Actually, if that's feasible, perhaps we could try to do that right now with the current C code too.
That seems reasonable.
I certainly wouldn't have non-ADCX-cryptopt-asm as the only x86_64 asm code in the library longer term; that seems like a waste. But if we don't, what do we do for non-ADCX systems in the runtime autodetection world?
@dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don't think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.
Will create a PR later which only replaces the 64-bit C code. We can then continue the discussion there.
Would that be in src/selftest.h:29?

```c
static int secp256k1_selftest_passes(void) {
int ret = 1;
#if defined(USE_ASM_X86_64)
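/* CPUID leaf 7, subleaf 0 (EAX=7, ECX=0): EBX bit 8 = BMI2, bit 19 = ADX.
 * The asm below computes ret = (EBX >> 8) & (EBX >> 19) & 1, i.e. 1 iff
 * both features are present. */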
__asm__ __volatile__("cpuid\n"
"mov %%rbx, %%rax\n"
"shr $19, %%rbx\n"
"shr $8, %%rax\n"
"and %%rbx, %%rax\n"
"and $1, %%rax\n"
: "=a"(ret)
: "a"(7), "c"(0)
: "rdx", "rbx");
#endif
return secp256k1_selftest_sha256() && ret;
}
```
Yes. According to my benchmarks, the C code is faster than the current asm implementation, so performance-wise it does make sense to fall back to that.
As @real-or-random already pointed out, that is a bit more complex, see also Issue 143 in the CryptOpt repo, where I explain why.
Yes, I agree, which would be sad. As far as I understood, the current autodetection assumes that the code is compiled for the native system, as it checks if the local machine is x86_64. Hence, I thought it might as well also check if the flags are available. The only situation in which this (semantically) fails is if I compile on a (newish) x86 machine and run it on an older x86 machine. This would currently still work, but with flag detection only in the build pipeline it would not. I wonder how likely that is.
Yes, this is quite common in other libraries too, like OpenSSL. If we have that, we could even go one step further and include one version for AMD and one for Intel, or even one for each generation (at some point I believe it's a bit too much to maintain, but as CryptOpt makes it very easy to generate such implementations, it becomes feasible at least :))
That is a question for Owen and the Fiat-Crypto team. If I look at the current 32-bit implementation and its limbs, I think it's 10 limbs, and the last limb has a width of 22 bits. When I call [...] I get [...]
and a lot of intermediate code. On the CryptOpt side, I have not tried incorporating 32-bit code yet. This is probably another engineering task, probably on the order of "get CryptOpt to generate code without ADX / BMI2", as everything is tailored to 64-bit registers and carry bits. But as it does not currently seem that urgent here either, this is not priority 1 on my list.
Interesting. Apparently there's a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn't be extremely surprising.) It looks like there's some extra reduction on the final limb? I'll take a closer look at the 32-bit code to see what the difference is.
Thanks for looking into that, Owen.
So I've created a branch [...]. Now I'm having a hard time sorting [...]
I think @real-or-random meant something even simpler (just some asm code that uses ADX, which presumably crashes on non-supporting systems), but your suggestion is better (and already introduces detection logic which can later be moved from selftesting to autodetection once that is feasible).
Great.
Ah, I see now why it's nontrivial to avoid those instructions. I assumed you already had support for mul/imul, and it was just a matter of disabling mulx... but I get why mul/imul are actually a lot harder to deal with than mulx. In either case, we still do have the option of replacing the asm code with fiat-crypto-c-fed-through-c-compiler, if we find a particularly fast combination of compiler/version/options.
I see where you're coming from now, but no, there is absolutely no assumption that the compilation is for the native system. You tell the libsecp256k1 build system what compiler suite to use, and it targets whatever that compiler is for. By default, that's your standard system compiler, which is most likely for the same architecture (x86/x86_64/arm32/aarch64/...) as the system you're running on, but there is no requirement for that. And even if it's the same architecture, it's not necessarily for your own system. In fact, I believe that there is a significant fraction of libsecp256k1 use that is built through cross-compiling (as compilation for another system than your own is called), as Bitcoin Core's release binaries (which include libsecp256k1) are created on a Linux x86_64 environment, cross-compiled for everything else (including arm32, aarch64, windows x86_64, mac osx, ...). We absolutely can't do autodetection at compile or configure time. The options are manual configuration, or runtime autodetection.
Indeed. That's the advantage of using formal methods, that they lower the review barrier significantly for having multiple options like this. It'd also open the door for things like #967, which adds a field with a more standard 4x64 limb representation, but which is only faster with asm optimization (because it much more crucially relies on fast carries, which are hard to get the compiler to emit for pure C code).
32-bit x86 is all but dead; I absolutely wouldn't suggest you spend any time on asm-optimizing that. The 32-bit C code is there for weak hardware, from hardware wallets to (old) Raspberry Pi-like devices. But none of those are x86. If you're looking for more work, 32-bit ARM and (increasingly) 64-bit ARM are far more interesting targets than 32-bit x86.
Thank you for looking into it!
Yes, [...] should trigger a re-build. Yes, I forgot about the cross compilation and releases. If we want to use runtime detection-and-selection, then I believe the functions must be non-inlined, and e.g. here we could check a global bit to decide which one to jump to: either the ADX asm version or the C-compiled fallback.
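To make that concrete, here is a rough sketch of the global-bit idea (all names are invented, not actual libsecp256k1 identifiers; it uses `__get_cpuid_count` from GCC/Clang's `<cpuid.h>`): the flag is set once, and a non-inlined wrapper dispatches to either the ADX/BMI2 asm or the C fallback.

```c
/* Hypothetical sketch of runtime detection-and-selection; names invented. */
#include <cpuid.h>
#include <stdint.h>

static int has_adx_bmi2 = 0;

static void detect_cpu_features(void) {
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        /* CPUID.(EAX=7,ECX=0): EBX bit 8 = BMI2, bit 19 = ADX */
        has_adx_bmi2 = (ebx >> 8) & (ebx >> 19) & 1;
    }
}

/* Placeholder bodies; in reality these would be the CryptOpt asm and the
 * fiat-crypto C field multiplication. */
static void fe_mul_inner_adx(uint64_t *r, const uint64_t *a, const uint64_t *b) { (void)r; (void)a; (void)b; }
static void fe_mul_inner_c(uint64_t *r, const uint64_t *a, const uint64_t *b)   { (void)r; (void)a; (void)b; }

/* Non-inlined wrapper: check the global bit and jump to one of the two. */
void fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t *b) {
    if (has_adx_bmi2) {
        fe_mul_inner_adx(r, a, b);
    } else {
        fe_mul_inner_c(r, a, b);
    }
}
```

In such a scheme, detect_cpu_features() would presumably run once, e.g. during context creation or the self-check.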
Possibly relevant: the current 10x26 field mul/sqr leave their residual carry in limbs[2] (thus exceeding 26 bits), while the most-significant limb will be fully reduced (22 bits). The 5x52 (C code, not asm) leaves theirs in the most-significant limb (limbs[4]). See this PR: #815 , in particular the first commit: 9388b77 . In that PR @sipa also anticipates that this may be better for fiat-crypto integration: #815 (comment) . I have previously mentioned a side-benefit to our magnitude tracking from rectifying this: #1060 (comment) ("Secondly...")
Okay, I'm more than happy to do the work and let the core team here review, but I'd need some guidance, because I'm not 100% sure of what we want: [...]
I think eventually we want a runtime check and proper dispatch to the right version, having both compiled in. But at this stage, I think it's more interesting to just have a configure flag that enables either the asm or the C code (defaulting to the C code), plus a selfcheck. This allows experimentation, benchmarking, and testing, but avoids the need to make whatever restructuring is needed for autodetection. Moreover, it makes it easier for someone other than you to make the autodetection happen, as all the actual code would already be in.
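As a rough illustration of that configure-flag stage (the macro and the cryptopt header name below are placeholders I invented, not actual build options), the selection could be a plain preprocessor switch driven by the build system:

```c
/* Hypothetical compile-time selection; ENABLE_FIAT_CRYPTOPT_ASM and the
 * cryptopt header name are placeholders, not real libsecp256k1 options. */
#if defined(ENABLE_FIAT_CRYPTOPT_ASM)
#include "field_5x52_cryptopt_impl.h"  /* hypothetical CryptOpt asm bindings */
#else
#include "field_5x52_int128_impl.h"    /* portable C implementation */
#endif
```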
Just created a PR for exactly this. Can I trigger CI without removing the draft status?
I've re-run the secp_bench suite on the exact fork on my 10 test machines, and the results are still very similar to the ones before.
An update on this: as Joel mentioned, using the same template we used for the 64-bit code did not work to generate the 32-bit code. There are two reasons it didn't work: [...]
I am curious whether there is a simple explanation (other than "it makes the bounds work out right") for why we need to start the 32-bit mul with this reduction from p17, but we don't need to start the 64-bit mul with a reduction from p7. Ideally, I'd like to be able to answer questions like "Hypothetically, if we were on a 16-bit machine (with a 20-limb implementation of this algorithm), which partial product would our first reduction be from? p36 (the third-highest partial product), perhaps?" Of course it isn't vital to answer questions like this, but it would be nice to view both the 32-bit mul and the 64-bit mul as different instances of the same general algorithm.
I overcame issue (2) above simply by creating a Coq implementation for the 32-bit mul which is separate from the implementation of the 64-bit mul. I followed the C implementation from the first commit 9388b77 of #815, which @peterdettman mentioned above. Here's the C code I generated for the 32-bit mul and sqr: secp256k1_dettman_32.c. Would you be interested in including this code in the library as well? I assume we'd want some benchmarks before considering that too seriously, and I gather from the discussion in #815 that not many people have 32-bit machines on which to benchmark things like this. Note that the generated code does not include the Karatsuba optimization from the third commit of #815. I think it would be interesting to add the Karatsuba multiplication to fiat-crypto, so I would be willing to add that optimization as well. And the 32-bit code I generated does incorporate the optimization from #810. The handwritten 32-bit code from #815 also has this optimization.
Well, it pretty much is just "it makes the bounds work out right", but the crux is which of the high partial products we can do last so that our residual carry lands neatly in the most-significant limb. We then start one higher than that.
For 10x26: [...] So our best option for the last partial-products step is p6/p16, with a final carry chain landing in r9 as desired.
For 5x52: [...] So our best option for the last partial-products step is p2/p7, with a final carry chain landing in r4 as desired.
For our field there's (conservatively) an extra bit coming from the lower partial product in each pair and the carry-in (that I ignored above), but for a field with a prime very (very) close to a power of 2, the lower partial product would need to be included in the analysis properly.
Regarding 16-bit, I doubt a practical algorithm would try to follow the pattern of these implementations. If a reduced radix is even tenable, I would guess it would have to use Karatsuba, perhaps with 20 limbs in four groups of 64 bits: (13 13 13 13 12). (Note that this is already incompatible with the current assumption of 4 extra bits for carry-free linear field ops.)
Also, +1 in principle to including the 32-bit code as well.
fiat-crypto can generate verified field code for multiple targets, e.g., C and x86_64 asm. It has algorithm templates for our mul and sqr algorithm (under the name "Dettman") for secp256k1's field. But the current implementation misses at least one optimization that we have in our C code (but not asm), namely the one from #810.
CryptOpt can optimize fiat-crypto's output, producing significantly faster asm while retaining the formal guarantees.
A plausible route towards using these projects in this library is: [...]