Optimal number of Keccak-f1600 lanes #33
Comments
I would not fix the parallelism per Kyber parameter set, but rather make it configurable per parameter set and target CPU. It is not unimaginable that there is a CPU for which x2 actually gives the best performance, and for some we may want to go for x8. So maybe we provide APIs for x1, x2, x3, x4, x8? Anything missing? For the CBD-sampled vectors you can also use parallelism. Regarding memory layout: yes, it should be interleaved per 64-bit word.
We can either support APIs for Keccak x1, x2, x3, x4, x8 and fix the optimal setting for Kyber, or generate the optimal Kyber setting for the target CPU. The former is easy to do, but the latter is preferred. I wonder if there is an automatic way to generate the best Keccak implementation for a target CPU. I know SLOTHY targets a few CPUs, but finding the optimal setting for current and future CPUs is more work, either for the SLOTHY team or for us. If a CPU is not supported by SLOTHY, what is our default option for Keccak? I can attempt to design APIs with fixed parallelism for Kyber first, and see whether this approach increases the code package size.
Welcome @hanno-becker!
Thanks @cothan! I agree that optimal performance may require different choices of N in the N-way Keccak permutation, but for the sake of progress and code clarity I'd suggest we start with the fixed choice of N=4 and see how that goes. What do you think, @cothan @mkannwischer?
Sounds good to me. That would also be compatible with the AVX2 implementation. Maybe the high level code can be taken from there already. We could start with one of the implementations from https://gitlab.com/arm-research/security/pqax/-/tree/master/asm/manual/keccak_f1600?ref_type=heads. In the medium term we probably want to do the interleaving using SLOTHY; that should not be too hard to achieve for the uArchs already supported by SLOTHY.
As in PR #62, we fix Keccak to 4-way for now.
Optimal number of Keccak-f1600 lanes

The SHAKE128 function is used in the `gen_matrix` function of the C reference implementation, with `KYBER_K = 2, 3, 4`.
The number of `xof_absorb` (`SHAKE128_Absorb`) calls is `KYBER_K x KYBER_K`. Thus, the two for loops will need:
- 4 absorbs for `KYBER_K = 2`.
- 9 absorbs for `KYBER_K = 3`.
- 16 absorbs for `KYBER_K = 4`.
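
To make the absorb counts above concrete, here is a minimal sketch of how many calls to an N-way Keccak routine would be needed to cover them, and how many parallel slots stay idle in the last call. The helper names `num_absorbs`, `num_batches` and `idle_slots` are hypothetical and not part of any existing API. The sketch counts only the initial absorb per matrix entry, not extra squeezes triggered by rejection sampling.

```c
#include <assert.h>

/* One SHAKE128 absorb per matrix entry: KYBER_K x KYBER_K in total. */
unsigned int num_absorbs(unsigned int kyber_k)
{
    return kyber_k * kyber_k;
}

/* Calls to a `ways`-way Keccak routine needed to cover all absorbs
 * (ceiling division). Hypothetical helper for illustration only. */
unsigned int num_batches(unsigned int kyber_k, unsigned int ways)
{
    assert(ways > 0);
    return (num_absorbs(kyber_k) + ways - 1) / ways;
}

/* Parallel slots left unused in the final batch. */
unsigned int idle_slots(unsigned int kyber_k, unsigned int ways)
{
    assert(ways > 0);
    return num_batches(kyber_k, ways) * ways - num_absorbs(kyber_k);
}
```

Under these assumptions, `KYBER_K = 3` needs three batches with either x3 or x4, but x4 leaves three slots idle in the last batch, while x4 divides the 4 and 16 absorbs of `KYBER_K = 2` and `KYBER_K = 4` evenly.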
The output state of `xof_absorb` is used in `xof_squeezeblocks` and sent to the sampling function `reject_uniform()`.

When the counter returned by `reject_uniform()` is below the vector size `KYBER_N` (256), additional squeezes are needed until sampling fills all `KYBER_N` coefficients.

See Table 1 in the paper by Hanno Becker and Matthias Kannwischer.
It shows:
- `KYBER_K = 3`.
- `KYBER_K = 2, 4`.

Of course, the numbers vary depending on the ARM processor, but the relative rankings between x2 and x3 and between x2 and x4 stay the same.
My past benchmarks on Apple M1 show that `x2` is better than `x4`, but the difference is small and contributes very little to the overall speed-up of Kyber on Apple M1. So I still think using `x4` is optimal for many ARM CPUs.

My conclusion:
- For `KYBER_K = 2, 4`, we use x4.
- For `KYBER_K = 3`, we use x3.

What do you think? @mkannwischer @Hanno
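
For reference, the squeeze-and-reject step described in this section can be sketched roughly as below. This is an illustration, not the project's code: the name `rej_uniform_sketch` is hypothetical, and the sketch assumes the usual Kyber parsing of every 3 squeezed bytes into two 12-bit candidates that are accepted when below the modulus `KYBER_Q = 3329`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KYBER_N 256
#define KYBER_Q 3329

/* Rejection-sample up to `len` coefficients from `buf`.
 * Every 3 bytes yield two 12-bit candidates; candidates >= KYBER_Q
 * are dropped. Returns the number of coefficients written to `r`. */
unsigned int rej_uniform_sketch(int16_t *r, unsigned int len,
                                const uint8_t *buf, size_t buflen)
{
    unsigned int ctr = 0;
    size_t pos = 0;

    assert(r != NULL && buf != NULL);

    while (ctr < len && pos + 3 <= buflen) {
        uint16_t val0 = (buf[pos] | ((uint16_t)buf[pos + 1] << 8)) & 0xFFF;
        uint16_t val1 = ((buf[pos + 1] >> 4) | ((uint16_t)buf[pos + 2] << 4)) & 0xFFF;
        pos += 3;

        if (val0 < KYBER_Q)
            r[ctr++] = (int16_t)val0;
        if (ctr < len && val1 < KYBER_Q)
            r[ctr++] = (int16_t)val1;
    }
    return ctr;
}
```

If the returned count is below `KYBER_N`, the caller would squeeze another block and continue sampling from where it stopped. In an N-way setting this matters because parallel instances may need a different number of extra squeezes.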
Optimal memory layout for Keccak-f1600

Closely related to the implementation of Keccak-f1600, the memory layout helps make loads and stores easier and faster.
Two choices of memory layout:
a. x3:
Line 1 | Line 2 | Line 3 | Line 1 | Line 2 | Line 3 | ... and repeat
b. x4:
Line 1 | Line 2 | Line 3 | Line 4 | Line 1 | Line 2 | Line 3 | Line 4 | ... and repeat

The second approach (b) seems optimal to me. What do you think? @mkannwischer @hanno-becker
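
A minimal sketch of the interleaved layout, assuming that "Line k" above denotes the k-th parallel Keccak state and that the states are interleaved per 64-bit word. The names `interleaved_index` and `interleave_state` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KECCAK_WORDS 25 /* Keccak-f1600 state: 25 words of 64 bits */

/* Position of 64-bit word `word` of parallel state `inst` in a buffer
 * that interleaves `ways` states word by word. */
size_t interleaved_index(size_t word, size_t inst, size_t ways)
{
    assert(word < KECCAK_WORDS && inst < ways);
    return word * ways + inst;
}

/* Copy one plain 25-word state into slot `inst` of an interleaved
 * buffer holding KECCAK_WORDS * ways words. */
void interleave_state(uint64_t *dst, const uint64_t src[KECCAK_WORDS],
                      size_t inst, size_t ways)
{
    for (size_t word = 0; word < KECCAK_WORDS; word++)
        dst[interleaved_index(word, inst, ways)] = src[word];
}
```

With `ways = 4`, word 0 of the four states occupies indices 0 to 3, word 1 occupies indices 4 to 7, and so on, so one contiguous load fetches the same word of every parallel state.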