Optimal number of Keccak-f1600 lanes #33

cothan · 2024-05-07T15:27:11Z

Optimal number of `Keccak-f1600` lanes

SHAKE128 function is used in gen_matrix function with KYBER_K = 2,3,4.

Here is the C reference implementation:

    for (i = 0; i < KYBER_K; i++) {
        for (j = 0; j < KYBER_K; j++) {
            if (transposed) {
                xof_absorb(&state, seed, i, j);
            } else {
                xof_absorb(&state, seed, j, i);
            }

            xof_squeezeblocks(buf, GEN_MATRIX_NBLOCKS, &state);
            buflen = GEN_MATRIX_NBLOCKS * XOF_BLOCKBYTES;
            ctr = rej_uniform(a[i].vec[j].coeffs, KYBER_N, buf, buflen);

The number of xof_absorb (SHAKE128_Absorb) are KYBER_K x KYBER_K. Thus, the two for loops will need:

4 when KYBER_K = 2.
9 when KYBER_K = 3.
16 when KYBER_K = 4.

The output state of xof_absorb will be used in xof_sqeezeblock and send to sampling function reject_uniform().
When the counter of reject_uniform() does not meet the size of vector KYBER_N (256), an additional squeeze is needed until the sampling fulfill the vector KYBER_N.

        while (ctr < KYBER_N) {
                off = buflen % 3;
                for (k = 0; k < off; k++) {
                    buf[k] = buf[buflen - off + k];
                }
                xof_squeezeblocks(buf + off, 1, &state);
                buflen = off + XOF_BLOCKBYTES;
                ctr += rej_uniform(a[i].vec[j].coeffs + ctr, KYBER_N - ctr, buf, buflen);
            }

According to Table 1 in Hanno Becker and Matthias Kannwischer paper

It shows:

x2 (2 times Keccak in a single call) is not optimal.
x3 (3 times Keccak in a single call) is better than x2, suitable for KYBER_K = 3.
x4 (4 times Keccak in a single call) is better than x2, suitable for KYBER_K = 2, 4.

Of course, the number are varies depend on the ARM processor, the relative ranking between x2-x3, x2-x4 stay the same.
I had my benchmark in the past in Apple M1 shows that x2 is better than x4, but the different is somewhat small, and the differences contribute very little to the overall speed-up of Kyber on Apple M1. So I still think using x4 is optimal for many ARM CPUs.

My conclusion:

KYBER_K = 2,4, we use x4.
KYBER_K = 3, we use x3.

What do you think? @mkannwischer @Hanno

Optimal memory `Keccak-f1600` layout:

Closely related to the implementation of Keccak-f1600, the memory layout will help make the load/store easier/faster.
Two choices of memory layout:

Vector style: Vector 1 | Vector 2 .... as in the default C implementation. To load prepare the memory for Keccak, one must cover it to lane style, to store as vector style, one must convert from lane style back to vector style.
Lane style. There is no need to prepare the memory, but for the case we need vector style, we will need additional function to convert from lane style to vector style (which can be embedded in other function calls).
a. x3: Line 1 | Line 2 | Line 3 | Line 1 | Line 2 | Line 3 | .. and repeat
b. x4: Line 1 | Line 2 | Line 3 | Line 4 | Line 1 | Line 2 | Line 3 | Line 4 | ... and repeat.

The 2. approach seem optimal to me. What do you think? @mkannwischer @hanno-becker

The text was updated successfully, but these errors were encountered:

mkannwischer · 2024-05-08T04:03:54Z

I would not fix the parallelism per Kyber parameter set, but rather make this configurable per parameter set and target CPU. It is not unimaginable that there is a CPU for which x2 actually gives you the best performance and also for some we may want to go for x8.
If I remember correctly, for M1, we expect a x8 with x6 on the Neon units and x2 on the scalar units to be optimal. But no one has tried afaik.

So maybe we provide APIs for x1, x2, x3, x4, x8? Anything missing?

For the cbd-sampled vectors you can also use parallism.

Regaring memory layout: Yes, it should be interleaved per 64-bit word.

cothan · 2024-05-08T13:47:21Z

We can either support APIs for Keccak x1, x2, x3, x4, x8 and fix optimal setting for Kyber, or generate optimal Kyber setting for target CPU. The former is easy to do, but later is preferred.

I wonder if there is an automatic way to generate the best Keccak implementation for a target CPU, I know SLOTHY target a few CPUs, it's more work for SLOTHY team or us to find optimal setting for current and future CPU. If the CPU is not supported in SLOTHY, what is our default option for Keccak?

I can attempt to design APIs for with fix parallelism first for Kyber, and see if if this approach increases the code package size.

cothan · 2024-05-08T21:20:47Z

welcome @hanno-becker!

hanno-becker · 2024-05-09T14:43:31Z

Thanks @cothan!

I agree that optimal performance may require different choices of N in the N-way Keccak permutation, but for the sake of progress and code-clarity I'd suggest we start with the fixed choice of N=4 and see how that goes. What do you think, @cothan @mkannwischer?

mkannwischer · 2024-05-10T03:29:54Z

Sounds good to me. That would also be compatible with the AVX2 implementation. Maybe the high level code can be taken from there already.

We could start with one of the implementations from https://gitlab.com/arm-research/security/pqax/-/tree/master/asm/manual/keccak_f1600?ref_type=heads. In the medium term we probably want to do the interleaving using SLOTHY - should not be too hard to achieve for the uArchs already supported by SLOTHY.

hanno-becker · 2024-05-10T05:01:04Z

I created #35 and #36 for this.

cothan · 2024-06-17T16:25:23Z

As in the PR #62 , we fix the Keccak to 4-way for now.

cothan added the enhancement New feature or request label May 7, 2024

cothan assigned cothan and mkannwischer May 7, 2024

cothan assigned hanno-becker May 8, 2024

This was referenced May 10, 2024

Add internal interface to 4-way Keccak and implement C-fallback #35

Closed

Add AArch64 implementation for 4-way parallel Keccak #36

Closed

mkannwischer mentioned this issue May 24, 2024

Evaluate sharing of C code between mlkem-c-{generic, aarch64, embedded} #42

Closed

cothan closed this as completed Aug 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimal number of Keccak-f1600 lanes #33

Optimal number of Keccak-f1600 lanes #33

cothan commented May 7, 2024 •

edited

Loading

mkannwischer commented May 8, 2024

cothan commented May 8, 2024

cothan commented May 8, 2024

hanno-becker commented May 9, 2024 •

edited

Loading

mkannwischer commented May 10, 2024

hanno-becker commented May 10, 2024

cothan commented Jun 17, 2024

Optimal number of Keccak-f1600 lanes #33

Optimal number of Keccak-f1600 lanes #33

Comments

cothan commented May 7, 2024 • edited Loading