Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimal number of Keccak-f1600 lanes #33

Closed
cothan opened this issue May 7, 2024 · 7 comments
Closed

Optimal number of Keccak-f1600 lanes #33

cothan opened this issue May 7, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@cothan
Copy link
Contributor

cothan commented May 7, 2024

Optimal number of Keccak-f1600 lanes

SHAKE128 function is used in gen_matrix function with KYBER_K = 2,3,4.

Here is the C reference implementation:

    for (i = 0; i < KYBER_K; i++) {
        for (j = 0; j < KYBER_K; j++) {
            if (transposed) {
                xof_absorb(&state, seed, i, j);
            } else {
                xof_absorb(&state, seed, j, i);
            }

            xof_squeezeblocks(buf, GEN_MATRIX_NBLOCKS, &state);
            buflen = GEN_MATRIX_NBLOCKS * XOF_BLOCKBYTES;
            ctr = rej_uniform(a[i].vec[j].coeffs, KYBER_N, buf, buflen);

The number of xof_absorb (SHAKE128_Absorb) are KYBER_K x KYBER_K. Thus, the two for loops will need:

  • 4 when KYBER_K = 2.
  • 9 when KYBER_K = 3.
  • 16 when KYBER_K = 4.

The output state of xof_absorb will be used in xof_sqeezeblock and send to sampling function reject_uniform().
When the counter of reject_uniform() does not meet the size of vector KYBER_N (256), an additional squeeze is needed until the sampling fulfill the vector KYBER_N.

        while (ctr < KYBER_N) {
                off = buflen % 3;
                for (k = 0; k < off; k++) {
                    buf[k] = buf[buflen - off + k];
                }
                xof_squeezeblocks(buf + off, 1, &state);
                buflen = off + XOF_BLOCKBYTES;
                ctr += rej_uniform(a[i].vec[j].coeffs + ctr, KYBER_N - ctr, buf, buflen);
            }

According to Table 1 in Hanno Becker and Matthias Kannwischer paper

It shows:

  • x2 (2 times Keccak in a single call) is not optimal.
  • x3 (3 times Keccak in a single call) is better than x2, suitable for KYBER_K = 3.
  • x4 (4 times Keccak in a single call) is better than x2, suitable for KYBER_K = 2, 4.

Of course, the number are varies depend on the ARM processor, the relative ranking between x2-x3, x2-x4 stay the same.
I had my benchmark in the past in Apple M1 shows that x2 is better than x4, but the different is somewhat small, and the differences contribute very little to the overall speed-up of Kyber on Apple M1. So I still think using x4 is optimal for many ARM CPUs.

My conclusion:

  • KYBER_K = 2,4, we use x4.
  • KYBER_K = 3, we use x3.

What do you think? @mkannwischer @Hanno

Optimal memory Keccak-f1600 layout:

Closely related to the implementation of Keccak-f1600, the memory layout will help make the load/store easier/faster.
Two choices of memory layout:

  1. Vector style: Vector 1 | Vector 2 .... as in the default C implementation. To load prepare the memory for Keccak, one must cover it to lane style, to store as vector style, one must convert from lane style back to vector style.
  2. Lane style. There is no need to prepare the memory, but for the case we need vector style, we will need additional function to convert from lane style to vector style (which can be embedded in other function calls).
    a. x3: Line 1 | Line 2 | Line 3 | Line 1 | Line 2 | Line 3 | .. and repeat
    b. x4: Line 1 | Line 2 | Line 3 | Line 4 | Line 1 | Line 2 | Line 3 | Line 4 | ... and repeat.

The 2. approach seem optimal to me. What do you think? @mkannwischer @hanno-becker

@cothan cothan added the enhancement New feature or request label May 7, 2024
@mkannwischer
Copy link
Contributor

I would not fix the parallelism per Kyber parameter set, but rather make this configurable per parameter set and target CPU. It is not unimaginable that there is a CPU for which x2 actually gives you the best performance and also for some we may want to go for x8.
If I remember correctly, for M1, we expect a x8 with x6 on the Neon units and x2 on the scalar units to be optimal. But no one has tried afaik.

So maybe we provide APIs for x1, x2, x3, x4, x8? Anything missing?

For the cbd-sampled vectors you can also use parallism.

Regaring memory layout: Yes, it should be interleaved per 64-bit word.

@cothan
Copy link
Contributor Author

cothan commented May 8, 2024

We can either support APIs for Keccak x1, x2, x3, x4, x8 and fix optimal setting for Kyber, or generate optimal Kyber setting for target CPU. The former is easy to do, but later is preferred.

I wonder if there is an automatic way to generate the best Keccak implementation for a target CPU, I know SLOTHY target a few CPUs, it's more work for SLOTHY team or us to find optimal setting for current and future CPU. If the CPU is not supported in SLOTHY, what is our default option for Keccak?

I can attempt to design APIs for with fix parallelism first for Kyber, and see if if this approach increases the code package size.

@cothan
Copy link
Contributor Author

cothan commented May 8, 2024

welcome @hanno-becker!

@hanno-becker
Copy link
Contributor

hanno-becker commented May 9, 2024

Thanks @cothan!

I agree that optimal performance may require different choices of N in the N-way Keccak permutation, but for the sake of progress and code-clarity I'd suggest we start with the fixed choice of N=4 and see how that goes. What do you think, @cothan @mkannwischer?

@mkannwischer
Copy link
Contributor

Sounds good to me. That would also be compatible with the AVX2 implementation. Maybe the high level code can be taken from there already.

We could start with one of the implementations from https://gitlab.com/arm-research/security/pqax/-/tree/master/asm/manual/keccak_f1600?ref_type=heads. In the medium term we probably want to do the interleaving using SLOTHY - should not be too hard to achieve for the uArchs already supported by SLOTHY.

@hanno-becker
Copy link
Contributor

I created #35 and #36 for this.

@cothan
Copy link
Contributor Author

cothan commented Jun 17, 2024

As in the PR #62 , we fix the Keccak to 4-way for now.

@cothan cothan closed this as completed Aug 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants