Add x4 batched SHAKE128 and SHAKE256 APIs #2247

manastasova · 2025-03-06T00:30:56Z

mlkem-native uses batched SHAKE to compute several XOF digests simultaneously, i.e., every call to a batched SHAKE function performs four independent executions of the algorithm (x4).

This PR introduces the required new SHAKE x4 APIs. Internally, they call four independent SHAKE functions (either the entire algorithm or a part of it, such as init, absorb, or squeeze).

Issues:

Resolves #CryptoAlg-2959

Description of changes:

SHAKE128_Init_x4
SHAKE128_Absorb_once_x4
SHAKE128_Squeezeblocks_x4
SHAKE256_x4

Call-outs:

Currently, the APIs are named as mlkem-specific functions since there is no other use case for them (absorbing once, squeezing entire blocks). This naming will allow them to be implemented optimally later on, skipping some unnecessary checks. However, since they internally call the existing aws-lc SHAKE functions, they inherit all functionality supported by those functions (incremental absorbs, incremental byte-wise squeezes).

The aws-lc integration of the x4 batched SHAKE APIs makes it possible to comply with the current mlkem-native design. These functions will not allow a SIMD assembly Keccak implementation to be added to aws-lc without further changes. However, they establish the batched APIs and could be inlined in mlkem-native code.

Testing:

New SHAKETest_x4 test:
./crypto/crypto_test --gtest_filter="SHAKETest_x4.RandomMessages"

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

Batched Keccak APIs would be used by mlkem/mldsa where multiple SHAKE XOF functions are computed

codecov-commenter · 2025-03-06T00:52:04Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.04%. Comparing base (aa43a85) to head (1d0178f).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2247      +/-   ##
==========================================
+ Coverage   79.02%   79.04%   +0.01%     
==========================================
  Files         612      612              
  Lines      106588   106630      +42     
  Branches    15082    15098      +16     
==========================================
+ Hits        84232    84282      +50     
+ Misses      21703    21695       -8     
  Partials      653      653

hanno-becker

Thank you @manastasova!

Could we please document the failure conditions for those functions? The ML-KEM call-site does not expect errors from the SHAKE API and hence needs to argue why none of the failure conditions can occur.

(This is also an issue with the pre-existing x1 API (see #2250) for which no documentation of failure conditions is given, but the description may be easier for the non-incremental case).

Also, do we need squeeze to accept an arbitrary length, or can it be restricted to multiples of the block size? The current API forces the internal maintenance of an offset pointer, which would not be necessary if only full blocks were squeezed.
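To illustrate the point about the offset pointer, here is a minimal sketch using a toy XOF. Everything in it (`toy_xof`, `toy_next_block`, the 8-byte `TOY_RATE`) is a hypothetical stand-in and not aws-lc or Keccak code. It only shows why byte-wise squeezing needs a buffered block plus an offset, while block-wise squeezing does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_RATE 8 /* toy block size in bytes; SHAKE128 uses 168 */

/* Hypothetical toy XOF state. |offset| exists only to serve byte-wise
 * squeezing: it records how much of |block| has already been handed out. */
typedef struct {
  uint8_t counter;         /* stands in for the permutation state */
  uint8_t block[TOY_RATE]; /* buffered output block */
  size_t offset;           /* bytes of |block| already consumed */
} toy_xof;

static void toy_xof_init(toy_xof *x) {
  x->counter = 0;
  x->offset = TOY_RATE; /* no block buffered yet */
}

/* Placeholder for "permute and extract one block"; this is not Keccak. */
static void toy_next_block(toy_xof *x, uint8_t out[TOY_RATE]) {
  for (size_t i = 0; i < TOY_RATE; i++) {
    out[i] = (uint8_t)(x->counter * TOY_RATE + i);
  }
  x->counter++;
}

/* Block-wise squeeze: writes straight to |out|, no buffering, no offset. */
static void toy_squeezeblocks(toy_xof *x, uint8_t *out, size_t nblocks) {
  for (size_t b = 0; b < nblocks; b++) {
    toy_next_block(x, out + b * TOY_RATE);
  }
}

/* Byte-wise squeeze: must buffer a block and track the position in it. */
static void toy_squeeze_bytes(toy_xof *x, uint8_t *out, size_t len) {
  while (len > 0) {
    if (x->offset == TOY_RATE) {
      toy_next_block(x, x->block);
      x->offset = 0;
    }
    size_t n = TOY_RATE - x->offset;
    if (n > len) {
      n = len;
    }
    memcpy(out, x->block + x->offset, n);
    x->offset += n;
    out += n;
    len -= n;
  }
}
```

In this sketch, squeezing 3 bytes and then 13 bytes yields the same 16 bytes as squeezing two whole blocks, but only the byte-wise path touches `block` and `offset`.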

hanno-becker · 2025-03-06T06:12:10Z

crypto/fipsmodule/sha/sha3.c

+
+int SHAKE128_Init_x4(KECCAK1600_CTX_x4 *ctx) {
+
+  int ok = (SHAKE_Init(&(*ctx)[0], SHAKE128_BLOCKSIZE) &&


If it is intentional that later calls are skipped if earlier calls fail (because of lazy evaluation of &&), this may be worth documenting. Same for the other functions.

Thank you.
The batched function succeeds only if all underlying functions succeed, so there is no need to continue once one of them has failed. I added a comment here:

aws-lc/crypto/fipsmodule/sha/internal.h

Line 463 in 4fca7ae

// It fails on the first |SHAKE_Init| function fail.
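The short-circuit behaviour discussed in this thread can be sketched as follows. `toy_ctx`, `toy_init`, and `toy_init_x4` are hypothetical stand-ins, not the aws-lc functions, and the four lanes are passed as separate pointers only so that a failure can be injected into one of them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for KECCAK1600_CTX; not aws-lc code. */
typedef struct { int initialised; } toy_ctx;

static int toy_init_calls = 0; /* counts how many lane inits actually ran */

/* Hypothetical stand-in for SHAKE_Init: returns 1 on success, 0 on NULL. */
static int toy_init(toy_ctx *ctx) {
  toy_init_calls++;
  if (ctx == NULL) {
    return 0;
  }
  ctx->initialised = 1;
  return 1;
}

/* Same && chain as the x4 init above. && evaluates left to right and
 * stops at the first operand that is 0, so later lanes are skipped. */
static int toy_init_x4(toy_ctx *c0, toy_ctx *c1, toy_ctx *c2, toy_ctx *c3) {
  return toy_init(c0) && toy_init(c1) && toy_init(c2) && toy_init(c3);
}
```

If the second lane fails, the third and fourth lanes are never touched, which is the behaviour the comment in internal.h documents.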

hanno-becker · 2025-03-06T06:13:27Z

crypto/fipsmodule/sha/sha3.c

@@ -377,3 +377,48 @@ int SHAKE_Squeeze(uint8_t *md, KECCAK1600_CTX *ctx, size_t len) {
  //FIPS_service_indicator_update_state();
  return 1;
 }
+
+int SHAKE128_Init_x4(KECCAK1600_CTX_x4 *ctx) {


Can this function actually fail? If so, when?

I added a comment:

aws-lc/crypto/fipsmodule/sha/internal.h

Line 464 in 4fca7ae

// As part of MLKEM PQ algorithm: SHAKE128_Init_x4 always returns 1 since it

If the function always succeeds, can we drop the return code?

I think it is better to keep it. It is indeed a function designed specifically for mlkem; nevertheless, nothing prevents it from being called in a different use case (if there is one). Wouldn't it be better to document the skipped return-code check in mlkem-native, and justify it with the input values?

I think it is better to keep it. It is indeed a function designed specifically for mlkem; nevertheless, nothing prevents it from being called in a different use case (if there is one).

Yes, but my understanding is that this function will never fail, regardless of who calls it?

hanno-becker · 2025-03-06T06:13:39Z

crypto/fipsmodule/sha/sha3.c

+  return ok;
+}
+
+int SHAKE128_Absorb_once_x4(KECCAK1600_CTX_x4 *ctx, const void *data0, const void *data1,


Under what condition would this function fail?

Commented here

aws-lc/crypto/fipsmodule/sha/internal.h

Line 473 in 4fca7ae

// As part of MLKEM PQ algorithm: SHAKE128_Absorb_once_x4 always returns 1 since it

If the function always succeeds, can we drop the return code?

Same as above: I think it is better to keep it. It is indeed a function designed specifically for mlkem; nevertheless, nothing prevents it from being called in a different use case (if there is one). Wouldn't it be better to document the skipped return-code check in mlkem-native, and justify it with the input values?

Yes, but this is not a question about the caller/use case?

hanno-becker · 2025-03-06T06:14:27Z

crypto/fipsmodule/sha/sha3.c

+  return ok;
+}
+
+int SHAKE128_Squeezeblocks_x4(uint8_t *md0, uint8_t *md1, uint8_t *md2, uint8_t *md3,


Do we need the generality of an arbitrary len, or can we stick to a multiple of the block size for now? I seem to remember that ML-KEM only squeezes a block at a time.

If the generality of an arbitrary length is needed, the function should not be called Squeezeblocks.

SHAKE128_Squeezeblocks_x4 aims at defining the new API needed by mlkem-native. Thus, it is defined as mlkem-specific function (only allows squeezes of entire blocks), while currently internally calling existing SHAKE sequential APIs (to split changes into incremental updates).
That is my understanding based on our previous discussions.
The same reasoning is applied to the SHAKE128_Absorb_once_x4 mlkem-specific API which inherits the generic functionality from the underlying existing SHAKE functions.

My point is that the current implementation does not squeeze a number of blocks, but len is used as the number of bytes?

I think you want something like (but 4-fold)

aws-lc/crypto/fipsmodule/ml_kem/fips202_glue.h

Lines 36 to 41 in 09ad19e

static MLK_INLINE void mlk_shake128_squeezeblocks(uint8_t *output, size_t nblocks,
                                                  mlk_shake128ctx *state) {
  // TODO: Document why this function does not fail in the context
  // of the calls made by mlkem-native.
  (void) SHAKE_Squeeze(output, state, nblocks * SHAKE128_RATE);
}
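A 4-fold version of that idea might look roughly like the sketch below. It is not the code that was merged: `toy_ctx`, `toy_squeeze`, and `toy_squeezeblocks_x4` are hypothetical stand-ins, and the squeeze stub only writes placeholder bytes. The one thing it is meant to show is that the caller passes a block count and the multiplication by the rate (168 bytes for SHAKE128) happens inside the function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SHAKE128_RATE 168 /* SHAKE128 rate in bytes: (1600 - 2*128) / 8 */

/* Hypothetical stand-in for KECCAK1600_CTX: just counts bytes produced. */
typedef struct { size_t squeezed; uint8_t lane; } toy_ctx;
typedef toy_ctx toy_ctx_x4[4];

/* Hypothetical stand-in for SHAKE_Squeeze: writes |len| placeholder bytes. */
static int toy_squeeze(uint8_t *md, toy_ctx *ctx, size_t len) {
  if (md == NULL || ctx == NULL) {
    return 0;
  }
  for (size_t i = 0; i < len; i++) {
    md[i] = (uint8_t)(ctx->lane + ctx->squeezed + i);
  }
  ctx->squeezed += len;
  return 1;
}

/* The caller passes a block count; the conversion to bytes happens here. */
static int toy_squeezeblocks_x4(uint8_t *md0, uint8_t *md1, uint8_t *md2,
                                uint8_t *md3, toy_ctx_x4 *ctx, size_t blks) {
  const size_t len = blks * TOY_SHAKE128_RATE;
  return toy_squeeze(md0, &(*ctx)[0], len) &&
         toy_squeeze(md1, &(*ctx)[1], len) &&
         toy_squeeze(md2, &(*ctx)[2], len) &&
         toy_squeeze(md3, &(*ctx)[3], len);
}
```

With this shape, a call site that wants two blocks per lane passes `2`, and each output buffer must hold at least `2 * 168` bytes.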

Thank you! I just updated the code.

Add a detailed description of the return values of the SHA3 and SHAKE functions. Specify the x4 return values and failure conditions to ease the mlkem transition to the aws-lc batched SHAKE implementation (based on mlkem input values). This is needed since mlkem does not check SHAKE return codes.

hanno-becker · 2025-03-08T05:41:04Z

crypto/fipsmodule/sha/sha3_test.cc

+    ASSERT_TRUE(SHAKE128_Absorb_once_x4(&ctx, random_in[0], random_in[1], random_in[2], random_in[3],
+                                                                                          RAND_BYTES_x4));
+    ASSERT_TRUE(SHAKE128_Squeezeblocks_x4(digest_x4[0], digest_x4[1], digest_x4[2], digest_x4[3],
+                                                        &ctx, RAND_OUT_BYTES_BLCKS * SHAKE128_BLOCKSIZE));


Squeezeblocks should take the number of blocks, not the number of bytes, and the multiplication with SHAKE128_BLOCKSIZE to get the number of bytes should be in the implementation.

The batched block squeeze function takes a number of blocks and internally requests the equivalent number of bytes from the underlying generic SHAKE_Squeeze function.

justsmth · 2025-03-10T18:58:08Z

crypto/fipsmodule/sha/sha3.c

 // FIPS202 APIs manage internal input/output buffer on top of Keccak1600 API layer
+// FIPS202_Reset zero's |ctx| fields.


NP: It seems that the "FIPS202 APIs manage ..." comment might be better placed inside a multi-line /* ... */ comment as it relates to several functions below.

Thanks! Just updated.

andrewhop · 2025-03-10T20:39:57Z

crypto/fipsmodule/sha/internal.h

+OPENSSL_EXPORT int SHAKE128_Absorb_once_x4(KECCAK1600_CTX_x4 *ctx, const void *data0, const void *data1,
+                                  const void *data2, const void *data3, size_t len);


NP: can you expand this and clarify that it absorbs |len| bytes from each of data0 through data3, and that they all have to have the same size?

Added the info.

andrewhop · 2025-03-10T20:40:32Z

crypto/fipsmodule/sha/internal.h

+OPENSSL_EXPORT int SHAKE128_Squeezeblocks_x4(uint8_t *md0, uint8_t *md1, uint8_t *md2, uint8_t *md3,
+                                  KECCAK1600_CTX_x4 *ctx, size_t blks);


Why are you changing the parameter order from SHAKE128_Absorb_once_x4?

I was following the SHAKE/SHA3 {Init, Update/Absorb, Final/Squeeze} pattern:

aws-lc/crypto/fipsmodule/sha/internal.h

Line 441 in 4fca7ae

int SHAKE_Absorb(KECCAK1600_CTX *ctx, const void *data,

aws-lc/crypto/fipsmodule/sha/internal.h

Line 450 in 4fca7ae

int SHAKE_Squeeze(uint8_t *md, KECCAK1600_CTX *ctx, size_t len);

But I can change them all if it looks cleaner.

Huh, that's fine.

andrewhop · 2025-03-10T20:43:38Z

crypto/fipsmodule/sha/internal.h

+// As part of MLKEM PQ algorithm: SHAKE128_Init_x4 always returns 1 since it
+// is called with a valid |ctx|.


This makes sense to have here, we don't know how the ML-KEM code is going to call this function or if this is even true. This feels like this comment belongs in the ML-KEM code. Instead what we could put here is something like "If called with valid inputs this function never fails" which is true of most AWS-LC functions, the only case where valid inputs would cause AWS-LC to return a failure is if malloc failed.

Agree. Just fixed it. Thanks.

andrewhop · 2025-03-10T20:55:31Z

crypto/fipsmodule/sha/sha3.c

+  int ok = (SHAKE_Init(&(*ctx)[0], SHAKE128_BLOCKSIZE) &&
+            SHAKE_Init(&(*ctx)[1], SHAKE128_BLOCKSIZE) &&
+            SHAKE_Init(&(*ctx)[2], SHAKE128_BLOCKSIZE) &&
+            SHAKE_Init(&(*ctx)[3], SHAKE128_BLOCKSIZE));


Is there a reason to take the address of the dereferenced pointer?

Suggested change

-            SHAKE_Init(&(*ctx)[3], SHAKE128_BLOCKSIZE));
+            SHAKE_Init(ctx[3], SHAKE128_BLOCKSIZE));

Since ctx is a pointer to the array of structs (not the array itself):
typedef KECCAK1600_CTX KECCAK1600_CTX_x4[4];
...
KECCAK1600_CTX_x4 ctx;
ASSERT_TRUE(SHAKE128_Init_x4(&ctx));
We need to dereference it to get back to the array of structs, then index it to reach a given element, e.g., struct [3]; finally we take the address of that struct, because the underlying function expects a pointer:
int SHAKE_Init(KECCAK1600_CTX *ctx, size_t block_size);
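The pointer-to-array subtlety can be reproduced in isolation. The sketch below uses a hypothetical `toy_ctx` in place of KECCAK1600_CTX; only the shape of the typedef matters. It also shows why the suggested `ctx[3]` would compile yet be wrong: indexing the pointer steps in units of the whole four-element array, and the resulting array silently decays to an element pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for KECCAK1600_CTX; only the typedef shape matters. */
typedef struct { int id; } toy_ctx;
typedef toy_ctx toy_ctx_x4[4];

/* |ctx| points to the whole array, so |*ctx| is the array and
 * |(*ctx)[i]| is element i. |&(*ctx)[i]| is therefore a |toy_ctx *|. */
static toy_ctx *toy_lane(toy_ctx_x4 *ctx, size_t i) {
  assert(i < 4);
  return &(*ctx)[i];
}

/* Indexing the pointer directly steps in units of the whole array:
 * |ctx[1]| would already lie past the single array the caller owns.
 * sizeof does not evaluate its operand, so this is safe to compute. */
static size_t toy_stride_of_pointer_index(toy_ctx_x4 *ctx) {
  return sizeof(ctx[1]);
}
```

Here `toy_lane(&ctx, 3)` returns the address of the fourth struct, while the stride of `ctx[i]` is `4 * sizeof(toy_ctx)`, so `ctx[3]` would refer to memory three whole arrays past the start.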

Oh, that's subtle

crypto/fipsmodule/sha/sha3_test.cc

Co-authored-by: Andrew Hopkins <andhop@amazon.com>

hanno-becker · 2025-03-11T04:28:44Z

crypto/fipsmodule/sha/internal.h

+// SHA3_Init initialises |ctx| fields through |FIPS202_Init| and
+// returns 1 on success and 0 on failure. SHA3_Init fails if
+// |ctx| is nullptr or |bitlen| is inproper for SHA3 (any security
+// level) block size length. Otherwise, it propagates the |FIPS202_Init|


As a caller of this function, the details of which functions are called underneath are not important -- but the caller needs to understand failure conditions. Here, one could recursively lookup the documentation of FIPS202_Init to understand its failure conditions, but it's a bit cumbersome. (When) Does FIPS202_Init fail?

Is there a way one can express success/failure conditions at this level without resorting to details of which functions are being called?

Similar questions below.

I updated the description

aws-lc/crypto/fipsmodule/sha/internal.h

Line 422 in 026f32e

* SHA3 context must go through the flow: (a) Init, (b) Update [multiple times],

following the suggested documentation format. Thank you.

hanno-becker · 2025-03-11T04:33:02Z

crypto/fipsmodule/sha/internal.h

+// of equal length of |len| bytes through four consecutive calls to |SHAKE_Absorb|
+// and returns 1 on success and 0 on failure. SHAKE128_Absorb_once_x4 succeeds when
+// all four |SHAKE_Absorb| functions succeed. It fails on the first |SHAKE_Absorb|
+// function fail, however, if called with valid inputs, this function never fails.


It is unclear what 'valid' means here. You want the input to come fresh out of SHAKE128_Init_x4, or?

'Valid input' -> 'valid input pointer'.
However, I updated this following the suggested documentation format. Thank you.

hanno-becker

Thank you @manastasova for the further work!

I think there is still need for improvement in the documentation: We should not re-state implementation details of which lower level functions are called (this will change), but provide a high-level contract to the caller indicating when to expect success/failure. This may not always be possible, but in this instance, I think it may.

Concretely, could one say something like: SHAKE context must go through the flow (a) Init, (b) Absorb [many times], (c) Squeeze [many times], (d) Final. If this call-discipline is maintained and the pointers passed to the functions are valid and of respective sizes, there will be no error.

Similarly, for SHAKEx4, it would be: If you go through the flow (a) Init, (b) Absorb_once [once], (c) Squeezeblocks [many times], (d) Final -- and pointers are always valid, then there will be no error.

Can you clarify if the above is true? If so, I think this would be a good high-level documentation that callers such as ML-KEM/ML-DSA can reason against when ignoring return values, without being concerned with which lower level functions are being called, and what their failure conditions may be.
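As a rough illustration of what such a call-discipline contract means for a caller, here is a toy model. The phase names, the `toy_xof` type, and all three functions are hypothetical and say nothing about how aws-lc tracks state internally; the sketch only shows that, under an Init, Absorb, Squeeze flow with valid pointers, every call returns 1, and that a call outside the flow is the only source of failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical phases of a toy XOF context; names are illustrative only. */
typedef enum { TOY_UNINIT, TOY_ABSORB, TOY_SQUEEZE } toy_phase;
typedef struct { toy_phase phase; size_t absorbed; size_t squeezed; } toy_xof;

/* Init: always succeeds when given a valid pointer. */
static int toy_xof_init(toy_xof *x) {
  if (x == NULL) return 0;
  x->phase = TOY_ABSORB;
  x->absorbed = 0;
  x->squeezed = 0;
  return 1;
}

/* Absorb: allowed any number of times, but only before the first squeeze. */
static int toy_xof_absorb(toy_xof *x, const void *data, size_t len) {
  if (x == NULL || (data == NULL && len != 0)) return 0;
  if (x->phase != TOY_ABSORB) return 0;
  x->absorbed += len;
  return 1;
}

/* Squeeze: the first call ends the absorb phase; later calls continue. */
static int toy_xof_squeeze(toy_xof *x, unsigned char *out, size_t len) {
  if (x == NULL || (out == NULL && len != 0)) return 0;
  if (x->phase == TOY_UNINIT) return 0;
  x->phase = TOY_SQUEEZE;
  for (size_t i = 0; i < len; i++) out[i] = 0; /* placeholder output */
  x->squeezed += len;
  return 1;
}
```

A caller that can show it follows the flow and passes valid pointers can then argue, against the contract alone, that ignoring the return values is safe.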

manastasova · 2025-03-11T19:37:38Z

Thanks @hanno-becker! I updated the comments based on your suggestions.

andrewhop · 2025-03-11T23:08:59Z

crypto/fipsmodule/sha/internal.h

+OPENSSL_EXPORT int SHAKE128_Squeezeblocks_x4(uint8_t *md0, uint8_t *md1, uint8_t *md2, uint8_t *md3,
+                                  KECCAK1600_CTX_x4 *ctx, size_t blks);


Huh, that's fine.

andrewhop · 2025-03-11T23:11:09Z

crypto/fipsmodule/sha/sha3.c

+  int ok = (SHAKE_Init(&(*ctx)[0], SHAKE128_BLOCKSIZE) &&
+            SHAKE_Init(&(*ctx)[1], SHAKE128_BLOCKSIZE) &&
+            SHAKE_Init(&(*ctx)[2], SHAKE128_BLOCKSIZE) &&
+            SHAKE_Init(&(*ctx)[3], SHAKE128_BLOCKSIZE));


Oh, that's subtle

hanno-becker · 2025-03-12T04:59:30Z

crypto/fipsmodule/sha/internal.h

+ * of SHA3/SHAKE API layer
+ *
+ * SHA3/SHAKE single-shot functions never fail when the later call-discipline is
+ * adhered to: (a) the pointers passed to the functions are valid, is satisfied.


The sentence structure seems off here

hanno-becker · 2025-03-12T05:02:01Z

crypto/fipsmodule/sha/internal.h

+
+// SHAKE128_Init_x4 is a batched API that operates on four independent
+// Keccak bitstates. It initialises all four |ctx| fields through four
+// consecutive calls to |SHAKE_Init| and returns 1 on success and 0 on failure.


Can we remove the mention of the implementation details here and elsewhere? Once we start implementing fast batched x4, we won't merely fall back to x1 anymore.

I tried to follow the general pattern for function description in the file/library. I was thinking once we start implementing it, we could update these comments.
I just fixed it. Thank you.

Add x4 batched SHAKE128 and SHAKE256 APIs

ea47ba4

Batched Keccak APIs would be used by mlkem/mldsa where multiple SHAKE XOF functions are computed

manastasova requested a review from a team as a code owner March 6, 2025 00:30

hanno-becker suggested changes Mar 6, 2025

hanno-becker mentioned this pull request Mar 6, 2025

Document failure conditions for SHAKE API #2250

Open

hanno-becker reviewed Mar 6, 2025

manastasova and others added 4 commits March 6, 2025 09:53

Merge branch 'main' into sha3_batched

e1ebb0e

Merge branch 'main' into sha3_batched

64675e4

Merge branch 'main' into sha3_batched

d0994ca

hanno-becker reviewed Mar 8, 2025

manastasova and others added 2 commits March 10, 2025 09:36

Merge branch 'main' into sha3_batched

f9fb913

Update batched block squeeze function

3d6ca23

The batched block squeeze function takes a number of blocks and internally requests the equivalent number of bytes from the underlying generic SHAKE_Squeeze function.

justsmth previously approved these changes Mar 10, 2025

manastasova and others added 2 commits March 10, 2025 12:42

Merge branch 'main' into sha3_batched

e6e78a2

Formatting

8c9a1a9

manastasova dismissed justsmth’s stale review via 8c9a1a9 March 10, 2025 19:51

andrewhop reviewed Mar 10, 2025

manastasova and others added 3 commits March 10, 2025 15:52

Merge branch 'main' into sha3_batched

55f9949

Comment Updates

87a41b1

Update crypto/fipsmodule/sha/sha3_test.cc

201b74f

Co-authored-by: Andrew Hopkins <andhop@amazon.com>

hanno-becker reviewed Mar 11, 2025

hanno-becker suggested changes Mar 11, 2025

manastasova and others added 2 commits March 11, 2025 12:27

Merge branch 'main' into sha3_batched

ce086d1

Update Documentation SHA3/SHAKE APIs

026f32e

Merge branch 'main' into sha3_batched

2abddfa

andrewhop previously approved these changes Mar 11, 2025

hanno-becker reviewed Mar 12, 2025

Update comments

1d0178f

manastasova dismissed andrewhop’s stale review via 1d0178f March 12, 2025 13:47

justsmth approved these changes Mar 12, 2025

andrewhop approved these changes Mar 12, 2025

andrewhop merged commit 1cf1ee2 into aws:main Mar 12, 2025
102 of 107 checks passed

