std.crypto.chacha: support larger vectors on AVX2 and AVX512 targets #15809
Ryzen 7 7700, ChaCha20/8 stream, long outputs:
Apple M1 CPUs seem to also benefit from a 4-way implementation on micro-benchmarks, but only on full-round ChaCha, and I'm not sure we can generalize it to other aarch64 CPUs, or to real workloads.
The initial plan was therefore to enable this only on x86_64. It has since been verified to also improve performance on a Cortex A72, so it is enabled on aarch64, too.

Bump the rand.chacha buffer a tiny bit to take advantage of this. More than 8 blocks doesn't seem to make any measurable difference.
ChaChaPoly also gets a small performance boost from this, albeit Poly1305 remains the bottleneck.
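
For readers unfamiliar with how a wider implementation can be selected per target, here is a minimal sketch of compile-time selection of the parallelism degree. It is not the code from this PR: the names `degree` and `Lane` are hypothetical, and the mapping of AVX2 to 2-way and AVX512 to 4-way is an assumption based on how many 128-bit ChaCha rows fit in a 256-bit or 512-bit register. The 4-way choice on aarch64 follows the description above.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical: number of ChaCha blocks processed side by side.
// Assumption: AVX512F -> 4 blocks, AVX2 -> 2 blocks, NEON -> 4 blocks.
const degree: usize = switch (builtin.cpu.arch) {
    .x86_64 => if (std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f))
        4
    else if (std.Target.x86.featureSetHas(builtin.cpu.features, .avx2))
        2
    else
        1,
    .aarch64 => if (std.Target.aarch64.featureSetHas(builtin.cpu.features, .neon)) 4 else 1,
    else => 1,
};

// One row of the 4x4 u32 ChaCha state, replicated for `degree` blocks.
const Lane = @Vector(4 * degree, u32);

pub fn main() void {
    std.debug.print("blocks per iteration: {d}, lane size: {d} bytes\n", .{ degree, @sizeOf(Lane) });
}
```

Because the switch is resolved at compile time, only the branch for the build target is analyzed, and a build without the relevant CPU features falls back to a single block per iteration.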