Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chacha20: Add full NEON backend. #310

Merged
merged 4 commits into from
Dec 1, 2022
Merged

chacha20: Add full NEON backend. #310

merged 4 commits into from
Dec 1, 2022

Conversation

codahale
Copy link
Contributor

I ported the NEON implementation of ChaCha from Crypto++ (public domain) to Rust with aarch64 intrinsics for a significant performance boost.

Observed performance changes on an Apple M1 Air:

 name                   chacha-soft ns/iter  chacha-neon ns/iter  diff ns/iter   diff %  speedup
 chacha12_bench1_16b    34 (470 MB/s)        28 (571 MB/s)                  -6  -17.65%   x 1.21
 chacha12_bench2_256b   477 (536 MB/s)       132 (1939 MB/s)              -345  -72.33%   x 3.61
 chacha12_bench3_1kib   1,897 (539 MB/s)     503 (2035 MB/s)            -1,394  -73.48%   x 3.77
 chacha12_bench4_16kib  30,811 (531 MB/s)    7,914 (2070 MB/s)         -22,897  -74.31%   x 3.89
 chacha20_bench1_16b    51 (313 MB/s)        47 (340 MB/s)                  -4   -7.84%   x 1.09
 chacha20_bench2_256b   777 (329 MB/s)       212 (1207 MB/s)              -565  -72.72%   x 3.67
 chacha20_bench3_1kib   3,088 (331 MB/s)     821 (1247 MB/s)            -2,267  -73.41%   x 3.76
 chacha20_bench4_16kib  50,251 (326 MB/s)    13,001 (1260 MB/s)        -37,250  -74.13%   x 3.87
 chacha8_bench1_16b     26 (615 MB/s)        19 (842 MB/s)                  -7  -26.92%   x 1.37
 chacha8_bench2_256b    335 (764 MB/s)       92 (2782 MB/s)               -243  -72.54%   x 3.64
 chacha8_bench3_1kib    1,328 (771 MB/s)     344 (2976 MB/s)              -984  -74.10%   x 3.86
 chacha8_bench4_16kib   21,184 (773 MB/s)    5,371 (3050 MB/s)         -15,813  -74.65%   x 3.94

Closes #287.

I’m not entirely certain I’ve got the flag/feature/cpuid token stuff right, and would appreciate any guidance about that. At this point the best I’ve got is that it very definitely works on my machine, an M1 Air. Also, no idea how one runs GitHub Actions on a aarch64/NEON platform. QEMU?

Observed performance changes on an Apple M1 Air:

```
 name                   chacha-soft ns/iter  chacha-neon ns/iter  diff ns/iter   diff %  speedup
 chacha12_bench1_16b    34 (470 MB/s)        28 (571 MB/s)                  -6  -17.65%   x 1.21
 chacha12_bench2_256b   477 (536 MB/s)       132 (1939 MB/s)              -345  -72.33%   x 3.61
 chacha12_bench3_1kib   1,897 (539 MB/s)     503 (2035 MB/s)            -1,394  -73.48%   x 3.77
 chacha12_bench4_16kib  30,811 (531 MB/s)    7,914 (2070 MB/s)         -22,897  -74.31%   x 3.89
 chacha20_bench1_16b    51 (313 MB/s)        47 (340 MB/s)                  -4   -7.84%   x 1.09
 chacha20_bench2_256b   777 (329 MB/s)       212 (1207 MB/s)              -565  -72.72%   x 3.67
 chacha20_bench3_1kib   3,088 (331 MB/s)     821 (1247 MB/s)            -2,267  -73.41%   x 3.76
 chacha20_bench4_16kib  50,251 (326 MB/s)    13,001 (1260 MB/s)        -37,250  -74.13%   x 3.87
 chacha8_bench1_16b     26 (615 MB/s)        19 (842 MB/s)                  -7  -26.92%   x 1.37
 chacha8_bench2_256b    335 (764 MB/s)       92 (2782 MB/s)               -243  -72.54%   x 3.64
 chacha8_bench3_1kib    1,328 (771 MB/s)     344 (2976 MB/s)              -984  -74.10%   x 3.86
 chacha8_bench4_16kib   21,184 (773 MB/s)    5,371 (3050 MB/s)         -15,813  -74.65%   x 3.94
 ```
@codahale
Copy link
Contributor Author

Welp, didn’t realize that vreinterpretq_u64_u32 was nightly-only. I guess feel free to revive this when/if stable SIMD lands. ☹️

@tarcieri
Copy link
Member

no idea how one runs GitHub Actions on a aarch64/NEON platform. QEMU?

As it were we have another PR open to add ARM support to the keccak crate which covers a bunch of this: RustCrypto/sponges#23

You can use cross. Here's a example: https://github.com/RustCrypto/block-ciphers/blob/4334b85/.github/workflows/aes.yml#L222-L249

Unfortunately there's no M1-specific solution until this lands: github/roadmap#528

Welp, didn’t realize that vreinterpretq_u64_u32 was nightly-only.

FWIW we'd be fine with a nightly-only feature. We have similar nightly-only ARM features in the aes and polyval crates which wrap ARMv8 hardware intrinsics supported by M1s.

I guess feel free to revive this when/if stable SIMD lands.

Sure looking forward to the eventual stabilization of core::simd and the day we can have a portable SIMD implementation of ChaCha.

@codahale
Copy link
Contributor Author

FWIW, this code compiles and passes tests on my M1 Air using stable-aarch64-apple-darwin 1.64.0. I don’t know what to make of that. Any suggestions?

@tarcieri
Copy link
Member

Also regarding this specifically:

...vreinterpretq_u64_u32 was nightly-only.

There are various stable workarounds, such as using transmute, pointer casts, or core::slice::from_raw_parts.

I'm guessing that's not the only nightly-only NEON intrinsics support you need, though.

@tarcieri
Copy link
Member

tarcieri commented Oct 30, 2022

this code compiles and passes tests on my M1 Air using stable-aarch64-apple-darwin 1.64.0

Maybe all the intrinsics you need are stabilized on 1.64?

If that's the case, you can just feature-gate support to prevent MSRV breakages.

It'd be good to know the actual MSRV of the feature.

@codahale
Copy link
Contributor Author

Ok, looks like vreinterpretq_u64_u32 landed in 1.61. I re-added the neon feature to handle the MSRV issue and added target_feature gates for the backend.

@codahale
Copy link
Contributor Author

Enabled CI (using the existing but commented-out cross matrix item) and updated the README.

I think this is good to go. Let me know if anything else needs changing.

@tarcieri
Copy link
Member

Cool, will try to review this week sometime

@@ -32,6 +32,7 @@ hex-literal = "0.3.3"
[features]
std = ["cipher/std"]
zeroize = ["cipher/zeroize"]
neon = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth considering using a cfg! attribute here /cc @newpavlov

I think the only reason to have a gated implementation at all is MSRV compatibility. We can bump to MSRV 1.61 in the next release (bumping to 1.60 has some nice additions like weak feature activation).

Semi-related issue: RustCrypto/sponges#24 (comment)

FWIW, we use cfg! attributes to gate aes and pmull intrinsics on ARMv8

@tarcieri
Copy link
Member

This otherwise seems like a fairly straightforward port of chacha_simd and I'm fine to merge it if we can figure out how it should be gated.

@tarcieri
Copy link
Member

tarcieri commented Dec 1, 2022

Going to merge this and follow up on how it should be gated

@Mygod
Copy link

Mygod commented Dec 20, 2022

Could you make a new release with this PR?

@tarcieri
Copy link
Member

Unfortunately I haven’t added the gating I suggested which should be implemented prior to a release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Port chacha20 SIMD backends
3 participants