chacha20: Add full NEON backend. #310

codahale · 2022-10-30T01:14:10Z

I ported the NEON implementation of ChaCha from Crypto++ (public domain) to Rust with aarch64 intrinsics for a significant performance boost.

Observed performance changes on an Apple M1 Air:

 name                   chacha-soft ns/iter  chacha-neon ns/iter  diff ns/iter   diff %  speedup
 chacha12_bench1_16b    34 (470 MB/s)        28 (571 MB/s)                  -6  -17.65%   x 1.21
 chacha12_bench2_256b   477 (536 MB/s)       132 (1939 MB/s)              -345  -72.33%   x 3.61
 chacha12_bench3_1kib   1,897 (539 MB/s)     503 (2035 MB/s)            -1,394  -73.48%   x 3.77
 chacha12_bench4_16kib  30,811 (531 MB/s)    7,914 (2070 MB/s)         -22,897  -74.31%   x 3.89
 chacha20_bench1_16b    51 (313 MB/s)        47 (340 MB/s)                  -4   -7.84%   x 1.09
 chacha20_bench2_256b   777 (329 MB/s)       212 (1207 MB/s)              -565  -72.72%   x 3.67
 chacha20_bench3_1kib   3,088 (331 MB/s)     821 (1247 MB/s)            -2,267  -73.41%   x 3.76
 chacha20_bench4_16kib  50,251 (326 MB/s)    13,001 (1260 MB/s)        -37,250  -74.13%   x 3.87
 chacha8_bench1_16b     26 (615 MB/s)        19 (842 MB/s)                  -7  -26.92%   x 1.37
 chacha8_bench2_256b    335 (764 MB/s)       92 (2782 MB/s)               -243  -72.54%   x 3.64
 chacha8_bench3_1kib    1,328 (771 MB/s)     344 (2976 MB/s)              -984  -74.10%   x 3.86
 chacha8_bench4_16kib   21,184 (773 MB/s)    5,371 (3050 MB/s)         -15,813  -74.65%   x 3.94
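
For anyone checking the derived columns, here is a minimal sketch of the arithmetic that appears to be behind them. It assumes MB/s means 10^6 bytes per second truncated to an integer, and that the speedup is simply the ratio of the two ns/iter figures. The function names are hypothetical and not part of the crate or the benchmark harness.

```rust
/// Throughput in MB/s (10^6 bytes per second), truncated to an integer.
/// bytes / ns is GB/s, so multiplying by 1000 gives MB/s.
/// Assumption: this matches how the table's MB/s column was derived.
pub fn throughput_mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1_000 / ns_per_iter
}

/// Speedup factor of the NEON backend over the software backend.
pub fn speedup(soft_ns: u64, neon_ns: u64) -> f64 {
    soft_ns as f64 / neon_ns as f64
}

/// Relative change in ns/iter, in percent (negative means faster).
pub fn diff_percent(soft_ns: u64, neon_ns: u64) -> f64 {
    (neon_ns as f64 - soft_ns as f64) / soft_ns as f64 * 100.0
}
```

Plugging in the chacha12_bench2_256b row (256 bytes, 477 ns vs. 132 ns) reproduces the reported 536 MB/s, 1939 MB/s, -72.33% and x 3.61 to within rounding.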

Closes #287.

I’m not entirely certain I’ve got the flag/feature/cpuid token stuff right, and would appreciate any guidance about that. At this point the best I’ve got is that it very definitely works on my machine, an M1 Air. Also, no idea how one runs GitHub Actions on an aarch64/NEON platform. QEMU?


codahale · 2022-10-30T01:21:02Z

Welp, didn’t realize that vreinterpretq_u64_u32 was nightly-only. I guess feel free to revive this when/if stable SIMD lands. ☹️

tarcieri · 2022-10-30T01:44:48Z

no idea how one runs GitHub Actions on an aarch64/NEON platform. QEMU?

As it happens, we have another PR open to add ARM support to the keccak crate, which covers a bunch of this: RustCrypto/sponges#23

You can use cross. Here's an example: https://github.com/RustCrypto/block-ciphers/blob/4334b85/.github/workflows/aes.yml#L222-L249

Unfortunately there's no M1-specific solution until this lands: github/roadmap#528

Welp, didn’t realize that vreinterpretq_u64_u32 was nightly-only.

FWIW we'd be fine with a nightly-only feature. We have similar nightly-only ARM features in the aes and polyval crates which wrap ARMv8 hardware intrinsics supported by M1s.

I guess feel free to revive this when/if stable SIMD lands.

Sure, looking forward to the eventual stabilization of core::simd and the day we can have a portable SIMD implementation of ChaCha.

codahale · 2022-10-30T01:46:36Z

FWIW, this code compiles and passes tests on my M1 Air using stable-aarch64-apple-darwin 1.64.0. I don’t know what to make of that. Any suggestions?

tarcieri · 2022-10-30T01:47:27Z

Also regarding this specifically:

...vreinterpretq_u64_u32 was nightly-only.

There are various stable workarounds, such as using transmute, pointer casts, or core::slice::from_raw_parts.

I'm guessing that's not the only nightly-only NEON intrinsics support you need, though.
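
As a rough illustration of the transmute workaround mentioned above, the sketch below reinterprets four 32-bit lanes as two 64-bit lanes without touching the bits. The array version is a portable stand-in so the idea can be tried on any target; the aarch64-only function applies the same trick to the real NEON vector types. Both function names are hypothetical and this is not the code from the PR.

```rust
use core::mem::transmute;

/// Portable stand-in: view four u32 lanes as two u64 lanes.
/// The bits are unchanged; only the type (and lane grouping) differs.
pub fn reinterpret_u64_u32(lanes: [u32; 4]) -> [u64; 2] {
    // SAFETY: both types are exactly 16 bytes of plain integers,
    // so every bit pattern is valid for the destination type.
    unsafe { transmute::<[u32; 4], [u64; 2]>(lanes) }
}

/// The same idea on the actual NEON vector types (aarch64 only).
/// Assumption: stands in for vreinterpretq_u64_u32 on toolchains
/// where that intrinsic is unavailable.
#[cfg(target_arch = "aarch64")]
#[allow(dead_code)]
pub fn reinterpret_neon(
    v: core::arch::aarch64::uint32x4_t,
) -> core::arch::aarch64::uint64x2_t {
    // SAFETY: both are 128-bit vector types with no invalid bit patterns.
    unsafe { transmute(v) }
}
```

How the u32 lanes pair up into u64 values depends on the target's endianness, so code relying on specific lane values should account for that.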

tarcieri · 2022-10-30T01:48:34Z

this code compiles and passes tests on my M1 Air using stable-aarch64-apple-darwin 1.64.0

Maybe all the intrinsics you need are stabilized on 1.64?

If that's the case, you can just feature-gate support to prevent MSRV breakages.

It'd be good to know the actual MSRV of the feature.

codahale · 2022-10-30T02:03:17Z

Ok, looks like vreinterpretq_u64_u32 landed in 1.61. I re-added the neon feature to handle the MSRV issue and added target_feature gates for the backend.
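
For readers unfamiliar with this kind of gating, a minimal sketch of the pattern follows. It assumes a Cargo feature named neon (as in this PR) combined with target_arch and target_feature checks; the backend_name function is hypothetical and only reports which branch was compiled in, whereas a real crate would select a backend module instead.

```rust
/// Compiled only when the `neon` Cargo feature is enabled AND the
/// target is aarch64 with NEON available at compile time.
#[cfg(all(feature = "neon", target_arch = "aarch64", target_feature = "neon"))]
pub fn backend_name() -> &'static str {
    "neon"
}

/// Fallback for every other configuration, including aarch64 builds
/// where the `neon` feature was not requested.
#[cfg(not(all(feature = "neon", target_arch = "aarch64", target_feature = "neon")))]
pub fn backend_name() -> &'static str {
    "soft"
}
```

Because the feature is off by default, builds on older toolchains never see the NEON code path, which is what keeps the MSRV intact.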

codahale · 2022-10-30T14:36:59Z

Enabled CI (using the existing but commented-out cross matrix item) and updated the README.

I think this is good to go. Let me know if anything else needs changing.

tarcieri · 2022-10-31T16:04:37Z

Cool, will try to review this week sometime

tarcieri · 2022-11-12T23:15:22Z

chacha20/Cargo.toml

@@ -32,6 +32,7 @@ hex-literal = "0.3.3"
 [features]
 std = ["cipher/std"]
 zeroize = ["cipher/zeroize"]
+neon = []


It might be worth considering using a cfg attribute here. /cc @newpavlov

I think the only reason to have a gated implementation at all is MSRV compatibility. We can bump to MSRV 1.61 in the next release (bumping to 1.60 has some nice additions like weak feature activation).

Semi-related issue: RustCrypto/sponges#24 (comment)

FWIW, we use cfg attributes to gate the aes and pmull intrinsics on ARMv8.

tarcieri · 2022-11-12T23:19:04Z

This otherwise seems like a fairly straightforward port of chacha_simd and I'm fine to merge it if we can figure out how it should be gated.

tarcieri · 2022-12-01T21:24:40Z

Going to merge this and follow up on how it should be gated

Mygod · 2022-12-20T17:09:51Z

Could you make a new release with this PR?

tarcieri · 2022-12-20T17:15:24Z

Unfortunately, I haven’t added the gating I suggested, which should be implemented prior to a release.

chacha20: Gate NEON backend behind neon feature.

35492bb

codahale added 2 commits October 30, 2022 08:31

chacha20: Re-enable CI for NEON backend

5cb2100

chacha20: Update README to include neon

b748d1e

tarcieri reviewed Nov 12, 2022

tarcieri requested a review from newpavlov November 12, 2022 23:19

tarcieri merged commit 6217574 into RustCrypto:master Dec 1, 2022

codahale deleted the chacha-neon branch December 1, 2022 21:45

zonyitoo mentioned this pull request Dec 20, 2022

After upgrading the server to shadowsocks-rust v1.15.0, the client shadowsocks-android v5.3.1-preview cannot be used shadowsocks/shadowsocks-android#2966

Closed

sorairolake mentioned this pull request Feb 12, 2023

neon feature is not yet available on crates.io shadowsocks/shadowsocks-crypto#19

Closed

tarcieri mentioned this pull request Feb 15, 2023

Publish chacha20 crate which is available neon feature #312

Closed

sylvain101010 mentioned this pull request Apr 1, 2023

Make the neon feature available for ChaCha20 #316

Closed

tarcieri mentioned this pull request Apr 1, 2023

chacha20 v0.9.1 #318

Merged
