Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poly1305 AVX2 backend #49

Merged
merged 8 commits into from
Sep 9, 2020
Merged

Poly1305 AVX2 backend #49

merged 8 commits into from
Sep 9, 2020

Conversation

str4d
Copy link
Contributor

@str4d str4d commented Apr 6, 2020

The AVX2 backend can be selected at compile time with RUSTFLAGS "-Ctarget-feature=+avx2".

@str4d
Copy link
Contributor Author

str4d commented Apr 6, 2020

Currently a draft PR because there are still several bugs to fix:

  • The "total" test vector from poly1305-donna is the only remaining test vector that fail when run on the AVX2 backend, indicating that there is a bug somewhere in the handling of BlocksX4.
  • I ran AFL to fuzz the AVX2 backend against the software implementation, and it found eight unique differences. I've fixed one of these so far (x was not being fully carried in impl Add<Aligned130> for AdditionKey).

@str4d
Copy link
Contributor Author

str4d commented Apr 6, 2020

Welp, somewhere along the way during my refactoring and documenting process, I've managed to make the AVX2 implementation incredibly slow 😑 When I first got the code running, it was speeding rage up from being 2.1x slower than age to only 1.3x slower (making the AVX2 backend around 3.5x faster). But now this branch is around 45x slower than the software implementation! I've clearly done something terrible to the underlying AVX2 assembly in my refactor. The main culprit I can think of would be my renaming of variables: the original code reuses variables all over the place, whereas I assigned new variables in many places to clarify the usage.

@tarcieri
Copy link
Member

tarcieri commented Apr 6, 2020

The main culprit I can think of would be my renaming of variables: the original code reuses variables all over the place, whereas I assigned new variables in many places to clarify the usage.

Huh, strange, I would assume that LLVM's SSA transform would render these equivalent, at least if the changes were purely superficial...

@tarcieri
Copy link
Member

tarcieri commented Apr 6, 2020

Regarding runtime selection of backends: we've largely eschewed it so far for a few different reasons.

@newpavlov's suggestion has been to push that selection higher up into e.g. the ChaCha20Poly1305 AEAD construction.

You might check out this (unmerged) PR for runtime detection of PCLMULQDQ for POLYVAL for the past discussion: #11

@str4d
Copy link
Contributor Author

str4d commented Apr 6, 2020

Huh, strange, I would assume that LLVM's SSA transform would render these equivalent, at least if the changes were purely superficial...

The changes were all just reorganising the existing AVX2 intrinsics to leverage the Rust type system. Next weekend I'll try swapping back in the un-refactored code to confirm it was indeed faster, and then I'll start trying to figure out how the underlying assembly is being altered.

@newpavlov's suggestion has been to push that selection higher up into e.g. the ChaCha20Poly1305 AEAD construction.

That makes sense to me; we should be consistently selecting the AVX2 backends for the ChaCha20 and Poly1305 components together. I do want auto-detection somewhere though, as it's very inconvenient both as an application developer to be forced to generate separate AVX2 binaries, and as a user to need to care about exactly which link I click to get performance. If someone does want smaller AVX2-only binaries though, it should be possible to obtain them.

@str4d
Copy link
Contributor Author

str4d commented Apr 10, 2020

Hmm... it turns out that the AVX2 backend is only ridiculously slow (on my laptop) when compiled without the +avx2 Rust flag. When I add that, it speeds up to... just slightly slower than the software backend. So there are two issues:

  • The runtime detection is not functioning correctly.
    • I'm going to remove it for now, and just rely on the compile-time flag.
  • The AVX2 backend is still not faster than the software backend.
    • I'll look at the assembly and try to figure out why. Maybe Rust/LLVM is just really good at vectorising? More likely there are still mistakes in the AVX2 backend.

@str4d str4d force-pushed the poly1305-avx2 branch 2 times, most recently from dff6ae7 to d7c38e3 Compare April 10, 2020 12:11
@str4d
Copy link
Contributor Author

str4d commented Apr 10, 2020

I've force-pushed to switch to compile-time backend selection. Unfortunately this means removing the comparative fuzzer; I will rework the crashes it found into regular test cases, so we can still figure them out.

As for the speed issue... it looks to me like the AVX2 backend just generates more assembly than the software backend. It would be useful for someone else to try running this branch to see what their experience is.

@tarcieri
Copy link
Member

Unfortunately this means removing the comparative fuzzer

FWIW, the polyval crate always includes the portable backend even when the CLMUL one is enabled, but sets up type aliases for the former in the event the latter is disabled:

https://github.com/RustCrypto/universal-hashes/blob/master/polyval/src/field.rs

Kinda gross, but I did it for the same reason (at least initially): equivalence testing/fuzzing between the two

@codecov-commenter
Copy link

codecov-commenter commented Sep 5, 2020

Codecov Report

Merging #49 into master will decrease coverage by 35.00%.
The diff coverage is 0.20%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #49       +/-   ##
===========================================
- Coverage   69.21%   34.20%   -35.01%     
===========================================
  Files          11       13        +2     
  Lines         471      953      +482     
===========================================
  Hits          326      326               
- Misses        145      627      +482     
Impacted Files Coverage Δ
poly1305/src/avx2.rs 0.00% <0.00%> (ø)
poly1305/src/avx2/helpers.rs 0.00% <0.00%> (ø)
poly1305/src/lib.rs 88.88% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6a1e0c...0256f21. Read the comment docs.

@str4d
Copy link
Contributor Author

str4d commented Sep 5, 2020

I've rebased this onto current HEAD, and... it's now 33% faster than the software backend, with no changes to the AVX2 instructions. No idea what's going on there, but all the tests pass 🤷

@str4d
Copy link
Contributor Author

str4d commented Sep 5, 2020

Looking again at the failures I reported earlier:

  • The "total" test vector from poly1305-donna is the only remaining test vector that fail when run on the AVX2 backend, indicating that there is a bug somewhere in the handling of BlocksX4.

This test vector was removed by @tarcieri; was it just an invalid vector?

  • I ran AFL to fuzz the AVX2 backend against the software implementation, and it found eight unique differences. I've fixed one of these so far (x was not being fully carried in impl Add<Aligned130> for AdditionKey).

I re-added the fuzzer helper, and the eight test cases it found previously. Seven of them are still failing.

@tarcieri
Copy link
Member

tarcieri commented Sep 5, 2020

This test vector was removed by @tarcieri; was it just an invalid vector?

If there were any I removed which didn't get added back it was a mistake as part of some of the refactoring I was doing (#55)

@str4d str4d force-pushed the poly1305-avx2 branch 3 times, most recently from 16fc637 to 938a710 Compare September 6, 2020 12:43
Originally derived from Goll and Gueron's AVX2 C code. The logic has
been extensively rewritten and documented, and several bugs in the
original C code were fixed.
The crash test cases were found by fuzzing with AFL after the AVX2
backend had been reviewed, refactored, and documented.
@str4d
Copy link
Contributor Author

str4d commented Sep 6, 2020

I've rebased the PR (so it now contains all the Poly1305 test vectors), and split out my (likely inefficient) fix for donna_self_test1 and fuzz::crash_6 into a separate commit.

I'm working on fuzz::crash_3 next, but all seven remaining fuzzer crashes exhibit a similar failure mode: if you finalize after some multiple of 4 blocks, the AVX2 backend's accumulator is 0x0100000000 smaller than the software backend's accumulator. My initial guess is that a carry bit from a 32-bit limb is not being correctly propagated somewhere.

@str4d str4d force-pushed the poly1305-avx2 branch 3 times, most recently from 3c481a6 to c1131db Compare September 6, 2020 23:00
@str4d
Copy link
Contributor Author

str4d commented Sep 7, 2020

I found the bug! All tests and previous fuzzer crashes now pass 🎉

@tarcieri
Copy link
Member

tarcieri commented Sep 7, 2020

@str4d want to mark it as ready for review?

No changes to the logic or AVX2 instructions.

Includes documentation for `fuzz::crash_3`.
Fixes all remaining fuzzer crashes and test vectors.
@str4d
Copy link
Contributor Author

str4d commented Sep 7, 2020

I had a couple of cleanups to do, and I wanted to throw the fuzzer at it again for a few minutes (to shake out any other low-hanging crashes). Done the first, and nothing has com LITERALLY as I was typing this, the fuzzer turned up a crash 😂

Found from the included new fuzzer crash.
@str4d
Copy link
Contributor Author

str4d commented Sep 8, 2020

That last crash is now fixed. I found it after 3 minutes of fuzzing, and didn't find anything else after an additional 20 minutes. I'll run some more fuzzing today, but for now I think this is ready to review 😄

@str4d str4d marked this pull request as ready for review September 8, 2020 12:09
@tarcieri tarcieri requested a review from newpavlov September 8, 2020 15:18
@tarcieri tarcieri merged commit 657c174 into RustCrypto:master Sep 9, 2020
@str4d str4d deleted the poly1305-avx2 branch September 9, 2020 08:46
@tarcieri tarcieri mentioned this pull request Sep 29, 2020
@tarcieri tarcieri mentioned this pull request Dec 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants