Implement much faster sha256 and sha512. #41

Closed
wants to merge 11 commits into from

0xdeafbeef

I took the sha256 and sha512 variants from the Linux sources.
On an AMD Ryzen 9 5900HS, comparing

cargo bench

with

RUSTFLAGS=-Ctarget-feature=+avx2,+aes cargo bench

gives the following results:

sha256                  time:   [31.047 ns 31.065 ns 31.083 ns]                    
                        change: [-79.294% -79.275% -79.257%] (p = 0.00 < 0.05)
                        Performance has improved.

sha512                  time:   [135.58 ns 135.79 ns 136.01 ns]                   
                        change: [-34.078% -33.749% -33.500%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe


Closes #5

sha2/build.rs Outdated Show resolved Hide resolved
sha2/build.rs Outdated Show resolved Hide resolved
@tarcieri
Member

Looks very interesting, thanks! Left some notes.

@tarcieri tarcieri requested a review from newpavlov August 29, 2021 19:30
@0xdeafbeef
Author

BTW, is Cargo.lock required?

@tarcieri
Member

> BTW, is Cargo.lock required?

What do you mean by that?

Member

@newpavlov newpavlov left a comment

Thank you! Looks interesting indeed.

sha2/Cargo.toml Outdated Show resolved Hide resolved
sha2/build.rs Outdated Show resolved Hide resolved
sha2/src/lib.rs Show resolved Hide resolved
@0xdeafbeef
Author

> > BTW, is Cargo.lock required?
>
> What do you mean by that?

Why is Cargo.lock kept in a library? To pin the cc version?

@tarcieri
Member

tarcieri commented Aug 29, 2021

@0xdeafbeef it makes the build deterministic, which makes it easier to spot problems arising from particular dependency changes.

It's something we do across the board, although perhaps there are repos like this one which it makes less sense for.

@tarcieri
Member

@0xdeafbeef did you say you compared the core::arch intrinsics version for SHA-NI to the ASM?

If they're the same speed (which is what I'd expect), then it probably doesn't make sense to include ASM SHA-NI support as we already have that case covered in pure Rust.

@0xdeafbeef
Author

> @0xdeafbeef did you say you compared the core::arch intrinsics version for SHA-NI to the ASM?
>
> If they're the same speed (which is what I'd expect), then it probably doesn't make sense to include ASM SHA-NI support as we already have that case covered in pure Rust.

The speed is the same. I think we should include it, because otherwise anyone who enables the asm feature will get a much slower implementation than without it.

@tarcieri
Member

Since we already have the intrinsic code in the sha2 crate, we can detect the sha extension there and use it if available, only then falling back onto the asm if it isn't available, i.e. SHA-NI intrinsics should be a higher precedence than asm, which AFAIK is how it already works.

Otherwise, there is duplication of the feature across the sha2 and sha2-asm crates.
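
The precedence described above (SHA-NI intrinsics first, then asm, then portable code) can be sketched roughly as follows. This is only an illustration: the `Backend` enum and the `select_backend` and `detect_sha_ni` helpers are hypothetical names, not part of the sha2 or sha2-asm crates, and the standard library's `is_x86_feature_detected!` stands in here for whatever detection the crate actually uses.

```rust
/// Hypothetical backend labels, used for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    ShaNi,
    Asm,
    Soft,
}

/// Picks a backend: SHA-NI intrinsics take precedence over asm,
/// and asm takes precedence over the portable implementation.
fn select_backend(has_sha_ni: bool, asm_enabled: bool) -> Backend {
    if has_sha_ni {
        Backend::ShaNi
    } else if asm_enabled {
        Backend::Asm
    } else {
        Backend::Soft
    }
}

/// Runtime detection on x86 targets via the standard library macro.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_sha_ni() -> bool {
    std::is_x86_feature_detected!("sha")
}

/// Other targets have no SHA-NI extension, so detection always fails.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_sha_ni() -> bool {
    false
}
```

With this shape, `select_backend(detect_sha_ni(), cfg!(feature = "asm"))` would never pick the asm path on a CPU that has the SHA extensions, so an asm SHA-NI variant would be redundant.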

@0xdeafbeef 0xdeafbeef requested a review from tarcieri September 4, 2021 14:18
@tarcieri
Member

tarcieri commented Sep 4, 2021

Hmm, the build failure seems unrelated, I think?

sha2/build.rs Outdated Show resolved Hide resolved
Member

@tarcieri tarcieri left a comment

Looks good. One minor suggestion.

@tarcieri
Member

tarcieri commented Sep 5, 2021

@0xdeafbeef can you rebase? I think #42 should've taken care of the build failures.

Member

@newpavlov newpavlov left a comment

Overall, I think it looks good for merging. I have only two nits and it would be nice to rebase it first.

sha2/Cargo.toml Outdated Show resolved Hide resolved
sha2/src/lib.rs Outdated Show resolved Hide resolved
sha2/src/lib.rs Outdated
extern "C" {
fn sha256_compress(state: &mut [u32; 8], block: &[u8; 64]);
fn sha256_transform_rorx(state: &mut [u32; 8], block: *const [u8; 64], num_blocks: u64);
Member

You forgot to change num_blocks to usize here. Also, we should probably change sha256_compress to take an explicit pointer and length as well (same for sha512_compress). IIRC the memory layout of slices is not guaranteed.

Author

It seems like it's guaranteed

Member

@newpavlov newpavlov Sep 8, 2021

Your link talks about the layout of the slice itself (i.e. about how the elements of a slice are stored in memory). In this context it's more about ABI guarantees, i.e. I don't think it's currently guaranteed that val: &[u8; 16] is equivalent to val_ptr: *const [u8; 16], len: usize when used in extern "C" fns. Can you please modify the signature just to be extra safe?
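
A minimal sketch of the requested signature shape, assuming a raw pointer plus block count at the FFI boundary and a safe Rust wrapper on top. The function `sha256_compress_raw` is a hypothetical stand-in written in Rust so that the example links without the real assembly file; in the crate it would be a declaration inside an `extern "C"` block, and its body here is a placeholder, not SHA-256.

```rust
/// Stand-in for the assembly routine (assumed name and signature).
/// It only folds the input bytes into `state[0]` so the call is observable.
unsafe extern "C" fn sha256_compress_raw(state: *mut u32, blocks: *const u8, num_blocks: usize) {
    for i in 0..num_blocks * 64 {
        // SAFETY: the caller guarantees `state` points to at least one u32
        // and `blocks` points to `num_blocks * 64` readable bytes.
        unsafe { *state = (*state).wrapping_add(*blocks.add(i) as u32) };
    }
}

/// Safe wrapper: the slice is split into an explicit pointer and length
/// before crossing the C ABI, so no reference or slice type is passed.
pub fn compress256(state: &mut [u32; 8], blocks: &[[u8; 64]]) {
    // SAFETY: the pointers come from valid references, and the length
    // is the number of 64-byte blocks in the slice.
    unsafe {
        sha256_compress_raw(state.as_mut_ptr(), blocks.as_ptr() as *const u8, blocks.len());
    }
}
```

Passing only primitive pointer and integer types keeps the foreign signature independent of how Rust represents references to arrays or slices.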

sha2/src/lib.rs Outdated Show resolved Hide resolved
@@ -13,23 +13,37 @@
#[cfg(not(any(target_arch = "x86_64", target_arch = "x86", target_arch = "aarch64")))]
compile_error!("crate can only be used on x86, x86-64 and aarch64 architectures");

cpufeatures::new!(cpuid_avx2, "avx2");
Member

Gate this line on #[cfg(any(target_arch = "x86_64", target_arch = "x86"))]. Otherwise it causes compilation failure on Aarch64 targets.

Member

You forgot to modify the compress256 function (see the CI failure). Currently it tries to use the cpuid_avx2 module on all targets. I think the easiest solution would be to introduce two functions with the same name, one gated on x86(-64) and the other on AArch64.
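
The "two functions with the same name" suggestion could look roughly like the sketch below. `backend_name` is a hypothetical function used only to make the gating visible; the real `compress256` would dispatch to the assembly routines instead of returning a string, and the standard library macro stands in for the `cpuid_avx2` module.

```rust
/// x86 / x86-64 variant: the AVX2 check exists only on these targets,
/// so it lives inside this gated function.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn backend_name() -> &'static str {
    if std::is_x86_feature_detected!("avx2") {
        "x86-avx2"
    } else {
        "x86-generic"
    }
}

/// AArch64 variant: same name and signature, no reference to AVX2.
#[cfg(target_arch = "aarch64")]
pub fn backend_name() -> &'static str {
    "aarch64"
}

/// Fallback so this sketch compiles everywhere; the crate itself rejects
/// other architectures with `compile_error!`.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
pub fn backend_name() -> &'static str {
    "other"
}
```

Because exactly one definition survives `cfg` evaluation on any given target, callers can use the function unconditionally.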

@0xdeafbeef 0xdeafbeef requested a review from newpavlov September 8, 2021 20:09
@newpavlov
Member

BTW could you also compare performance of the AVX2 based assembly with the intrinsics-based implementation from RustCrypto/hashes#312?

@0xdeafbeef
Author

asm

test bench1_10    ... bench:          20 ns/iter (+/- 2) = 500 MB/s
test bench2_100   ... bench:         164 ns/iter (+/- 10) = 609 MB/s
test bench3_1000  ... bench:       1,451 ns/iter (+/- 135) = 689 MB/s
test bench4_10000 ... bench:      14,165 ns/iter (+/- 1,319) = 705 MB/s

intrinsic

running 4 tests
test bench1_10    ... bench:          20 ns/iter (+/- 5) = 500 MB/s
test bench2_100   ... bench:         162 ns/iter (+/- 10) = 617 MB/s
test bench3_1000  ... bench:       1,408 ns/iter (+/- 159) = 710 MB/s
test bench4_10000 ... bench:      13,448 ns/iter (+/- 838) = 743 MB/s

Force soft.

running 4 tests
test bench1_10    ... bench:          23 ns/iter (+/- 4) = 434 MB/s
test bench2_100   ... bench:         196 ns/iter (+/- 23) = 510 MB/s
test bench3_1000  ... bench:       1,926 ns/iter (+/- 144) = 519 MB/s
test bench4_10000 ... bench:      18,350 ns/iter (+/- 1,070) = 544 MB/s

I think the asm version is not needed anymore.
Good job, @Rexagon!
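
For reference, the MB/s column in the output above is consistent with bytes per iteration times 1000 divided by nanoseconds per iteration, truncated to an integer. A small sketch, with `throughput_mb_s` as a hypothetical helper name and the input sizes (10, 100, 1000, 10000 bytes) inferred from the benchmark names:

```rust
/// Throughput in MB/s from bytes processed per iteration and ns per iteration.
/// Integer division truncates, which matches the figures quoted above.
fn throughput_mb_s(bytes_per_iter: u64, ns_per_iter: u64) -> u64 {
    bytes_per_iter * 1000 / ns_per_iter
}
```

For the 10000-byte case this gives 705 MB/s for asm (14,165 ns), 743 MB/s for intrinsics (13,448 ns) and 544 MB/s for the forced software path (18,350 ns), so intrinsics are about 1.36x faster than the software path in this run.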

@0xdeafbeef
Author

0xdeafbeef commented Sep 9, 2021

After pinning to the same core:
asm

running 4 tests
test bench1_10    ... bench:          19 ns/iter (+/- 0) = 526 MB/s
test bench2_100   ... bench:         152 ns/iter (+/- 3) = 657 MB/s
test bench3_1000  ... bench:       1,339 ns/iter (+/- 28) = 746 MB/s
test bench4_10000 ... bench:      13,041 ns/iter (+/- 343) = 766 MB/s

intrinsic

running 4 tests
test bench1_10    ... bench:          19 ns/iter (+/- 0) = 526 MB/s
test bench2_100   ... bench:         148 ns/iter (+/- 3) = 675 MB/s
test bench3_1000  ... bench:       1,276 ns/iter (+/- 30) = 783 MB/s
test bench4_10000 ... bench:      12,420 ns/iter (+/- 275) = 805 MB/s

@newpavlov should I close the PR?

@newpavlov
Member

Hm, I am not 100% sure. Some may prefer the assembly implementation from a reliability point of view, since with an intrinsics-based implementation we are at the mercy of the compiler, and in some cases the achieved performance can be brittle. On the other hand, people usually expect an assembly implementation to be faster than a "software" one.

@tarcieri
What do you think?

@tarcieri
Member

tarcieri commented Sep 9, 2021

Yeah, it's definitely a tradeoff. I think the biggest risk is actually miscompilation (see e.g. rust-lang/rust#79865).

That said I'd weakly be in favor of an all-intrinsics approach if performance is comparable to assembly. I think that better fits the philosophy of "Rust Crypto", and unless there are big performance wins with ASM it's probably best avoided, at least within the crates we maintain.

A pure Rust approach solves a lot of problems, especially relating to portability. Relevant: RustCrypto/hashes#315

@tarcieri tarcieri mentioned this pull request Sep 10, 2021
@newpavlov
Member

I also lean towards the stance "assembly impls only for sufficient performance improvements", so I guess we can close this PR.

@0xdeafbeef
Thank you for your contribution (at the very least I think it was a trigger for the AVX2 impl) and sorry this PR ended like this!

Successfully merging this pull request may close these issues.

Migrate to assembly from OpenSSL
3 participants