blake3 single thread is slower than sha256 on Apple silicon #315
Comments
ARM NEON only provides 128-bit vector registers, compared to the 512-bit registers available on Intel CPUs that support AVX-512, and that's a big part of the difference you're seeing. There are lots of other details besides just vector size that also come into play here; see for example @sneves' comment on another recent thread. ARM SVE and SVE2 can potentially provide larger vectors, but I'm not aware of any consumer hardware that supports those. It'll probably make sense for BLAKE3 to provide an SVE implementation at some point.
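To make the vector-width argument concrete, here is a rough sketch of how many 32-bit BLAKE3 words fit in one vector register, which bounds how many inputs a SIMD implementation can process side by side. The helper name `parallel_lanes` is hypothetical, AVX2 is included only as a reference point that the comment above does not mention, and the calculation ignores the other factors referred to there (instruction throughput, rotates, register count).

```python
# BLAKE3's compression function works on 32-bit words.
WORD_BITS = 32


def parallel_lanes(vector_bits: int) -> int:
    """Number of 32-bit words that fit in one vector register.

    Hypothetical helper for illustration only; it is not part of the
    BLAKE3 code base.
    """
    return vector_bits // WORD_BITS


if __name__ == "__main__":
    # Register widths in bits for the instruction sets discussed above.
    for name, bits in [("NEON", 128), ("AVX2", 256), ("AVX-512", 512)]:
        print(f"{name}: {bits}-bit registers, {parallel_lanes(bits)} lanes")
```

By this measure a 512-bit register holds four times as many words as a 128-bit one, which is consistent with vector width being a large part of the gap, though not all of it.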
Apple Silicon also happens to have fast dedicated SHA-256 instructions. This is why
Updating results: we now see only a 12-13% difference. Tested with:
On Intel CPUs I see a ~10x speedup for the C implementation, but on Apple silicon it
is 1.3x slower than sha256.
I built the C version both with cmake and manually (based on README.md). Both show
the same performance, matching b3sum with a single thread.
cmake build:
Manual build:
Building the example:
Creating test file:
Testing read throughput from pipe:
Measuring hash throughput:
Note: testing with `openssl sha256`, since both `shasum -a 256` and `sha256sum` (from coreutils) are extremely slow (~6x slower) on macOS.
Looking at the OpenSSL code, sha256 uses sha256-armv8.S on this machine.
Tested on MacBook Pro M2 Max.
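For anyone who wants to reproduce a comparison like this without building the C example, the following is a minimal in-process sketch, not the commands used for the measurements above. It hashes a buffer held in memory so that file and pipe I/O are excluded. The function name `hash_throughput` is hypothetical, and the BLAKE3 part assumes the third-party `blake3` Python package is installed; if it is not, only SHA-256 is measured. Absolute numbers will differ from `b3sum` and `openssl sha256`.

```python
import hashlib
import time


def hash_throughput(new_hasher, data: bytes, rounds: int = 5) -> float:
    """Return the best throughput in MiB/s over `rounds` runs.

    `new_hasher` is any zero-argument callable returning an object with
    hashlib-style update() and digest() methods.
    """
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        hasher = new_hasher()
        hasher.update(data)
        hasher.digest()
        best = min(best, time.perf_counter() - start)
    return len(data) / best / (1024 * 1024)


if __name__ == "__main__":
    # 16 MiB of zeros kept in memory, so only hashing is timed.
    data = bytes(16 * 1024 * 1024)

    candidates = {"sha256": hashlib.sha256}
    try:
        # Assumption: third-party package, e.g. installed with `pip install blake3`.
        import blake3
        candidates["blake3"] = blake3.blake3
    except ImportError:
        pass

    for name, factory in candidates.items():
        print(f"{name}: {hash_throughput(factory, data):.0f} MiB/s")
```

Taking the best of several runs reduces noise from frequency scaling and background load; the ratio between the two results is the figure to compare against the numbers reported here.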