Add BLAKE3 hashing algorithm (single-threaded C-based implementation) #10600
Conversation
I closed this because I now have the Rust rayon-based implementation working locally which is several times faster. I'll open a new PR soon.
I don't want to have to rely on Rust to build Nix 😕 A better solution would probably be implementing multithreading in the BLAKE3 C implementation...
Do you have a technical reason why? Rust is available for use in the Linux kernel, Windows kernel and APIs, Firefox, Chrome, is being integrated into GCC, and is used in many other mission critical and established software. And then there's nickel. I don't think reliability is an issue at this point.
I disagree. There are good reasons why this hasn't happened. In order for a multithreaded implementation in C++ to match the performance of the Rust Rayon-based implementation, one would need to use something like Kokkos, OpenMP, SYCL, or some similar framework. Those frameworks often have non-trivial external dependencies and setup, and are generally less portable than Rust. Then, aside from portability, there's the overall complexity of using those frameworks, and the additional maintenance burden of a completely separate multithreaded implementation. Some of these issues have been addressed further in the following posts from the BLAKE3 maintainer:
Also a lot of additional context in these threads:
@silvanshade let me know if you'd like help testing anything with the subsequent PR. I'd like to see BLAKE3 support added as well.
A BLAKE3 hash would be great. I'm agnostic as to whether it's the single-threaded C implementation or the more performant Rust one.
Just want to mention that I haven't forgotten about this and intend to return to it. I'm currently working on a related project that will help with the integration here. Hopefully I can share more details soon.
@silvanshade Having this done would be a godsend for mirroring FODs on IPFS
#11999 we do want Blake support! |
Note per #11999 (comment) I hope the author will soon be reopening this :) |
Motivation
This PR adds BLAKE3 to the available hashing algorithms.
NOTE: Although this PR is complete (and I would appreciate any feedback), I'm adding it as a draft for now because I would also like to try working on an alternative PR based on the multi-threaded Rust implementation (more below).
Context
The change is relatively small and non-invasive.
I added the BLAKE3 source tarball as a derivation and used this to define a `NIX_BLAKE3_SRC` environment variable which is then referenced in the makefiles.
The recommended way to build the BLAKE3 C implementation is just to add the files directly to the build system rather than trying to compile separately as a library, so that's the reasoning for fetching the tarball.
In order to handle building source files from `NIX_BLAKE3_SRC` (which is read-only) I needed to add some custom build rules specifically for those files.
I also added platform detection for `ARM` and `x86_64` (along with detection for Darwin, Linux, and Windows) and use this to conditionally compile the appropriate SIMD implementations for the given platform.
I use the assembly files directly rather than the C-based intrinsics versions since that is also the recommended approach:
The BLAKE3 dispatcher will automatically fall back to the portable implementation if a hardware accelerated implementation is unavailable.
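The platform selection described above can be sketched in C with compiler-predefined macros. This is only an illustration: the PR does its detection in the makefiles, and every enum and function name below (`target_arch`, `target_os`, `simd_sources`, `detect_arch`, `detect_os`, `select_simd_sources`) is hypothetical. It also assumes that "ARM" means 64-bit ARM and that non-Windows x86_64 targets use the Unix-syntax assembly.

```c
/* Hypothetical sketch: which SIMD source set a build might pick per platform.
 * None of these names come from the PR or from the BLAKE3 sources. */

typedef enum { ARCH_X86_64, ARCH_ARM64, ARCH_OTHER } target_arch;
typedef enum { OS_DARWIN, OS_LINUX, OS_WINDOWS, OS_OTHER } target_os;
typedef enum {
    SIMD_X86_ASM_UNIX,    /* x86_64 assembly, Unix assembler syntax */
    SIMD_X86_ASM_WINDOWS, /* x86_64 assembly, Windows flavor */
    SIMD_NEON,            /* ARM NEON implementation */
    SIMD_NONE             /* portable code only */
} simd_sources;

/* Detect the target architecture from compiler-predefined macros. */
target_arch detect_arch(void) {
#if defined(__x86_64__) || defined(_M_X64)
    return ARCH_X86_64;
#elif defined(__aarch64__) || defined(_M_ARM64)
    return ARCH_ARM64; /* assumption: only 64-bit ARM is treated as "ARM" */
#else
    return ARCH_OTHER;
#endif
}

/* Detect the target operating system from compiler-predefined macros. */
target_os detect_os(void) {
#if defined(_WIN32)
    return OS_WINDOWS;
#elif defined(__APPLE__)
    return OS_DARWIN;
#elif defined(__linux__)
    return OS_LINUX;
#else
    return OS_OTHER;
#endif
}

/* Choose the accelerated source set; anything unrecognized gets none,
 * which leaves only the portable implementation to fall back on. */
simd_sources select_simd_sources(target_arch arch, target_os os) {
    switch (arch) {
    case ARCH_X86_64:
        /* assumption: every non-Windows x86_64 target takes the Unix flavor */
        return os == OS_WINDOWS ? SIMD_X86_ASM_WINDOWS : SIMD_X86_ASM_UNIX;
    case ARCH_ARM64:
        return SIMD_NEON;
    default:
        return SIMD_NONE;
    }
}
```

Calling `select_simd_sources(detect_arch(), detect_os())` gives the choice for the machine the code is compiled for. Returning `SIMD_NONE` for unknown targets mirrors the fallback to the portable implementation.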
Performance
I have run benchmarks of the implementation which I detail below.
First, though, some important things to note:
This is the C implementation, which is single-threaded. Although it is very fast, the Rust version which uses Rayon for multi-threading scales almost linearly up to memory bandwidth limits, so it's obviously significantly faster.
The NEON implementation is known to not be nearly as performant as the SSE and AVX implementations:
Test file was generated with `head -c 5G /dev/urandom > ~/Downloads/largefile.bin`
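For readers who want to reproduce this kind of measurement, here is a rough sketch of a single-threaded streaming benchmark. It is not the harness used for the numbers in this PR. The hasher is a stand-in (64-bit FNV-1a, not BLAKE3), and all names (`toy_hasher`, `hash_file`, `gib_per_second`) are hypothetical. With GNU `head`, `5G` means 5 × 1024³ bytes, so throughput is reported in GiB/s.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Stand-in hasher: 64-bit FNV-1a. This is NOT BLAKE3; swap in a real
 * hasher's init/update calls to measure something meaningful. */
typedef struct { uint64_t state; } toy_hasher;

void toy_hasher_init(toy_hasher *h) {
    h->state = 14695981039346656037ULL; /* FNV-1a 64-bit offset basis */
}

void toy_hasher_update(toy_hasher *h, const void *data, size_t len) {
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h->state ^= p[i];
        h->state *= 1099511628211ULL; /* FNV 64-bit prime */
    }
}

/* Convert a byte count and elapsed time into GiB per second. */
double gib_per_second(uint64_t bytes, double seconds) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0) / seconds;
}

/* Stream a file through the hasher in 64 KiB chunks.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 * Note: clock() measures CPU time, which approximates wall time only
 * for a single-threaded, CPU-bound run. */
int hash_file(const char *path, toy_hasher *h,
              uint64_t *total, double *seconds) {
    static unsigned char buf[1 << 16];
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    clock_t start = clock();
    toy_hasher_init(h);
    *total = 0;

    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        toy_hasher_update(h, buf, n);
        *total += n;
    }

    int err = ferror(f);
    fclose(f);
    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return err ? -1 : 0;
}
```

After a successful `hash_file` call, `gib_per_second(total, seconds)` gives the throughput. Running it twice and discarding the first run helps ensure the file is read from the page cache rather than from disk.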
Apple M3 Max
`CFLAGS="-O3 -mcpu=apple-m2" configurePhase` (and `OPTIMIZE=1`)
BLAKE3
AMD Zen 4 Ryzen 9 7950x
`CFLAGS="-O3 -march=znver4" configurePhase` (and `OPTIMIZE=1`)
BLAKE3
SHA256
SHA512