Add BLAKE3 hashing algorithm (single-threaded C-based implementation) #10600

silvanshade · 2024-04-23T21:48:38Z

Motivation

This PR adds BLAKE3 to the available hashing algorithms.

NOTE: Although this PR is complete (and I would appreciate any feedback), I'm adding it as a draft for now because I would also like to try working on an alternative PR based on the multi-threaded Rust implementation (more below).

Context

Related issues:
- Implement BLAKE2b checksums #2166
- Add support for BLAKE3 hash to fetchurl #8475

The change is relatively small and non-invasive.

I added the BLAKE3 source tarball as a derivation and used this to define a NIX_BLAKE3_SRC environment variable which is then referenced in the makefiles.

The recommended way to build the BLAKE3 C implementation is to add the files directly to the build system rather than compiling them separately as a library, which is the reason for fetching the tarball.

To build source files from NIX_BLAKE3_SRC (which is read-only), I needed to add some custom build rules specifically for those files.

I also added platform detection for ARM and x86_64 (along with detection for Darwin, Linux, and Windows) and use this to conditionally compile the appropriate SIMD implementations for the given platform.

I use the assembly files directly rather than the C-based intrinsics versions since that is also the recommended approach:

> For each of the x86 SIMD instruction sets, four versions are available: three flavors of assembly (Unix, Windows MSVC, and Windows GNU) and one version using C intrinsics. The assembly versions are generally preferred. They perform better, they perform more consistently across different compilers, and they build more quickly. On the other hand, the assembly versions are x86_64-only, and you need to select the right flavor for your target platform.

The BLAKE3 dispatcher will automatically fall back to the portable implementation if a hardware accelerated implementation is unavailable.
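
The platform-dependent source selection described above can be sketched roughly as follows. This is a minimal illustration in Python, not the makefile rules from this PR: the function name `blake3_sources` is hypothetical, and the file names are assumed to match the layout of the upstream BLAKE3 `c/` directory, so they should be checked against the tarball actually in use.

```python
# Sketch of choosing BLAKE3 C sources per platform (not the PR's makefile logic).
# File names are ASSUMED to follow the upstream BLAKE3 `c/` directory layout.

X86_64_FEATURES = ["sse2", "sse41", "avx2", "avx512"]


def blake3_sources(arch, os_name, windows_msvc=False):
    """Return the list of source files to compile for the given target."""
    # Always needed: the core, the runtime dispatcher, and the portable fallback.
    sources = ["blake3.c", "blake3_dispatch.c", "blake3_portable.c"]

    if arch == "x86_64":
        # Pick the assembly flavor that matches the target platform.
        if os_name in ("linux", "darwin"):
            suffix = "unix.S"
        elif os_name == "windows":
            suffix = "windows_msvc.asm" if windows_msvc else "windows_gnu.S"
        else:
            # Unknown OS: rely on the portable implementation only.
            return sources
        sources += [f"blake3_{feat}_x86-64_{suffix}" for feat in X86_64_FEATURES]
    elif arch in ("aarch64", "arm64"):
        # NEON is provided as a C file using intrinsics, not as assembly.
        sources.append("blake3_neon.c")

    return sources
```

For example, `blake3_sources("x86_64", "linux")` yields the three common files plus four `*_x86-64_unix.S` files, while an architecture without a SIMD implementation gets only the common files and relies on the portable fallback.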

Performance

I have run benchmarks of the implementation which I detail below.

First, though, some important things to note:

- This is the C implementation, which is single-threaded. Although it is very fast, the Rust version, which uses Rayon for multi-threading, scales almost linearly up to memory bandwidth limits, so it is significantly faster.
- The NEON implementation is known to be much less performant than the SSE and AVX implementations:
  - NEON version is only 26% faster than portable on Raspberry Pi 4 BLAKE3-team/BLAKE3#310
  - blake3 single thread is slower than sha256 on Apple silicon BLAKE3-team/BLAKE3#315
- The test file was generated with head -c 5G /dev/urandom > ~/Downloads/largefile.bin

Apple M3 Max

compiled with: CFLAGS="-O3 -mcpu=apple-m2" configurePhase (and OPTIMIZE=1)

BLAKE3

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin
  Time (mean ± σ):      2.701 s ±  0.006 s    [User: 2.345 s, System: 0.350 s]
  Range (min … max):    2.692 s …  2.714 s    10 runs

SHA256

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin
  Time (mean ± σ):      2.105 s ±  0.005 s    [User: 1.748 s, System: 0.352 s]
  Range (min … max):    2.097 s …  2.110 s    10 runs

SHA512

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin
  Time (mean ± σ):      3.327 s ±  0.006 s    [User: 2.964 s, System: 0.356 s]
  Range (min … max):    3.321 s …  3.338 s    10 runs

AMD Zen 4 Ryzen 9 7950x

compiled with: CFLAGS="-O3 -march=znver4" configurePhase (and OPTIMIZE=1)

BLAKE3

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type blake3 ~/Downloads/largefile.bin
  Time (mean ± σ):     915.0 ms ±  11.2 ms    [User: 573.2 ms, System: 339.8 ms]
  Range (min … max):   898.6 ms … 932.4 ms    10 runs

SHA256

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha256 ~/Downloads/largefile.bin
  Time (mean ± σ):      2.344 s ±  0.016 s    [User: 1.980 s, System: 0.359 s]
  Range (min … max):    2.320 s …  2.371 s    10 runs

SHA512

hyperfine --warmup 3 './outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin'
Benchmark 1: ./outputs/out/bin/nix --extra-experimental-features nix-command hash file --type sha512 ~/Downloads/largefile.bin
  Time (mean ± σ):      4.529 s ±  0.039 s    [User: 4.156 s, System: 0.363 s]
  Range (min … max):    4.490 s …  4.593 s    10 runs
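
To put the timings above in perspective, the mean wall-clock times can be converted into throughput and into a ratio relative to SHA-256. The sketch below uses only the means reported above and assumes that `head -c 5G` produced 5 GiB (5 × 1024³ bytes). The dictionary and function names are illustrative. Note that these times include process startup and file I/O, so they understate raw hashing throughput.

```python
# Mean wall-clock times (seconds) copied from the hyperfine output above.
MEAN_SECONDS = {
    "Apple M3 Max": {"blake3": 2.701, "sha256": 2.105, "sha512": 3.327},
    "AMD Zen 4 Ryzen 9 7950x": {"blake3": 0.915, "sha256": 2.344, "sha512": 4.529},
}

FILE_GIB = 5.0  # ASSUMPTION: `head -c 5G` wrote 5 GiB


def throughput_gib_s(seconds):
    """End-to-end throughput in GiB/s for one run over the test file."""
    return FILE_GIB / seconds


def speedup_vs_sha256(times, algo):
    """How many times faster `algo` is than sha256 on the same machine."""
    return times["sha256"] / times[algo]


for machine, times in MEAN_SECONDS.items():
    print(machine)
    for algo, secs in times.items():
        print(f"  {algo}: {throughput_gib_s(secs):.2f} GiB/s, "
              f"{speedup_vs_sha256(times, algo):.2f}x vs sha256")
```

By these figures, BLAKE3 is about 2.6 times faster than SHA-256 on the Ryzen 9 7950x but slower than SHA-256 on the M3 Max (about 0.78x), which is consistent with the NEON caveat noted earlier.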

silvanshade · 2024-05-02T21:25:54Z

I closed this because I now have the Rust Rayon-based implementation working locally, which is several times faster. I'll open a new PR soon.

theoparis · 2024-05-10T21:51:35Z

I don't want to have to rely on rust to build nix 😕 A better solution would probably be implementing multithreading in the blake3 C implementation...

silvanshade · 2024-05-12T00:07:03Z

> I don't want to have to rely on rust to build nix 😕

Do you have a technical reason why?

Rust is available for use in the Linux kernel, the Windows kernel and APIs, Firefox, and Chrome; it is being integrated into GCC; and it is used in many other mission-critical and established software projects. And then there's nickel. I don't think reliability is an issue at this point.

> A better solution would probably be implementing multithreading in the blake3 C implementation...

I disagree. There are good reasons why this hasn't happened.

In order for a multithreaded implementation in C++ to match the performance of the Rust Rayon-based implementation, one would need to use something like Kokkos, OpenMP, SYCL, or some similar framework.

Those frameworks often have non-trivial external dependencies and setup, and are generally less portable than Rust.

Then, aside from portability, there's the overall complexity of using those frameworks, and the additional maintenance burden of a completely separate multithreaded implementation.

Some of these issues have been addressed further in the following posts from the BLAKE3 maintainer:

Also a lot of additional context in these threads:

cameronfyfe · 2024-05-15T06:16:15Z

@silvanshade let me know if you'd like help testing anything with the subsequent PR. I'd like to see blake3 support added as well.

devinrsmith · 2024-09-06T03:52:41Z

A blake3 hash would be great. I'm agnostic as to whether it's the single-threaded C implementation or the more performant Rust implementation.

silvanshade · 2024-09-07T19:16:32Z

Just want to mention that I haven't forgotten about this and intend to return to it. I'm currently working on a related project that will help with the integration here. Hopefully I can share more details soon.

MatthewCroughan · 2024-12-23T16:45:43Z

@silvanshade Having this done would be a godsend for mirroring FODs on IPFS

Ericson2314 · 2024-12-23T21:06:25Z

#11999 we do want Blake support!

Ericson2314 · 2025-01-20T00:26:40Z

Note per #11999 (comment) I hope the author will soon be reopening this :)

Add BLAKE3 hashing algorithm (single-threaded C-based implementation)

1d5b4c2

silvanshade force-pushed the blake3-c branch from a1b115c to 1d5b4c2 Compare April 23, 2024 22:17

edolstra mentioned this pull request Apr 29, 2024

nix hash to-sri could infer hash type from hash length #10606

Open

silvanshade closed this May 2, 2024

silvanshade mentioned this pull request Jan 3, 2025

Blake hashing tracking issue #11999

Open

This was referenced Jan 29, 2025

Add BLAKE3 hashing algorithm #12379

Merged

Add BLAKE3 hashing algorithm via Rust interop #12416

Open
