blake3 single thread is slower than sha256 on Apple silicon #315

nirs · 2023-06-17T20:25:45Z

On Intel CPUs I see a ~10x speedup over sha256 with the C implementation, but on
Apple silicon BLAKE3 is about 1.3x slower than sha256.

I built the C version both with cmake and manually (based on README.md). Both builds
show the same performance, matching b3sum with a single thread.

cmake build:

% cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
-- The C compiler identification is AppleClang 14.0.3.14030022
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- 
 * NEON SIMD intrinsics, The library uses NEON SIMD intrinsics.

-- Configuring done (0.2s)
-- Generating done (0.0s)
-- Build files have been written to: /Users/nir/src/BLAKE3/c/build

% cmake --build build
[ 20%] Building C object CMakeFiles/blake3.dir/blake3.c.o
[ 40%] Building C object CMakeFiles/blake3.dir/blake3_dispatch.c.o
[ 60%] Building C object CMakeFiles/blake3.dir/blake3_portable.c.o
[ 80%] Building C object CMakeFiles/blake3.dir/blake3_neon.c.o
clang: warning: argument unused during compilation: '-mfpu=neon' [-Wunused-command-line-argument]
[100%] Linking C static library libblake3.a
[100%] Built target blake3

Manual build:

% mkdir build
% gcc -shared -O3 -o build/libblake3.so -DBLAKE3_USE_NEON=1 blake3.c blake3_dispatch.c \
    blake3_portable.c blake3_neon.c

Building the example:

% cc example.c -O3 -L build -lblake3 -o build/example

Creating test file:

% dd if=/dev/zero bs=1M count=4096 of=test.data

Testing read throughput (file redirected to stdin):

% time dd bs=64K of=/dev/null status=none < test.data
dd bs=64K of=/dev/null status=none < test.data  0.02s user 0.34s system 99% cpu 0.355 total

Measuring hash throughput:

% time openssl sha256 < test.data
8479e43911dc45e89f934fe48d01297e16f51d17aa561d4d1c216b1ae0fcddca
openssl sha256 < test.data  1.71s user 0.49s system 99% cpu 2.202 total

% time build/example < test.data
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7
build/example < test.data  2.58s user 0.36s system 99% cpu 2.942 total

% time b3sum < test.data
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  -
b3sum < test.data  2.59s user 0.36s system 99% cpu 2.951 total

% time b3sum --num-threads 1 < test.data
7dde7c9fed144013fedbe2b0bbf2d82f004b60b589485851cdec29b27be408d7  -
b3sum --num-threads 1 < test.data  2.58s user 0.36s system 99% cpu 2.935 total
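
To make the timings above easier to compare, here is a small sketch that converts them to throughput. It assumes the test file is exactly 4096 MiB (that is, that `bs=1M` means 1 MiB blocks) and uses the wall-clock "total" column; the dictionary labels are only illustrative.

```python
# Timings copied from the `time` output above (wall-clock "total" column).
FILE_MIB = 4096  # dd bs=1M count=4096, assuming 1M means 1 MiB

timings_s = {
    "dd (read only)": 0.355,
    "openssl sha256": 2.202,
    "build/example": 2.942,
    "b3sum --num-threads 1": 2.935,
}


def throughput_mib_s(seconds, size_mib=FILE_MIB):
    """Return throughput in MiB/s for hashing size_mib in the given time."""
    return size_mib / seconds


for name, secs in timings_s.items():
    print(f"{name:24s} {throughput_mib_s(secs):8.0f} MiB/s")

# How much slower the BLAKE3 example is than openssl sha256.
slowdown = timings_s["build/example"] / timings_s["openssl sha256"]
print(f"build/example vs openssl sha256: {slowdown:.2f}x slower")
```

The read-only `dd` run is several times faster than either hash, so reading the file does not appear to be the bottleneck in these measurements.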

Note: I tested with openssl sha256 because both shasum -a 256 and sha256sum
(from coreutils) are extremely slow (~6x slower) on macOS.

Looking at the OpenSSL code, sha256 uses sha256-armv8.S on this machine.

Tested on a MacBook Pro with an M2 Max.

oconnor663 · 2023-06-17T20:47:44Z

ARM NEON only provides 128-bit vector registers, compared to the 512-bit registers available on Intel CPUs that support AVX-512, and that's a big part of the difference you're seeing. There are lots of other details besides just vector size that also come into play here; see for example @sneves' comment on another recent thread.

ARM SVE and SVE2 can potentially provide larger vectors, but I'm not aware of any consumer hardware that supports those. It'll probably make sense for BLAKE3 to provide an SVE implementation at some point.
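
As a rough illustration of the register-width point, the sketch below counts how many 32-bit words (the word size BLAKE3 operates on) fit in one vector register. The AVX2 row is added here for comparison and is not mentioned in the thread. The lane count is only an upper bound on the expected speedup, since, as noted above, other details also matter.

```python
WORD_BITS = 32  # BLAKE3 operates on 32-bit words

# Register widths in bits; AVX2 is included only for comparison.
vector_bits = {"NEON": 128, "AVX2": 256, "AVX-512": 512}


def lanes(register_bits, word_bits=WORD_BITS):
    """Number of words that fit in one vector register."""
    return register_bits // word_bits


for isa, bits in vector_bits.items():
    print(f"{isa:8s} {bits:4d}-bit registers -> {lanes(bits):2d} words per register")

# Ratio of AVX-512 lanes to NEON lanes.
print(lanes(vector_bits["AVX-512"]) // lanes(vector_bits["NEON"]))
```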

sneves · 2023-06-17T20:58:21Z

Apple Silicon also happens to have fast dedicated SHA-256 instructions. This is why openssl sha256 is much faster: it uses them instead of a pure software implementation.
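
One way to check for these instructions on a given Mac is to query `sysctl`. The sketch below assumes the flag is exposed as `hw.optional.arm.FEAT_SHA256`, which is the name used on recent macOS releases; older releases may use a different key. The helper names are hypothetical, and the function returns `None` when the key or the `sysctl` command is unavailable.

```python
import subprocess

# Assumption: recent macOS releases expose the SHA-256 extension under this key.
SYSCTL_KEY = "hw.optional.arm.FEAT_SHA256"


def parse_sysctl_flag(output):
    """Parse 'name: value' sysctl output; True if the value is 1."""
    _, _, value = output.strip().partition(":")
    return value.strip() == "1"


def has_sha256_instructions():
    """Return True/False from sysctl, or None if the query fails."""
    try:
        result = subprocess.run(
            ["sysctl", SYSCTL_KEY],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # not macOS, or the key does not exist
    return parse_sysctl_flag(result.stdout)


print(has_sha256_instructions())
```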

nirs · 2024-02-26T21:29:12Z

Updating the results: the difference is now only 12-13%.

Tested with:

OpenSSL 3.2.1 30 Jan 2024 (Library: OpenSSL 3.2.1 30 Jan 2024)
b3sum 1.5.0 (from brew)

3 builds of the example program from git commit 8fc3618:

example-brew - linked with blake3 1.5.0 from brew

gcc -O3 -o example-brew c/example.c $(pkg-config --libs libblake3)

example-neon - built from source with neon support

gcc -O3 -o example-neon -DBLAKE3_USE_NEON=1 c/example.c c/blake3.c c/blake3_dispatch.c c/blake3_portable.c c/blake3_neon.c

example-portable - built from source without neon support

gcc -O3 -o example-portable -DBLAKE3_USE_NEON=0 c/example.c c/blake3.c c/blake3_dispatch.c c/blake3_portable.c

% hyperfine -w 2 "openssl sha256 < /var/tmp/1g.img" \
                 "b3sum < /var/tmp/1g.img" \
                 "./example-brew < /var/tmp/1g.img" \
                 "./example-neon < /var/tmp/1g.img" \
                 "./example-portable < /var/tmp/1g.img"
Benchmark 1: openssl sha256 < /var/tmp/1g.img
  Time (mean ± σ):     553.2 ms ±   1.2 ms    [User: 428.8 ms, System: 112.5 ms]
  Range (min … max):   550.5 ms … 555.4 ms    10 runs

Benchmark 2: b3sum < /var/tmp/1g.img
  Time (mean ± σ):     621.3 ms ±   1.2 ms    [User: 536.2 ms, System: 72.9 ms]
  Range (min … max):   619.1 ms … 622.9 ms    10 runs

Benchmark 3: ./example-brew < /var/tmp/1g.img
  Time (mean ± σ):     626.7 ms ±   1.1 ms    [User: 542.9 ms, System: 71.1 ms]
  Range (min … max):   624.8 ms … 628.0 ms    10 runs

Benchmark 4: ./example-neon < /var/tmp/1g.img
  Time (mean ± σ):     619.1 ms ±   1.2 ms    [User: 534.1 ms, System: 70.8 ms]
  Range (min … max):   617.5 ms … 622.1 ms    10 runs

Benchmark 5: ./example-portable < /var/tmp/1g.img
  Time (mean ± σ):      1.315 s ±  0.004 s    [User: 1.204 s, System: 0.084 s]
  Range (min … max):    1.308 s …  1.322 s    10 runs

Summary
  'openssl sha256 < /var/tmp/1g.img' ran
    1.12 ± 0.00 times faster than './example-neon < /var/tmp/1g.img'
    1.12 ± 0.00 times faster than 'b3sum < /var/tmp/1g.img'
    1.13 ± 0.00 times faster than './example-brew < /var/tmp/1g.img'
    2.38 ± 0.01 times faster than './example-portable < /var/tmp/1g.img'
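
The hyperfine means above can be turned into relative figures with a few lines of arithmetic. This sketch recomputes the ratios against openssl sha256 and also derives the NEON-over-portable speedup, which hyperfine's summary does not print directly. The numbers are copied from the output above, and the variable names are only illustrative.

```python
# Mean wall-clock times in milliseconds, copied from the hyperfine output.
mean_ms = {
    "openssl sha256": 553.2,
    "b3sum": 621.3,
    "example-brew": 626.7,
    "example-neon": 619.1,
    "example-portable": 1315.0,
}

baseline = mean_ms["openssl sha256"]


def relative(name):
    """Run time of a benchmark relative to openssl sha256."""
    return mean_ms[name] / baseline


for name in mean_ms:
    extra = (relative(name) - 1) * 100
    print(f"{name:18s} {relative(name):.2f}x  ({extra:+.1f}%)")

# Speedup of the NEON build over the portable build of the same source.
neon_speedup = mean_ms["example-portable"] / mean_ms["example-neon"]
print(f"NEON vs portable: {neon_speedup:.2f}x faster")
```

On these numbers the NEON build is roughly twice as fast as the portable build, while still trailing the hardware-accelerated SHA-256.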

silvanshade mentioned this issue Apr 23, 2024

Add BLAKE3 hashing algorithm (single-threaded C-based implementation) NixOS/nix#10600
