Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider multi-message batching #386

Open
riptl opened this issue Feb 23, 2024 · 3 comments
Open

Consider multi-message batching #386

riptl opened this issue Feb 23, 2024 · 3 comments

Comments

@riptl
Copy link

riptl commented Feb 23, 2024

Problem

The current blake3 crate leaves a lot of single core performance on the table for message sizes below 8 KiB.

Namely, it doesn't SIMD parallelize hashing for small messages.

As a PoC, I've rewritten a BLAKE3 scheduler from scratch with a modified AVX2 backend:
https://github.com/firedancer-io/firedancer/tree/ripatel/fd_blake3/src/ballet/blake3

When hashing many independent 2 KiB messages concurrently, my implementation does 25 Gbps, while the C implementation does ~7 Gbps.

image

I would like to contribute back my changes to this official library.
My code is Apache-2.0 licensed, so feel free to copy from it.

Suggested Changes

There are three major pieces required:

  1. Adapt the SIMD backends to do work off each lane independently, namely:
    • Support lane masking (if one AVX lane finishes hashing before another does) -- Currently, all lanes are always active
    • Support independent chunk counters and flags -- the current backend assumes a contiguous range of chunks or parents
  2. Adapt the scheduler to dispatch operations from in-flight hash calcuations to the SIMD backend concurrently
    • I'm not sure if the hash tree scheduling algorithm proposed in the BLAKE3 paper is capable of doing so.
    • When queueing operations for an in-flight hash state, and we're unable to have enough chunks to hash in parallel to meet the SIMD degree, the algorithm should yield to the next in-flight hash state, before actually starting to hash.
    • I've rewritten the scheduler from scratch, but it requires log2(chunk_cnt) * simd_degree * 32 working space per hash state. The algorithm I came up with is unfortunately much more complex than the elegant stack-based one in the paper.
  3. Adapt the high-level API to tell the scheduler if there are multiple in-flight hash states
    • The simplest way is a new function call: fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]>
    • Another way is to use thread-local storage to keep track of streaming operations on the current thread
    s1 := Blake3::new()
    s2 := Blake3::new()
    s1.append("abcd");  // registers this append operation as a thread-local
    s2.append("1234"): 
    hash2 := s2.fini();  // finds that s1 is also queued via thread-locals, so hashes both s1 and s2
    hash1 := s1.fini();  // no-op! the result is already available
    
@riptl riptl changed the title Consider multi-block batching Consider multi-message batching Feb 23, 2024
@recmo
Copy link

recmo commented Sep 2, 2024

Hashing many very short (< 128 byte) messages is a very important use-case for Merkle-trees and Proof-of-Work, both are major compute bottlenecks in transparent SNARKs (e.g. FRI/Ligero/Basefold like).

It looks like the docs-hidden api Platform::hash_many can support this? I will try this.

@ripatel-fd
Copy link

ripatel-fd commented Sep 5, 2024

@recmo Please note the hash_many API is used to hash multiple chunks, not multiple messages. Internally, hash_many distributes chunks across SIMD lanes for high byte-per-cycle throughput. A chunk is a leaf node in the BLAKE3 hash tree. Messages smaller than 1024 bytes have exactly one chunk. However, chunks are made up of a variable number of blocks (ceil(data_sz/64)). Finally, hash_many requires that each provided chunk has the same number of blocks.

So, hash_many will work for your use-case only if each message in your batch has the same number of 64-byte blocks (ceil(message_sz/64)). Because this is a lower-level API, you'd need to figure out the flags and counter arguments yourself from the BLAKE3 paper.

My BLAKE3 rewrite linked above is more flexible:

  • My equivalent of the hash_many function supports lane masking, allowing you to hash chunks with different number of blocks, e.g. if you have a 128 byte message and a 140 byte message in the same batch (costly under AVX2 but effectively zero cost under AVX512)
  • It provides a high-level API and generic scheduler for batch hashing for arbitrary message sizes where you don't have to hand-specify flags, making it useful for >1024 byte messages (the scheduling logic makes it a bit slower)

Happy to dust it off and get it up to speed if needed.

@ripatel-fd
Copy link

cc @zookozcash is there any renewed interest in supporting batched hashing of messages with different sizes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants