Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cmultithreading #343

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

pantoniou
Copy link

This PR provides multithreading for the C-implementation.
It is quite minimal, but it gets the job done of getting C to withing 1% of the Rust implementation.

We can get rid of uneeded memsets when clearing state, since
the count is always reset.

We can clear the hasher state on init if we need to start from
zeroes.

Signed-off-by: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
This is a minimal thread-pool implementation for BLAKE3.

As part of the commit a very simple file mmap-ing example is
introduced that providing file checksuming.

At least on my machines (x86-64 linux boxes) the C implementation
is within a 1% point in performance of the rust b3sum tool.

> $ ls -l ggml-vicuna-7b-1.1-q4_0.bin
-rw-rw-r-- 1 panto panto 3791725184 Ιουν 17 13:34 ggml-vicuna-7b-1.1-q4_0.bin

> $ time b3sum ggml-vicuna-7b-1.1-q4_0.bin
> fe87fecc4427c4661aa46f52a7e286ecd6dd7f5f788564abb2c46d2ed5341584  ggml-vicuna-7b-1.1-q4_0.bin
>
> real	0m0,190s
> user	0m1,583s
> sys	0m0,196s

> $ time ./example-mmap ggml-vicuna-7b-1.1-q4_0.bin
> fe87fecc4427c4661aa46f52a7e286ecd6dd7f5f788564abb2c46d2ed5341584  ggml-vicuna-7b-1.1-q4_0.bin
>
> real	0m0,204s
> user	0m1,525s
> sys	0m0,247s

At my local copy of BLAKE3 with more extensive changes I can get the C
implementation to be slightly faster than the rust one.
But those require more extensive changes which will have to come much
later.

Signed-off-by: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
When buffer sizes are small in streaming mode the partial output
must have the rest of the buffer filled with zeroes.

Signed-off-by: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
@oconnor663
Copy link
Member

Thank you for doing all this work! It might be a week or two before I have enough time to review it properly, but I'll be very interested to understand how it all works. Is it architecturally similar to Rayon, or something totally different?

I should be honest that I don't know if we're going to want to maintain this much threading machinery inside the project. One thing I'll want to think about is whether we could expose APIs that would make it possible to implement something like this on top of our code.

@pantoniou
Copy link
Author

Don't mention it. I had a need for a fast hash and BLAKE3 fits the bill.

I am not familiar with Rayon but from what I can understand from the exposed API the threading method is exactly the same.
I.e. the threading wait model.

Well, I wouldn't worry about it too much, since I'm maintaining a full blown C project with all the bells and whistles of detecting compiler versions and blake3 backends already.

Let me know if there's something I can do to make things easy for you guys to make next BLAKE3 version easier to drop in my codebase.

For example I could parsel out the BLAKE3 support as a standalone library that people could use like normal (i.e. apt-get install libcblake3) but I don't know if that's something you would be fine with. There's the small nit that I'm not much of a windows person so maybe target Linux/MacOS/unices first.

@pantoniou
Copy link
Author

FYI, I have finally released the blake3 integrated fyaml, you could take a look at the integrated work starting at pantoniou/libfyaml@140e762

@nirs
Copy link

nirs commented Oct 3, 2023

I think it will be more useful to provide a C API to the rust code so it can be used as a library instead of maintaining duplicate implementations.

@BurningEnlightenment
Copy link
Collaborator

This is notably missing Windows support. Furthermore I find it very intrusive if libraries spin up their own thread pools as there are machines with hundreds of cores out there and PIds are very much a limited resource, i.e. this can provoke system resource exhaustion events for some people.

@lelik107
Copy link

lelik107 commented Oct 13, 2024

@oconnor663 My only question here is - are you going to multithread the С code to the end of the World? Something is better than nothing! You could add OMP dependency and mark it as experimental and and than much more people would test how it actually works or it's a bad idea. But there nothing even similar I see in the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants