-
Notifications
You must be signed in to change notification settings - Fork 363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
b3sum has poor performance for large files on spinning disks, when multi-threading is enabled #31
Comments
Hmm, there's a lot that's weird here. Just focusing on |
Cpu is 64bit x86_64-linux-gnu core i3-3217U @1.80GHz dual core with sse4 vector extensions and is little endian |
I have tried hashing both a zip file and a mkv format file and b3sum seems to be about 2.5 times slower than blake2b crates |
So I span up a cloud compute instance with 3.5GB RAM and 2 virtual Cpus with avx512 vector extensions and it appears that b3sum is faster than blake2b crates. I will have to troubleshoot this on my machine as it seems there might be a problem for my setup. I will close this issue for now. Thanks for looking into it |
To be fair, the difference between BLAKE2b and BLAKE3 will be a lot smaller using SSE4.1 than it will be using AVX2 or AVX-512. But at the same time, I'd expect One thing that could be happening is that you don't have enough memory to fully cache the file you're testing. If you're actually reading disk rather than cache, that'll be the bottleneck. Maybe look at |
On doing another test based on #32 I disabled mutithreading with env as RAYON_NUM_THREAD=1 and it was faster than any other hash algorithm I used on my spinning disks expect blake2b. It would be nice to document the issues with spinning disks in the README.md of b3sum |
I can confirm this. memmapping and splitting work over the entire file works fine for SSDs but it kills performance on spinning disks. |
Is there a feature flag that exists or can be worked on to help with spinning disks, or will blake3 speeds be SSD only hash function? |
I'm using --no-mmap now as a workaround but obviously a proper fix would be better. |
Currently the |
Generally speaking mmaping is slower than a straightforward read because setting up the mappings has additional overhead. YMMV depending on implementation details. |
That's mainly an issue for small files, and To be clear, the cases were talking about here are large files that don't fit in the page cache, so (once we've worked around the thrashing issue) performance is limited by disk bandwidth. I don't think memory mapping would make a measurable difference if we turned off multi-threading, but I haven't tested it yet. |
Perhaps a proper solution would be something like a |
Solving this issue would be great since one would have to choose blake2 for desktop users as there is no way to know whether the desktop user has an SSD or spinning disk and then choose blake3 when servers are in use since most of them are SSD disks. |
A few things worth clarifying here:
|
This doesn't appear to be true. I have a 32GB RAM workstation, I'm checksumming about 24TB of data, and there's only a few files that are larger than say 10GB. Performance tanks without the workaround. The page cache is likely irrelevant because the files are not in cache and the current |
If the blake3 library doesn't have the issue then that is fine as I am likely to care more about using the library instead of the binary in production. Thanks for clarifying. The b3sum is still an issue though and it should be looked at. Even though I am more likely to use the library in production, I will leave the issue open so that there can be a point of reference for b3sum. |
Linux has the Without an expert opinion to tell me "actually this strategy has worked well in practice for other tools", my guess is that we won't be able to detect caching reliably (not to mention cheaply) in all cases, and that the complexity of integrating all this into |
Just taking a guess at one of the complexities involved: An "in cache" result would be inherently racy. It's possible for a file to drop out of cache in between when the result is returned and when hashing begins, or while hashing is running. Callers relying on |
If increasing complexity is on the table, it would seem more profitable to implement the threading on big sequential blocks of data, instead of spreading the accesses around the file. |
That's a good point. It's possible that we won't be able to get a from-the-front multithreading implementation all the way up to the performance of the recursive one, in which case there could still be some benefit to keeping the recursive codepath in |
Maybe asking Linux VFS people to provide an |
That sounds above my pay grade :) I think the idea that sneves described above is a good candidate for solving this on all platforms. We need expose a way for N threads to work together hashing large-ish blocks at the front of the file, rather than dividing-and-conquering the file (and therefore sending some threads to the back of it). This is necessary to get any multithreading working in a use case like |
If it's possible, I would love to see an implementation that can handle streaming data and still usefully use parallelism. |
Yes, it would be very nice for (For what it's worth, the current implementation does take advantage of SIMD parallelism in its single-threaded/streaming mode.) |
I have worked around the problem in toy version of b3sum I wrote for testing, so I'd like to help if possible. Is it too simple of a solution to repeatedly call |
Thoughts about how to calculate an optimal chunk size would be appreciated. The change in the PR uses 4 MB chunks. |
The fastest method of disk I/O is platform-specific: |
My current branch for this issue is: https://github.com/BLAKE3-team/BLAKE3/tree/mmap_fromthefront It quickly became clear that a regular |
Did I understand it correctly that without |
No.
"memory consumption" is a vague term that doesn't really mean anything. What are you actually measuring and seeing? If you |
BLAKE3 has poor performance on spinning disks when parallelized. See BLAKE3-team/BLAKE3#31 - Replace `skip_model_hash` setting with `hashing_algorithm`. Any algorithm we support is accepted. - Add `random` algorithm: hashes a UUID with BLAKE3 to create a random "hash". Equivalent to the previous skip functionality. - Add `blake3_single` algorithm: hashes on a single thread using BLAKE3, fixes the aforementioned performance issue - Update model probe to accept the algorithm to hash with as an optional arg, defaulting to `blake3` - Update all calls of the probe to use the app's configured hashing algorithm - Update an external script that probes models - Update tests - Move ModelHash into its own module to avoid circuclar import issues
BLAKE3 has poor performance on spinning disks when parallelized. See BLAKE3-team/BLAKE3#31 - Replace `skip_model_hash` setting with `hashing_algorithm`. Any algorithm we support is accepted. - Add `random` algorithm: hashes a UUID with BLAKE3 to create a random "hash". Equivalent to the previous skip functionality. - Add `blake3_single` algorithm: hashes on a single thread using BLAKE3, fixes the aforementioned performance issue - Update model probe to accept the algorithm to hash with as an optional arg, defaulting to `blake3` - Update all calls of the probe to use the app's configured hashing algorithm - Update an external script that probes models - Update tests - Move ModelHash into its own module to avoid circuclar import issues
BLAKE3 has poor performance on spinning disks when parallelized. See BLAKE3-team/BLAKE3#31 - Replace `skip_model_hash` setting with `hashing_algorithm`. Any algorithm we support is accepted. - Add `random` algorithm: hashes a UUID with BLAKE3 to create a random "hash". Equivalent to the previous skip functionality. - Add `blake3_single` algorithm: hashes on a single thread using BLAKE3, fixes the aforementioned performance issue - Update model probe to accept the algorithm to hash with as an optional arg, defaulting to `blake3` - Update all calls of the probe to use the app's configured hashing algorithm - Update an external script that probes models - Update tests - Move ModelHash into its own module to avoid circuclar import issues
When using blake2b I get around 13sec when hashing 941MB file using blake2_bin and rash crates but I get about 30sec when hashing the same file using b3sum crate. The environment is rust 1.40 on Ubuntu 18.04 using Core i3
The text was updated successfully, but these errors were encountered: