b3sum has poor performance for large files on spinning disks, when multi-threading is enabled #31

charleschege opened this issue Jan 17, 2020 · 32 comments · May be fixed by #175

@charleschege

When using BLAKE2b I get around 13 seconds hashing a 941 MB file with the blake2_bin and rash crates, but about 30 seconds hashing the same file with the b3sum crate. The environment is Rust 1.40 on Ubuntu 18.04 with a Core i3.

@oconnor663
Member

Hmm, there's a lot that's weird here. Just focusing on blake2, I'd expect your throughput to be almost a factor of 10 higher than it is. Are you able to reproduce this with multiple runs, when you've killed any other apps running on your machine? Could you tell us your exact processor, just in case?

@charleschege
Author

The CPU is a 64-bit x86_64-linux-gnu Core i3-3217U @ 1.80 GHz, dual core, with SSE4 vector extensions, little endian.

@charleschege
Author

I have tried hashing both a zip file and an mkv file, and b3sum seems to be about 2.5 times slower than the blake2b crates.

@charleschege
Author

So I spun up a cloud compute instance with 3.5 GB RAM and 2 virtual CPUs with AVX-512 vector extensions, and there b3sum is faster than the blake2b crates. I will have to troubleshoot this on my machine, as there might be a problem with my setup. I will close this issue for now. Thanks for looking into it.

@oconnor663
Member

To be fair, the difference between BLAKE2b and BLAKE3 will be a lot smaller using SSE4.1 than it will be using AVX2 or AVX-512. But at the same time, I'd expect b3sum to take advantage of your second core.

One thing that could be happening is that you don't have enough memory to fully cache the file you're testing. If you're actually reading disk rather than cache, that'll be the bottleneck. Maybe look at top to see how much memory you have "free/available", and pick a file size smaller than that.

@charleschege
Author

charleschege commented Mar 13, 2020

On doing another test based on #32, I disabled multithreading with the environment variable RAYON_NUM_THREADS=1, and it was faster than any other hash algorithm I used on my spinning disks except blake2b. It would be nice to document the issues with spinning disks in the README.md of b3sum.

@charleschege charleschege reopened this Mar 13, 2020
@gcp

gcp commented Mar 13, 2020

I can confirm this. Memory-mapping and splitting the work over the entire file works fine on SSDs, but it kills performance on spinning disks.

@charleschege
Author

Is there an existing feature flag, or one that could be added, to help with spinning disks, or will BLAKE3's speed only be achievable on SSDs?

@gcp

gcp commented Mar 13, 2020

I'm using --no-mmap now as a workaround but obviously a proper fix would be better.

@oconnor663
Member

Currently the --no-mmap flag has a similar effect to setting RAYON_NUM_THREADS=1. (The latter still uses memory mapping, so it might be a tad faster, but the access pattern will be totally serial.) Hashing standard input with < or | also has the same effect.
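
For reference, here's a minimal sketch of what the fully serial path boils down to, using the blake3 crate's streaming API (Hasher implements std::io::Write, so io::copy can feed it). The 1 MiB buffer size is an arbitrary choice for illustration, not necessarily what b3sum itself uses:

```rust
use std::fs::File;
use std::io::{self, BufReader};

fn hash_file_serial(path: &str) -> io::Result<blake3::Hash> {
    // Plain buffered reads, front to back, on one thread: roughly the disk
    // access pattern you get from `b3sum --no-mmap` or RAYON_NUM_THREADS=1.
    let file = File::open(path)?;
    let mut reader = BufReader::with_capacity(1 << 20, file);
    let mut hasher = blake3::Hasher::new();
    io::copy(&mut reader, &mut hasher)?;
    Ok(hasher.finalize())
}
```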

@gcp

gcp commented Mar 13, 2020

it might be a tad faster

Generally speaking, mmapping is slower than a straightforward read because setting up the mappings has additional overhead.

YMMV depending on implementation details.

@oconnor663
Member

setting up the mappings has additional overhead.

That's mainly an issue for small files, and b3sum uses a 16 KiB cutoff for that reason.

To be clear, the cases we're talking about here are large files that don't fit in the page cache, so (once we've worked around the thrashing issue) performance is limited by disk bandwidth. I don't think memory mapping would make a measurable difference if we turned off multi-threading, but I haven't tested it yet.

@oconnor663
Member

oconnor663 commented Mar 13, 2020

Perhaps a proper solution would be something like a --serial flag, which tells b3sum to always read the bytes of the file from front to back in order, but which otherwise doesn't preclude memory mapping?

@charleschege
Author

Solving this issue would be great. Otherwise one would have to choose BLAKE2 for desktop users, since there is no way to know whether a desktop user has an SSD or a spinning disk, and choose BLAKE3 only for servers, since most of them use SSDs.

@oconnor663
Member

oconnor663 commented Mar 13, 2020

A few things worth clarifying here:

  • This isn't a BLAKE3 issue per se. This is a b3sum issue. BLAKE3 library code doesn't use multithreading unless you tell it to, for exactly this sort of reason. The performance tradeoffs are complex, and the library doesn't know enough to make a good decision. Instead, the caller needs to decide whether the tradeoffs are worth it (see the sketch after this list).
  • If you want to force serial hashing, without assuming anything about implementation behavior, I suppose you could always do cat file | b3sum.
  • A heuristic like "servers don't have spinning disks" doesn't sound very reliable. It depends on the purpose of the server.
  • Most files fit in the page cache. This issue only applies to files that are so large that they don't fit. That makes this issue relatively more likely to appear in manual benchmarks than in production applications. To give specific advice about how to avoid the issue in production, it probably makes sense to talk about a specific application. (Do you control the hardware it runs on? Is hashing speed a bottleneck? Etc.)
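
As an illustration of the first point, here's a minimal library-level sketch of what the opt-in looks like. It assumes the blake3 crate with the rayon feature enabled; the multithreaded method is called update_rayon in current releases (older releases exposed it as update_with_join):

```rust
// Cargo.toml (assumed): blake3 = { version = "1", features = ["rayon"] }

fn hash_buffer(data: &[u8], multithreaded: bool) -> blake3::Hash {
    let mut hasher = blake3::Hasher::new();
    if multithreaded {
        // Explicit opt-in: the library never spawns threads on its own.
        hasher.update_rayon(data);
    } else {
        // The default path is single-threaded (though still SIMD-accelerated).
        hasher.update(data);
    }
    hasher.finalize()
}
```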

@gcp

gcp commented Mar 13, 2020

Most files fit in the page cache. This issue only applies to files that are so large that they don't fit.

This doesn't appear to be true. I have a 32 GB RAM workstation, I'm checksumming about 24 TB of data, and there are only a few files larger than, say, 10 GB. Performance tanks without the workaround.

The page cache is likely irrelevant because the files are not in cache, and the current b3sum code starts reading them in parallel from a disk that seeks slowly. It would likely work better if it read in serial and then processed the data in parallel, but that's not what it does. (And yes, I know you can't do this in general; I'm just explaining why the page cache is no help against the thrashing.)

@charleschege
Author

A few things worth clarifying here:

* This isn't a BLAKE3 issue per se. This is a `b3sum` issue. BLAKE3 library code doesn't use multithreading unless you tell it to, for exactly this sort of reason. The performance tradeoffs are complex, and the library doesn't know enough to make a good decision. Instead, the caller needs to decide whether the tradeoffs are worth it.

* If you want to force serial hashing, without assuming anything about implementation behavior, I suppose you could always do `cat file | b3sum`.

* A heuristic like "servers don't have spinning disks" doesn't sound very reliable. It depends on the purpose of the server.

* Most files fit in the page cache. This issue only applies to files that are so large that they don't fit. That makes this issue relatively more likely to appear in manual benchmarks than in production applications. To give specific advice about how to avoid the issue in production, it probably makes sense to talk about a specific application. (Do you control the hardware it runs on? Is hashing speed a bottleneck? Etc.)

If the blake3 library doesn't have the issue, then that is fine, as I am more likely to use the library than the binary in production. Thanks for clarifying. b3sum still has the issue, though, and it should be looked at.

Even though I am more likely to use the library in production, I will leave the issue open so that there can be a point of reference for b3sum.

@oconnor663 oconnor663 changed the title from "Poor speed" to "b3sum has poor performance for large files on spinning disks, when multi-threading is enabled" Mar 13, 2020
@oconnor663
Member

Linux has the fincore utility, based on the mincore libc function. It looks like the way you use it is that you memory map a file (or some window of it, in a loop), and that function iterates over every page and tells you which ones are in memory. That sounds expensive to me, but I haven't experimented with it yet. Maybe it could be sped up with e.g. random sampling, but that starts to sound complicated. Even if it works in a satisfactory way on Linux, mincore is not a Posix function, and I don't know how well it's supported elsewhere. There's also Windows to think about.

Without an expert opinion to tell me "actually this strategy has worked well in practice for other tools", my guess is that we won't be able to detect caching reliably (not to mention cheaply) in all cases, and that the complexity of integrating all this into b3sum will not be worth it.
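
For anyone curious, here's a rough sketch of what such a residency check could look like on Linux, using the memmap2 and libc crates (the crate choices and the scan over all pages are assumptions for illustration; this is an experiment, not something b3sum does):

```rust
use std::fs::File;

// Returns the fraction of a file's pages currently resident in the page
// cache, via mincore(2). Linux-specific; not portable.
fn resident_fraction(path: &str) -> std::io::Result<f64> {
    let file = File::open(path)?;
    let map = unsafe { memmap2::Mmap::map(&file)? };
    if map.is_empty() {
        return Ok(1.0);
    }
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as usize;
    let num_pages = (map.len() + page_size - 1) / page_size;
    let mut vec = vec![0u8; num_pages];
    let ret = unsafe {
        libc::mincore(map.as_ptr() as *mut libc::c_void, map.len(), vec.as_mut_ptr())
    };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    // The low bit of each entry indicates whether that page is resident.
    let resident = vec.iter().filter(|&&b| b & 1 != 0).count();
    Ok(resident as f64 / num_pages as f64)
}
```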

@oconnor663
Member

Just taking a guess at one of the complexities involved: An "in cache" result would be inherently racy. It's possible for a file to drop out of cache in between when the result is returned and when hashing begins, or while hashing is running. Callers relying on b3sum to be smart about their spinning disks might run into this racy performance cliff in rare cases. (E.g. when the updatedb cronjob happens to kick on.)

@sneves
Collaborator

sneves commented Mar 13, 2020

If increasing complexity is on the table, it would seem more profitable to implement the threading on big sequential blocks of data, instead of spreading the accesses around the file.

@oconnor663
Member

That's a good point. It's possible that we won't be able to get a from-the-front multithreading implementation all the way up to the performance of the recursive one, in which case there could still be some benefit to keeping the recursive codepath in b3sum, but I'm not sure.

@baryluk

baryluk commented May 27, 2020

Maybe another option is asking the Linux VFS people to provide an ioctl to query whether the file is on a device that has the rotational bit set (or to return a maximum recommended I/O parallelism and block size hints for it)? The kernel knows this, and if the file system is using a single device (or multiple devices of the same type) it could propagate this up and make it available to user space (and to stacked file systems). Not sure what it should answer for networked or FUSE file systems though, as those can be rather opaque.
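
For what it's worth, the kernel already exposes the rotational flag through sysfs (it's what lsblk reports), so user space can make a best-effort check today, though mapping a file to the right sysfs node is fiddly. A hedged, Linux-only sketch; the partition-to-parent-disk fallback and the path layout are assumptions about typical sysfs structure:

```rust
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::PathBuf;

// Best-effort read of /sys/dev/block/<major>:<minor>/queue/rotational for
// the device backing `path`. Returns None for network/FUSE/dm setups where
// the node isn't where we expect it.
fn is_rotational(path: &str) -> Option<bool> {
    let dev = fs::metadata(path).ok()?.dev();
    // Decode major/minor the way glibc's gnu_dev_major/gnu_dev_minor do.
    let major = ((dev >> 8) & 0xfff) | ((dev >> 32) & !0xfff);
    let minor = (dev & 0xff) | ((dev >> 12) & !0xff);
    let base = PathBuf::from(format!("/sys/dev/block/{}:{}", major, minor));
    // For partitions, the queue/ directory lives on the parent disk.
    for candidate in [base.join("queue/rotational"), base.join("../queue/rotational")] {
        if let Ok(contents) = fs::read_to_string(&candidate) {
            return Some(contents.trim() == "1");
        }
    }
    None
}
```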

@oconnor663
Member

That sounds above my pay grade :)

I think the idea that sneves described above is a good candidate for solving this on all platforms. We need to expose a way for N threads to work together hashing large-ish blocks at the front of the file, rather than dividing-and-conquering the file (and therefore sending some threads to the back of it). This is necessary to get any multithreading working in a use case like cat foo | b3sum, where seeking isn't possible in the first place, and it will have the side effect of avoiding this thrashing issue on spinning disks. Open questions: 1) How much of this needs to be supported at the library crate level? 2) Will the performance be as good as before in the common case, or would this be another tradeoff for the application to consider?

@joshtriplett

If it's possible, I would love to see an implementation that can handle streaming data and still usefully use parallelism.

@oconnor663
Member

If it's possible, I would love to see an implementation that can handle streaming data and still usefully use parallelism.

Yes, it would be very nice for cat file | b3sum, among many other things. If I can get it to, say, 80+% of the performance of the memory mapping version, I think it should be the default. I expect it's going to be doable, though now that I'm gainfully employed I'll need to wait for a clear weekend to tackle it :)

(For what it's worth, the current implementation does take advantage of SIMD parallelism in its single-threaded/streaming mode.)

@lefth

lefth commented May 23, 2021

I have worked around the problem in a toy version of b3sum I wrote for testing, so I'd like to help if possible. Is it too simple a solution to repeatedly call update_with_join on a slice of a few megabytes at a time, rather than calling it once on one large buffer? In practice it works well (though I'm using a buffer, not mmap). I don't know enough to rewrite update_with_join to operate on sequential blocks, but updating the caller is easy.
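
For context, a minimal sketch of that caller-side approach (assuming the blake3 crate with the rayon feature; update_rayon is the current name for what older releases exposed as update_with_join, and the chunk size here is just an example):

```rust
use std::fs::File;
use std::io::{self, Read};

// Read the file front to back in fixed-size chunks, and let the library
// parallelize only within each chunk. Disk access stays sequential, but
// hashing each chunk still fans out across threads.
fn hash_file_chunked(path: &str) -> io::Result<blake3::Hash> {
    const CHUNK_SIZE: usize = 4 * 1024 * 1024; // a few MiB per chunk
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; CHUNK_SIZE];
    let mut hasher = blake3::Hasher::new();
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update_rayon(&buf[..n]);
    }
    Ok(hasher.finalize())
}
```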

@lefth lefth linked a pull request May 24, 2021 that will close this issue
@lefth

lefth commented May 24, 2021

Thoughts about how to calculate an optimal chunk size would be appreciated. The change in the PR uses 4 MB chunks.

@DemiMarie

DemiMarie commented Nov 1, 2021

The fastest method of disk I/O is platform-specific: io_uring on Linux, IoRing on Windows, and kernel AIO on FreeBSD. I am not sure what it is for macOS or the other BSDs. In all cases one should use O_DIRECT to avoid thrashing the page cache.
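
To make the O_DIRECT constraint concrete: the read offset, length, and buffer address generally all have to be aligned to the device's logical block size, so a plain Vec won't do as the buffer. A hedged, Linux-only sketch (the 4096-byte alignment is an assumption; a real implementation would query the device and handle the tail of the file more carefully):

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::fs::OpenOptions;
use std::io::Read;
use std::os::unix::fs::OpenOptionsExt;

fn hash_file_o_direct(path: &str) -> std::io::Result<blake3::Hash> {
    const ALIGN: usize = 4096; // assumed logical block size
    const BUF_SIZE: usize = 1 << 20; // 1 MiB, a multiple of ALIGN
    let mut file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)?;
    // O_DIRECT needs an aligned buffer, so allocate one by hand.
    let layout = Layout::from_size_align(BUF_SIZE, ALIGN).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null());
    unsafe { std::ptr::write_bytes(ptr, 0, BUF_SIZE) };
    let buf = unsafe { std::slice::from_raw_parts_mut(ptr, BUF_SIZE) };
    let mut hasher = blake3::Hasher::new();
    let result = loop {
        match file.read(buf) {
            Ok(0) => break Ok(hasher.finalize()),
            Ok(n) => {
                hasher.update(&buf[..n]);
            }
            Err(e) => break Err(e),
        }
    };
    unsafe { dealloc(ptr, layout) };
    result
}
```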

@oconnor663
Member

oconnor663 commented Nov 19, 2021

My current branch for this issue is: https://github.com/BLAKE3-team/BLAKE3/tree/mmap_fromthefront

It quickly became clear that a regular Read::read() loop was a bottleneck, when the reads were actually coming from RAM, and when there were many worker threads to feed. That's no problem when the reads are actually coming from disk, but I want an approach that works well for both, so that we don't need OS magic to tell us which one we're about to do. My current approach is mmap-based.

@jarppiko

jarppiko commented Mar 5, 2023

Did I understand correctly that without the --no-mmap option, b3sum reads the whole file into memory before calculating the checksum? I was wondering why I see 10 GB+ memory consumption when checking the checksums of very large files.

@gcp

gcp commented Mar 5, 2023

Did I understand correctly that without the --no-mmap option, b3sum reads the whole file into memory before calculating the checksum?

No. mmap() does not "read the file into memory". It presents the file as if it were a memory range, but if you only access the first byte, only the first byte will ever be read from disk (I guess technically the first 4K, but...)

I was wondering why I get 10GB+ memory consumption when checking checksums of very large files.

"memory consumption" is a vague term that doesn't really mean anything. What are you actually measuring and seeing?

If you mmap() a 10G file, 10G of the address space will be taken up. But this consumes 0 bytes of physical memory.
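
A small sketch of that distinction, using the memmap2 crate (the crate choice is an assumption for illustration): mapping makes the whole file addressable up front, but physical pages are only faulted in as the hash loop actually touches them, and they live in the shared page cache rather than as private process memory.

```rust
use std::fs::File;

fn hash_file_mmap(path: &str) -> std::io::Result<blake3::Hash> {
    let file = File::open(path)?;
    // VIRT grows by the file size here, but nothing is read from disk yet.
    let map = unsafe { memmap2::Mmap::map(&file)? };
    let mut hasher = blake3::Hasher::new();
    // Pages are faulted in (and show up as RES/SHR page-cache memory) only
    // as update() walks through the mapping.
    hasher.update(&map);
    Ok(hasher.finalize())
}
```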

@jarppiko

jarppiko commented Mar 6, 2023

Apologies for being vague here. This got my attention when the VIRT, RES, and SHR values in Linux htop were on the high side (see below). But now I realize that the total memory usage (across all processes) did not increase significantly, so those values seem to include cached data.

[screenshot: htop output showing high VIRT, RES, and SHR values for the b3sum process]

psychedelicious added a commit to invoke-ai/InvokeAI that referenced this issue Mar 13, 2024
BLAKE3 has poor performance on spinning disks when parallelized. See BLAKE3-team/BLAKE3#31

- Replace `skip_model_hash` setting with `hashing_algorithm`. Any algorithm we support is accepted.
- Add `random` algorithm: hashes a UUID with BLAKE3 to create a random "hash". Equivalent to the previous skip functionality.
- Add `blake3_single` algorithm: hashes on a single thread using BLAKE3, fixes the aforementioned performance issue
- Update model probe to accept the algorithm to hash with as an optional arg, defaulting to `blake3`
- Update all calls of the probe to use the app's configured hashing algorithm
- Update an external script that probes models
- Update tests
- Move ModelHash into its own module to avoid circular import issues