
Documentation: Suggest limiting core count on very-multicore machines to avoid kernel bug? #91

Closed
charles-dyfis-net opened this issue Nov 21, 2018 · 7 comments

@charles-dyfis-net

On a 2-socket, 48-core system I've consistently had my I/O subsystem hang irrecoverably in less than 24 hours of operation; reproduced with both kernel 4.14.78 and kernel 4.18.16. Using --thread-count 4 makes this go away.

Perhaps we should either:

  • Prominently advise using --thread-count on very-multicore hardware in the "kernel bugs" wiki page
  • Have a (default?) maximum number of threads limiting the effects of --thread-factor
@Zygo
Owner

Zygo commented Nov 21, 2018

4.14.81 and 4.19.2 contain a lot of btrfs fixes which might help...but more likely won't.

I'm OK with a doc change.

@kakra
Contributor

kakra commented Nov 21, 2018

Does it make sense to have so many IO threads running? Should there be an upper limit? I wonder if there's any point in running more than disk_count to 2*disk_count IO threads. Btrfs currently assigns stripes by PID modulo disk_count, so with an upper bound of 2x the disk count we are probably already issuing IO to all disks at once (see the sketch below).

But an upper limit probably still makes sense for the number-crunching threads, like hashing...
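
A minimal sketch of the PID-modulo model described above (the disk count, the base PID, and the assumption of consecutive worker PIDs are all simplifications for illustration, not bees or btrfs internals):

```cpp
#include <iostream>
#include <set>

int main() {
    const int disk_count = 4;    // hypothetical array size
    const int base_pid = 12345;  // hypothetical PID of the first worker
    std::set<int> disks_hit;
    // With stripe = pid % disk_count, 2 * disk_count workers with
    // consecutive PIDs touch every disk at least twice.
    for (int i = 0; i < 2 * disk_count; ++i) {
        disks_hit.insert((base_pid + i) % disk_count);
    }
    std::cout << "workers: " << 2 * disk_count
              << ", distinct disks hit: " << disks_hit.size() << '\n';
}
```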

@Zygo
Owner

Zygo commented Nov 21, 2018

The IO threads are mixed in with the hashing threads, so it's a bit of a mess at the moment. To really get IO and hashing usefully separated we'd need to rewrite most of the code, implement multiple distinct thread pools and a scheduler, and we'd need one scheduler for spinning disks and a different one for SSDs. And then it would all go to hell if we ever found a match for anything (suddenly we'd need locks on multiple filesystem objects and disks, and would probably end up effectively single-threaded across the entire filesystem).

Experimentally I've found bees goes a little faster if the worker thread count is higher than the disk count, but no faster (maybe even a little slower) if the worker thread count is higher than the CPU core count (at least for the first 8 cores).

I've also found that things like limiting the number of threads executing dedupe or LOGICAL_INO ioctls might help with system stability (though the benefit is small compared to noise in my test environment). Perhaps more --workaround-* options are in order for performance-vs-danger tradeoffs.
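
One possible shape for such a throttle, as a hedged sketch (C++20; the limit of 4 and the do_logical_ino() helper are hypothetical, not existing bees code):

```cpp
#include <semaphore>

// Allow at most 4 threads inside the risky ioctl at a time.
std::counting_semaphore<> ioctl_slots(4);

void guarded_logical_ino(int fd) {
    ioctl_slots.acquire();    // block until a slot is free
    // do_logical_ino(fd);    // hypothetical wrapper around LOGICAL_INO
    ioctl_slots.release();
}
```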

I can put in a soft limit, so --thread-factor would use no more than 8 cores; --thread-count would still let the user pick any number they like.
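
A sketch of how that soft limit could be computed (the function and variable names are hypothetical; only the --thread-factor/--thread-count semantics come from the comment above):

```cpp
#include <algorithm>
#include <thread>

unsigned compute_thread_count(double thread_factor, unsigned thread_count_opt) {
    if (thread_count_opt > 0) {
        return thread_count_opt;  // explicit --thread-count bypasses the cap
    }
    // hardware_concurrency() may return 0 if unknown; fall back to 1.
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    unsigned computed = std::max(1u, static_cast<unsigned>(cores * thread_factor));
    return std::min(computed, 8u);  // soft cap at 8 worker threads
}
```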

@kakra
Contributor

kakra commented Nov 21, 2018

So could a number like (disk_count + hardware_concurrency) / 2 be a good heuristic? For the system in the example it would still create at least 12 threads; I wonder whether that would be stable.
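
For concreteness, the heuristic as a sketch (disk_count would have to be discovered from the filesystem; here it is just a parameter):

```cpp
#include <thread>

unsigned heuristic_threads(unsigned disk_count) {
    // 48 on the 2-socket machine from the original report
    unsigned cores = std::thread::hardware_concurrency();
    return (disk_count + cores) / 2;  // e.g. (2 + 48) / 2 = 25
}
```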

@Zygo
Owner

Zygo commented Nov 21, 2018

I wouldn't try to guess without running a lot of performance experiments on specific hardware configurations. Even if we did that, changing the bees code could instantly invalidate all that data. There are huge gains still possible from relatively small code changes, and I have big code changes planned too.

The number of workers is configurable, and the default (after adding a soft limit for people with huge multi-socket systems) works OK. Users who know better can change it or test assorted values.

Zygo pushed a commit that referenced this issue Nov 22, 2018
#91 describes problems encountered
when running bees on systems with many CPU cores.

Limit the computed number of threads (using --thread-factor or the
default) to a maximum of 8 (i.e. the number of logical cores in a modern
laptop).  Users can override the limit by using --thread-count.

Signed-off-by: Zygo Blaxell <bees@furryterror.org>
@Zygo
Owner

Zygo commented Dec 11, 2018

We have a core-count limit (the second option in the original issue). Can we close this?

@charles-dyfis-net
Author

I certainly consider it fully addressed.

kakra pushed a commit to kakra/bees that referenced this issue Oct 29, 2019