Run arc_evict thread at higher priority #12397
Conversation
Run arc_evict thread at higher priority, nice=0, to give it more CPU time, which can improve performance for workloads with high ARC evict activity. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
can this be tunable?
This change does not make the arc_evict thread priority a tunable, as it's not clear thread priority should be a tunable. On Linux, the …
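(For reference, if the priority were ever made tunable on Linux it would presumably be exposed as a module parameter. The fragment below is purely hypothetical and not part of this PR; the parameter name `zfs_arc_evict_pri` is invented, and it only illustrates the existing `ZFS_MODULE_PARAM` convention used in `module/zfs/arc.c`.)

```c
/*
 * Hypothetical only -- not part of this PR.  If the arc_evict priority
 * were exposed as a tunable, it would likely follow the existing
 * ZFS_MODULE_PARAM convention in module/zfs/arc.c, for example:
 */
static int zfs_arc_evict_pri = 0;	/* invented name: nice value for arc_evict */

ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_pri, INT, ZMOD_RW,
	"nice value used for the arc_evict thread");
```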
I think this is shippable, though I also suspect that this is working around an underlying implementation issue, as thread priority shouldn't be such a big issue unless (obviously...) there are so many threads in a runnable state that the scheduler has to actually make a decision over who gets to run. Really, really often. And if the contention isn't coming from elsewhere (i.e. userspace or a dumb driver), then zfs threads are fighting zfs threads for runtime. Perhaps this is a victim of one of the various dragons in arc_wait_for_eviction() and the actual eviction implementation?

TMI: For example, the old ghost list accounting (fixed a week ago) would make us think we evicted more than we did, aggravating an async ARC_OVF_SOME condition into an ARC_OVF_SEVERE condition, which would block read threads while waiting for eviction. And (still an outstanding issue?) another issue would make arc_evict() want to evict more than we need to: arc_evict_meta_balanced() can evict both metadata and data, but arc_evict() evaluates its ARC targets based on the ARC data+metadata sizes from before it calls arc_evict_meta(), causing unnecessary eviction and, of course, unnecessary extra work.

TL;DR: yeah, this is probably working around some other bugs. :)
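(For readers unfamiliar with the mechanism being referenced: very roughly, allocating threads check how far the ARC is over its target and only sleep in arc_wait_for_eviction() when the overshoot is severe. The fragment below is a simplified illustration of that tiered decision, not the actual OpenZFS source; the real check lives in/around arc_is_overflowing(), and the helper name here is invented.)

```c
/*
 * Simplified illustration only -- not the actual OpenZFS code.
 * A small overshoot just nudges the arc_evict thread; a large one
 * makes the calling (e.g. read) thread block until space is evicted.
 */
#include <stdint.h>

typedef enum { ARC_OVF_NONE, ARC_OVF_SOME, ARC_OVF_SEVERE } arc_ovf_level_t;

static arc_ovf_level_t
arc_overflow_level(uint64_t arc_size, uint64_t arc_target, uint64_t overflow)
{
	if (arc_size <= arc_target + overflow)
		return (ARC_OVF_NONE);		/* nothing to do */
	if (arc_size <= arc_target + 2 * overflow)
		return (ARC_OVF_SOME);		/* wake arc_evict asynchronously */
	return (ARC_OVF_SEVERE);		/* callers block until eviction catches up */
}
```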
While we probably can improve more things around ARC reclamation in addition to what is already done, it is clear to me that a single arc_evict thread is more prone to CPU bottlenecks than, let's say, dozen(s) of ZIO threads. Also, that thread does not create more load than necessary; it only follows the demand. And if it can't keep up, it slows down all other threads, no matter what their priority is. (Thinking about this, some sort of priority propagation could be interesting...) From that perspective, a higher priority for this thread makes sense to me. The only thing we may wish to review is the specific priority levels used for all the different threads on the different OSes.
The change here gives the arc_evict thread more CPU time when there's high eviction demand, e.g. heavy I/O load with a working set size much larger than the ARC size. We've seen cases where most NFS requests were waiting on arc_wait_for_eviction(). On systems with a light load, I agree this change may not help. Yes, there are likely other ways to improve ARC eviction, though they're orthogonal, IMO. Compression was enabled and there was no encryption. In the 0% idle case, most of the time was in iowait, and that 1 CPU test shows there isn't any gain if there are no available CPU cycles.
Umm, priority propagation, that's interesting. I'm also curious whether different hardware profiles or loads would be better served by different thread priorities.
We sort of have it indirectly for dbuf cache reclaim, since each separate thread can help with the reclamation if the dedicated thread can't keep up. But ARC reclamation is more complicated due to different types of caches (mfu/mru, data/metadata). If we could dramatically simplify that somehow... |
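(As a concrete illustration of that indirect mechanism: after growing the dbuf cache, the caller itself can evict an entry when the cache is over its high-water mark, so reclamation keeps pace even if the dedicated dbuf_evict thread is starved of CPU. The sketch below is simplified, not the exact OpenZFS source; the helper names dbuf_cache_target_bytes(), dbuf_cache_hiwater_bytes(), and dbuf_evict_one() are from dbuf.c, while the wrapper function name is invented.)

```c
/*
 * Simplified sketch of the "help out" pattern mentioned above
 * (see dbuf_evict_notify() in module/zfs/dbuf.c for the real thing).
 */
static void
dbuf_cache_caller_assist(uint64_t cache_size)
{
	if (cache_size > dbuf_cache_target_bytes()) {
		/* Above the high-water mark, the caller evicts one entry itself. */
		if (cache_size > dbuf_cache_hiwater_bytes())
			dbuf_evict_one();
	}
}
```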
@behlendorf any special reason why this PR is not in accepted state yet? |
@mmaybee has been shepherding this PR, but it looks ready to merge to me. |
I want to mention a thread priority change that I forgot to follow up on years ago. The old priority value around this is
from this value, my guess is that the value Solaris ZFS stays on probably shouldn't be left. Just my two cents.
Run arc_evict thread at higher priority, nice=0, to give it more CPU time which can improve performance for workload with high ARC evict activities. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #12397
This was #12151. I have a new repo and couldn't update
that PR.
Motivation and Context
Run the arc_evict thread at a higher priority, nice=0, to give it more CPU
time, which can improve performance for workloads with high ARC evict
activity.
On mixed read/write and sequential read workloads, I've seen between
10-40% better performance.
Description
zthr_create() and zthr_create_timer() are extended to take a new priority_t
argument, and arc_evict_zthr is now created with the defclsyspri (nice=0)
priority value rather than the current minimum priority, minclsyspri (nice=19).
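A rough sketch of what the change amounts to (the type and callback names follow the description above and the existing code in module/zfs/arc.c; the diff itself is the authoritative version):

```c
/* zthr.h: zthr_create() and zthr_create_timer() gain a priority argument. */
extern zthr_t *zthr_create(const char *zthr_name,
    zthr_checkfunc_t checkfunc, zthr_func_t func, void *arg,
    priority_t priority);

/* arc.c: create arc_evict_zthr at defclsyspri (nice=0) instead of the
 * previous minclsyspri (nice=19). */
arc_evict_zthr = zthr_create("arc_evict",
    arc_evict_cb_check, arc_evict_cb, NULL, defclsyspri);
```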
Observed performance gains with nice=0 and nice=-20 compared to the current nice=19:
VM config: 16 vCPU, 32 GB RAM, 4 x 200 GB disks
Workload: 32 threads x 1M sequential reads
minclsyspri (nice 19) current value: 1301 IOPS, 1365 MB/s, 37-45% idle
defclsyspri (nice 0): 1860 IOPS, 1951 MB/s, 22-25% idle
maxclsyspri (nice -20): 1929 IOPS, 2023 MB/s, 20-22% idle
====================
VM config: 8 vCPU, 32 GB RAM, 4 x 200 GB disks
Workload: 32 threads x 1M sequential reads
minclsyspri (nice 19) current value: 915 IOPS, 960 MB/s, 8-12% idle
defclsyspri (nice 0): 1135 IOPS, 1191 MB/s, 6-10% idle
maxclsyspri (nice -20): 1455 IOPS, 1527 MB/s, 1-1.5% idle
====================
VM config: 1 vCPU, 24 GB RAM, 4 x 200 GB disks
Workload: 16 threads x 1M sequential reads
minclsyspri (nice 19) current value: 251 IOPS, 251 MB/s, 0% idle
defclsyspri (nice 0): 251 IOPS, 251 MB/s, 0% idle
maxclsyspri (nice -20): 240 IOPS, 240 MB/s, 0% idle
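Taken together, the defclsyspri numbers above work out to roughly a 43% throughput gain over minclsyspri on the 16 vCPU machine (1951 vs. 1365 MB/s) and roughly 24% on the 8 vCPU machine (1191 vs. 960 MB/s), while the 1 vCPU case is flat, which is consistent with the earlier observation that the change only helps when there are spare CPU cycles for arc_evict to use.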
How Has This Been Tested?
Manual tests with sequential reads and mixed reads/writes with a larger pool size.
The ZFS performance test suite also shows smaller but still positive gains for
sequential reads and sequential writes, up to 8% and 3%, respectively.
Test setup: 8 vCPU, 32 GB memory, and 4 x 100 GB SSD disks