Run arc_evict thread at higher priority #12397

Merged

merged 1 commit into openzfs:master on Aug 10, 2021

Conversation

tonynguien
Contributor

This was previously #12151; I have a new repo and couldn't update
that PR.

Motivation and Context

Run the arc_evict thread at higher priority (nice=0) to give it more CPU
time, which can improve performance for workloads with heavy ARC eviction
activity.

On mixed read/write and sequential read workloads, I've seen 10-40%
better performance.

Description

Extend zthr_create() and zthr_create_timer() to take a new priority_t
argument, and create arc_evict_zthr with the defclsyspri (nice=0) priority
value rather than the current minimum priority, minclsyspri (nice=19).
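
For context, a rough sketch of the interface change follows. It is based on the description above; the exact prototypes, argument names, and call sites in the OpenZFS tree may differ slightly:

/*
 * Illustrative sketch only -- not necessarily the exact OpenZFS code.
 * zthr_create() and zthr_create_timer() gain a trailing priority
 * argument, and the ARC eviction thread is created at defclsyspri
 * (nice=0) instead of minclsyspri (nice=19).
 */
zthr_t *zthr_create(const char *zthr_name, zthr_checkfunc_t *checkfunc,
    zthr_func_t *func, void *arg, pri_t pri);
zthr_t *zthr_create_timer(const char *zthr_name, zthr_checkfunc_t *checkfunc,
    zthr_func_t *func, void *arg, hrtime_t max_sleep, pri_t pri);

/* In arc_init(), the eviction thread is now created at default priority: */
arc_evict_zthr = zthr_create("arc_evict",
    arc_evict_cb_check, arc_evict_cb, NULL, defclsyspri);

The practical effect is that the eviction thread competes for CPU on equal footing with ordinary nice=0 tasks rather than yielding to nearly everything else on the system.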

Observed performance gains with nice=0 and nice=-20 compared to the current nice=19:

VM config: 16 vCPU, 32 GB RAM, 4 x 200 GB disks
Workload: 32 threads x 1M sequential reads
minclsyspri (nice 19) current value: 1301 IOPS, 1365 MB/s, 37-45% idle
defclsyspri (nice 0): 1860 IOPS, 1951 MB/s, 22-25% idle
maxclsyspri (nice -20): 1929 IOPS, 2023 MB/s, 20-22% idle

====================
VM config: 8 vCPU, 32 GB RAM, 4 x 200 GB disks
Workload: 32 threads x 1M sequential reads
minclsyspri (nice 19) current value: 915 IOPS, 960 MB/s, 8-12% idle
defclsyspri (nice 0): 1135 IOPS, 1191 MB/s, 6-10% idle
maxclsyspri (nice -20): 1455 IOPS, 1527 MB/s, 1-1.5% idle

====================
VM config: 1 vCPU, 24 GB RAM, 4 x 200 GB disks
Workload: 16 threads x 1M sequential reads
minclsyspri (nice 19) current value: 251 IOPS, 251 MB/s, 0% idle
defclsyspri (nice 0): 251 IOPS, 251 MB/s, 0% idle
maxclsyspri (nice -20): 240 IOPS, 240 MB/s, 0% idle

How Has This Been Tested?

Manual tests with sequential reads and mixed read/writes with a larger pool size.

The ZFS performance test suite also shows gains, though smaller ones, for
sequential reads and sequential writes: up to 8% and 3%, respectively.
Test setup: 8 vCPU, 32 GB memory, and 4 x 100 SSD disks

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Run arc_evict thread at higher priority, nice=0, to give it more CPU
time which can improve performance for workload with high ARC evict
activities.

On mixed read/write and sequential read workloads, I've seen between
10-40% better performance.

Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
@tonynguien tonynguien requested review from amotin and behlendorf July 19, 2021 21:52
@tonynguien tonynguien added the Status: Code Review Needed Ready for review and testing label Jul 19, 2021
@bghira

bghira commented Jul 19, 2021

can this be tunable?

@tonynguien
Contributor Author

can this be tunable?

This change does not make the arc_evict thread priority a tunable, as it's not clear whether thread priority should be a tunable.

On Linux, the renice command can change the priority on the fly if someone wants to experiment and find the optimal value. Perhaps we'll gather more collective data to determine whether it's worth making it a tunable.
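
For anyone scripting that experiment, renice(1) is ultimately a wrapper around the setpriority(2) syscall, and a minimal userspace sketch of the same adjustment is shown below. The PID of the arc_evict kernel thread is not hard-coded here and would have to be looked up first (e.g. via ps or /proc); raising priority (a negative nice value) requires CAP_SYS_NICE or root.

/*
 * set_nice.c -- hypothetical helper: set the nice value of an existing
 * thread or process, the programmatic equivalent of renice(1).
 * Build: cc -o set_nice set_nice.c
 * Usage: ./set_nice <pid> <nice>
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>

int
main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <nice>\n", argv[0]);
		return (1);
	}

	pid_t pid = (pid_t)atoi(argv[1]);
	int nice_val = atoi(argv[2]);

	/* setpriority(2) is the syscall renice(1) uses under the hood. */
	if (setpriority(PRIO_PROCESS, pid, nice_val) != 0) {
		perror("setpriority");
		return (1);
	}

	printf("pid %d nice value set to %d\n", (int)pid, nice_val);
	return (0);
}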

@adamdmoss
Contributor

I think this is shippable, though I also suspect that this is working around an underlying implementation issue, as thread priority shouldn't be such a big issue unless (obviously...) there are so many threads in a runnable state that the scheduler has to actually make a decision over who gets to run.

Really really often.

And if the contention isn't coming from elsewhere (i.e. userspace or a dumb driver), then zfs threads are fighting zfs threads for runtime; when you say 0% idle have you measured where the time is being spent? Does this include iowait time? (Is compression and encryption in the mix?)

Perhaps this is a victim of one of the various dragons in arc_wait_for_eviction() and the actual eviction implementation? TMI: For example, the old ghost list accounting (fixed a week ago) would make us think we evicted more than we did, aggravating an async ARC_OVF_SOME condition into an ARC_OVF_SEVERE condition which would block read threads while waiting for eviction, and (still an outstanding issue?) another issue would make arc_evict() want to evict more than we need to (arc_evict_meta_balanced() can evict both metadata and data, but arc_evict() evaluates its ARC targets based on the arc data+metadata sizes from before it calls arc_evict_meta(), causing unnecessary eviction and of course unnecessary extra work).

TL;DR: yeah, this is probably working around some other bugs. :)

@amotin
Member

amotin commented Jul 21, 2021

While we probably can improve more things around ARC reclamation in addition to what has already been done, it is clear to me that a single arc_evict thread is more prone to CPU bottlenecks than, say, dozen(s) of ZIO threads. Also, that thread does not create more load than necessary; it only follows the demand. And if it can't keep up, it slows down all other threads, no matter what their priority is. (Thinking about this, some sort of priority propagation could be interesting...) From that perspective, a higher priority for this thread makes sense to me. The only thing we may wish to review is the specific priority levels used for all the different threads on different OSes.

@tonynguien
Contributor Author

I think this is shippable, though I also suspect that this is working around an underlying implementation issue, as thread priority shouldn't be such a big issue unless (obviously...) there are so many threads in a runnable state that the scheduler has to actually make a decision over who gets to run.

Really really often.

And if the contention isn't coming from elsewhere (i.e. userspace or a dumb driver), then zfs threads are fighting zfs threads for runtime; when you say 0% idle have you measured where the time is being spent? Does this include iowait time? (Is compression and encryption in the mix?)

Perhaps this is a victim of one of the various dragons in arc_wait_for_eviction() and the actual eviction implementation? TMI: For example, the old ghost list accounting (fixed a week ago) would make us think we evicted more than we did, aggravating an async ARC_OVF_SOME condition into an ARC_OVF_SEVERE condition which would block read threads while waiting for eviction, and (still an outstanding issue?) another issue would make arc_evict() want to evict more than we need to (arc_evict_meta_balanced() can evict both metadata and data, but arc_evict() evaluates its ARC targets based on the arc data+metadata sizes from before it calls arc_evict_meta(), causing unnecessary eviction and of course unnecessary extra work).

TL;DR: yeah, this is probably working around some other bugs. :)

The change here gives the arc_evict thread more CPU time when there's high eviction demand, e.g. heavy I/O load with a working set much larger than the ARC size. We've seen cases where most NFS requests were waiting on arc_wait_for_eviction(). On systems with light load, I agree this change may not help.

Yes, there are likely other ways to improve ARC eviction, though they're orthogonal, IMO.

Compression was enabled and encryption was not. In the 0% idle case, most of the time was in iowait, and the 1-vCPU test shows there isn't any gain when there are no spare CPU cycles.

@tonynguien
Contributor Author

While we probably can improve more things around ARC reclamation in addition to what has already been done, it is clear to me that a single arc_evict thread is more prone to CPU bottlenecks than, say, dozen(s) of ZIO threads. Also, that thread does not create more load than necessary; it only follows the demand. And if it can't keep up, it slows down all other threads, no matter what their priority is. (Thinking about this, some sort of priority propagation could be interesting...) From that perspective, a higher priority for this thread makes sense to me. The only thing we may wish to review is the specific priority levels used for all the different threads on different OSes.

Umm, priority propagation, that's interesting.

Also I'm curious if different hardware profiles or loads would be better served with different thread priorities.

@amotin
Member

amotin commented Jul 21, 2021

Umm, priority propagation, that's interesting.

We sort of have it indirectly for dbuf cache reclaim, since each separate thread can help with the reclamation if the dedicated thread can't keep up. But ARC reclamation is more complicated due to different types of caches (mfu/mru, data/metadata). If we could dramatically simplify that somehow...

@mmatuska
Contributor

mmatuska commented Jul 29, 2021

@behlendorf any special reason why this PR is not in the accepted state yet?

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 29, 2021
@behlendorf
Contributor

@mmaybee has been shepherding this PR, but it looks ready to merge to me.

@AndCycle

I want to mention a thread priority change that I forgot to follow up on years ago; this can be traced back to 1229323.

The old priority values around this were:

#define	minclsyspri	60
#define	maxclsyspri	99

From these values, my guess is that Solaris ZFS stays within the system class priority range of 60 to 99
https://flylib.com/books/en/2.830.1.41/1/
so it never needs to fight for CPU with user-space programs.

We probably shouldn't leave minclsyspri as the lowest value on Linux, which easily falls into the pitfall of a CPU-hungry system.

Just my two cents.

@mmaybee mmaybee merged commit 6bc61d2 into openzfs:master Aug 10, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 23, 2021
Run arc_evict thread at higher priority, nice=0, to give it more CPU
time which can improve performance for workload with high ARC evict
activities.

On mixed read/write and sequential read workloads, I've seen between
10-40% better performance.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes openzfs#12397
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 24, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 24, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 24, 2021
behlendorf pushed a commit that referenced this pull request Aug 31, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 15, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Sep 22, 2021