
Yield periodically when rebuilding L2ARC. #11116

Merged
1 commit merged into master on Oct 30, 2020

Conversation

amotin
Member

@amotin amotin commented Oct 27, 2020

Motivation and Context

L2ARC devices of several terabytes filled with 4KB blocks may take 15 minutes to rebuild. Due to the way L2ARC log reading is implemented, it is quite likely that the rebuild thread will never sleep for all that time. At least on FreeBSD, kernel threads have absolute priority and cannot be preempted by threads with lower priorities. If some thread is also bound to that specific CPU, it may not get any CPU time for the full 15 minutes.

Description

This patch solves the issue by adding a cond_resched() call after the processing of every log block.

How Has This Been Tested?

The patch has been tested on FreeBSD by filling a large L2ARC with 4KB blocks, re-importing the pool to start the rebuild, and attempting to read the dev.cpu sysctl tree. Without the patch, some of the sysctls there got stuck trying to bind to their respective CPU cores. With the patch, the sysctl delays are barely noticeable.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)


@amotin amotin added the Status: Code Review Needed Ready for review and testing label Oct 27, 2020
@amotin amotin requested a review from a user October 27, 2020 02:18
@amotin amotin self-assigned this Oct 27, 2020
@codecov

codecov bot commented Oct 27, 2020

Codecov Report

Merging #11116 into master will decrease coverage by 0.03%.
The diff coverage is 60.00%.


@@            Coverage Diff             @@
##           master   #11116      +/-   ##
==========================================
- Coverage   79.81%   79.77%   -0.04%     
==========================================
  Files         398      398              
  Lines      125754   125758       +4     
==========================================
- Hits       100367   100323      -44     
- Misses      25387    25435      +48     
Flag Coverage Δ
#kernel 80.44% <100.00%> (-0.04%) ⬇️
#user 65.93% <0.00%> (-0.26%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
module/zfs/arc.c 89.55% <60.00%> (-0.20%) ⬇️
cmd/zdb/zdb_il.c 30.86% <0.00%> (-24.08%) ⬇️
module/zfs/vdev_raidz.c 89.68% <0.00%> (-4.24%) ⬇️
module/zfs/spa_errlog.c 90.83% <0.00%> (-3.06%) ⬇️
module/os/linux/spl/spl-kmem-cache.c 88.57% <0.00%> (-1.51%) ⬇️
module/os/linux/zfs/zio_crypt.c 79.97% <0.00%> (-1.11%) ⬇️
module/os/linux/zfs/zvol_os.c 87.36% <0.00%> (-0.89%) ⬇️
module/icp/api/kcf_cipher.c 15.76% <0.00%> (-0.83%) ⬇️
module/zfs/zvol.c 84.39% <0.00%> (-0.68%) ⬇️
module/zfs/vdev_rebuild.c 96.73% <0.00%> (-0.66%) ⬇️
... and 53 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5c810ac...f493a4b.

@ghost ghost left a comment

Spotted a typo, but great to get to the bottom of this issue 👍

man/man5/zfs-module-parameters.5: Outdated (resolved)
man/man5/zfs-module-parameters.5: Outdated (resolved)
module/zfs/arc.c: Outdated (resolved)
@amotin amotin force-pushed the l2rb_throttle branch 2 times, most recently from 1e55718 to bc40055 Compare October 27, 2020 13:15
@amotin
Member Author

amotin commented Oct 27, 2020

@gamanakis You've done prefetching too good. ;)

@gamanakis
Contributor

@amotin I am glad you figured it out. Looks good to me!

@mjguzik
Contributor

mjguzik commented Oct 27, 2020

Have you tested this with calling maybe_yield instead?

@amotin
Member Author

amotin commented Oct 27, 2020

Have you tested this with calling maybe_yield instead?

I was thinking about something like this, but missed the priority trick this particular function does. I've tested it and it really seems to work. Thank you very much!

Do you think it should be done instead of, or in addition to, the delay? L2ARC rebuild has nowhere to hurry, really. As I understand it, the process still remains at the top user-space priority.

Do you know maybe_yield()'s counterpart on Linux?

@mjguzik
Contributor

mjguzik commented Oct 27, 2020

Used to be cond_resched or so. Note I'm not confident they have equivalent semantics, but maybe_yield should be fine on FreeBSD at least.

@behlendorf
Contributor

I was going to suggest trying the same thing. You can use cond_resched(), which is the Linux function name; there are existing wrappers for both FreeBSD kern_yield(PRI_USER) and userland sched_yield(). We call cond_resched() at the end of arc_evict_state_impl() for this same reason.

@mjguzik
Contributor

mjguzik commented Oct 27, 2020

Now that you mention it I see 2 definitions of cond_resched which may concern the FreeBSD port:

the zfs repo:

include/os/freebsd/zfs/sys/zfs_context_os.h (lists it twice btw):
#define cond_resched() kern_yield(PRI_USER)

include/sys/zfs_context.h:
#define cond_resched() sched_yield()

Apart from that there is one in the Linux API compat in FreeBSD: compat/linuxkpi/common/include/linux/sched.h
#define cond_resched() do { if (!cold) sched_relinquish(curthread); } while (0)

I don't know what ends up being used in this mess (is the zfs_context.h header for userspace?); I presume for the kernel it's the zfs_context_os.h one.

That said, maybe_yield definitely needs to be tested to confirm it gives the hoped-for result (it should). Past that, it looks like the patch should invoke cond_resched and make sure it does maybe_yield on FreeBSD, while removing the spurious definitions at least from the port.

tl;dr please try cond_resched with this:

diff --git a/include/os/freebsd/zfs/sys/zfs_context_os.h b/include/os/freebsd/zfs/sys/zfs_context_os.h
index 3118c7d96..2962689bb 100644
--- a/include/os/freebsd/zfs/sys/zfs_context_os.h
+++ b/include/os/freebsd/zfs/sys/zfs_context_os.h
@@ -41,7 +41,7 @@
 #include <sys/ccompat.h>
 #include <linux/types.h>
 
-#define        cond_resched()          kern_yield(PRI_USER)
+#define        cond_resched()          maybe_yield()
 #define        uio_prefaultpages(size, uio) (0)
 
 #define        taskq_create_sysdc(a, b, d, e, p, dc, f) \
@@ -57,7 +57,6 @@
 #define        tsd_set(key, value)     osd_thread_set(curthread, (key), (value))
 #define        fm_panic        panic
 
-#define        cond_resched()          kern_yield(PRI_USER)
 extern int zfs_debug_level;
 extern struct mtx zfs_debug_mtx;
 #define        ZFS_LOG(lvl, ...) do {   \

@amotin
Member Author

amotin commented Oct 27, 2020

Thank you, Brian. I've added that macro call.

Though I see that on FreeBSD it maps to kern_yield(PRI_USER), which is more aggressive than maybe_yield(). It should not be a problem now, but maybe_yield() may be easier to use in scenarios where the rate is unknown and a switch is not required by the algorithm.

@amotin
Member Author

amotin commented Oct 27, 2020

@mjguzik I've already tested maybe_yield() in the L2ARC rebuild context and it works just fine, and since the call period is already about 1ms, I don't think it matters much which one is called. On the other hand, I see a number of other cond_resched() calls in tight loops in the dnode code, where maybe_yield() would be a waste of CPU time. On the third hand, arc_evict_state_impl() I think could really benefit from maybe_yield(), since the loop there should be pretty fast despite the batching.

@amotin
Member Author

amotin commented Oct 29, 2020

Any more thoughts/objections about this?

@mjguzik
Contributor

mjguzik commented Oct 29, 2020

The idiom in the FreeBSD kernel, as far as I know, is to maybe_yield. That's what the VFS layer does, fwiw, and a quick grep reveals other places like the VM. With this in mind I think maybe_yield should be the default on FreeBSD, deviated from only with a good reason, which is part of why I asked whether something regressed with using that instead of delay.

Still, I'm not going to insist on one way or the other.

@amotin
Member Author

amotin commented Oct 29, 2020

@mjguzik I just wanted to say that we may need two separate primitives in OpenZFS due to the semantic aspects I've described. That is unrelated to this change, since already-existing code would benefit from having two.

For this PR my only question is whether to leave the delay() call, or whether somebody thinks it is overkill?

@mjguzik
Contributor

mjguzik commented Oct 29, 2020

The delay thing is what I argued is likely slightly pessimal if maybe_yield is employed. Iow, imo it should be removed unless a problem can be demonstrated.

L2ARC devices of several terabytes filled with 4KB blocks may take 15
minutes to rebuild.  Due to the way L2ARC log reading is implemented
it is quite likely that for all that time rebuild thread will never
sleep.  At least on FreeBSD kernel threads have absolute priority and
can not be preempted by threads with lower priorities.  If some thread
is also bound to that specific CPU it may not get any CPU time for all
the 15 minutes.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
@amotin
Member Author

amotin commented Oct 29, 2020

OK, reduced to one-liner.

unless a problem can be demonstrated.

Probably not on a big system; at least it does not block sysctls for me any more as it did before. I was thinking about some dual-core system having several L2ARC devices, but in that case the default throttling would be too low anyway, so all hope is still on the scheduler.

@amotin amotin changed the title Sleep periodically when rebuilding L2ARC. Yield periodically when rebuilding L2ARC. Oct 29, 2020
@behlendorf behlendorf removed the Status: Code Review Needed Ready for review and testing label Oct 30, 2020
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Oct 30, 2020
@behlendorf behlendorf merged commit 1199c3e into openzfs:master Oct 30, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Oct 30, 2020
Reviewed-by: Cedric Berger <cedric@precidata.com>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#11116
behlendorf pushed a commit that referenced this pull request Oct 30, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
@amotin amotin deleted the l2rb_throttle branch August 24, 2021 20:18
Labels
Status: Accepted Ready to integrate (reviewed, tested)
5 participants