Yield periodically when rebuilding L2ARC. #11116
Codecov Report
@@ Coverage Diff @@
## master #11116 +/- ##
==========================================
- Coverage 79.81% 79.77% -0.04%
==========================================
Files 398 398
Lines 125754 125758 +4
==========================================
- Hits 100367 100323 -44
- Misses 25387 25435 +48
Spotted a typo, but great to get to the bottom of this issue 👍
1e55718 to bc40055
@gamanakis You've done prefetching too good. ;)
@amotin I am glad you figured it out. Looks good to me!
Have you tested this with a call to maybe_yield instead?
I was thinking about something like this, but missed the priority trick this particular function does. I've tested it and it really seems to work. Thank you very much! Do you think it should be done instead or in addition? The L2ARC rebuild is in no hurry, really. As I understand it, the process still remains at the top user-space priority. Do you know maybe_yield()'s counterpart on Linux?
Used to be cond_resched or so. Note I'm not confident they have equivalent semantics, but maybe_yield should be fine on FreeBSD at least.
I was going to suggest trying the same thing. You can use |
Now that you mention it I see 2 definitions of cond_resched which may concern the FreeBSD port: the zfs repo: include/os/freebsd/zfs/sys/zfs_context_os.h (lists it twice btw): include/sys/zfs_context.h: Apart from that there is one in the Linux API compat in FreeBSD: compat/linuxkpi/common/include/linux/sched.h. I don't know what ends up being used in this mess (is the zfs_context.h header for userspace?), I presume for kernel it's the zfs_context_os.h one. That said, maybe_yield definitely needs to be tested to give the hoped for result (it should). Past that, looks like the patch should tl;dr please try cond_resched with this:
Thank you Brian. I've added that macro call. Though I see that on FreeBSD it maps to kern_yield(PRI_USER), which is more aggressive than maybe_yield(). It should not be a problem now, but maybe_yield() may be easier to use in some scenarios where the rate is unknown and a switch is not required by the algorithm.
@mjguzik I've already tested maybe_yield() in the L2ARC rebuild context and it works just fine, and since the call period is already about 1ms I don't think it matters much which one is called. On the other side, I see a number of other cond_resched() calls in tight loops in the dnode code, and there maybe_yield() would be a waste of CPU time. On the third side, arc_evict_state_impl() I think could really benefit from maybe_yield(), since the loop there should be pretty fast despite the batching.
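To make the distinction discussed above concrete, here is a minimal userspace sketch of a conditional yield: it gives up the CPU only once a time budget has elapsed since the last yield, whereas an unconditional yield switches on every call. This is an illustration only, not the FreeBSD implementation of maybe_yield(). The names `yield_state`, `should_yield_sketch` and `maybe_yield_sketch` are hypothetical, the caller supplies the current time so the policy can be tested deterministically, and POSIX `sched_yield()` stands in for the kernel primitive.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for a conditional yield (not a kernel structure). */
struct yield_state {
	uint64_t last_yield_ns;	/* time of the last yield, in nanoseconds */
	uint64_t budget_ns;	/* how long the caller may run before yielding */
	unsigned long yields;	/* number of yields actually performed */
};

/* Policy only: has the caller used up its budget since the last yield? */
static bool
should_yield_sketch(const struct yield_state *ys, uint64_t now_ns)
{
	assert(ys != NULL);
	return (now_ns - ys->last_yield_ns >= ys->budget_ns);
}

/*
 * Conditional yield: cheap when called often, because most calls return
 * without touching the scheduler. Returns true if a yield was performed.
 */
static bool
maybe_yield_sketch(struct yield_state *ys, uint64_t now_ns)
{
	if (!should_yield_sketch(ys, now_ns))
		return (false);
	(void) sched_yield();	/* userspace stand-in for the kernel yield */
	ys->last_yield_ns = now_ns;
	ys->yields++;
	return (true);
}
```

With a 1ms budget, a call made 0.5ms after the previous yield returns immediately, while a call made at or after 1ms yields and resets the timer. That is the property that makes a conditional variant convenient when the call rate is not known in advance.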
Any more thoughts/objections about this?
The idiom in the FreeBSD kernel as far as I know is to maybe_yield. That's what the vfs layer is doing fwiw and a quick grep reveals other places like the vm. With this in mind I think maybe_yield should be the default on FreeBSD and only deviated from with a good reason, which is part of why I asked if something regressed with using that instead of delay. Still, I'm not going to insist on one way or the other.
@mjguzik I just wanted to say that we may need two separate primitives in OpenZFS due to the semantic aspects I've described. It is unrelated to this change, since already existing code would benefit from having two. For this PR my only question is whether to leave the delay() call, or does somebody think it is overkill?
The delay thing is what I argued is likely slightly pessimal if maybe_yield is employed. Iow, imo it should be removed unless a problem can be demonstrated.
L2ARC devices of several terabytes filled with 4KB blocks may take 15 minutes to rebuild. Due to the way L2ARC log reading is implemented it is quite likely that for all that time rebuild thread will never sleep. At least on FreeBSD kernel threads have absolute priority and can not be preempted by threads with lower priorities. If some thread is also bound to that specific CPU it may not get any CPU time for all the 15 minutes. Signed-off-by: Alexander Motin <mav@FreeBSD.org>
OK, reduced to one-liner.
Probably not on a big system, at least it does not block sysctls for me any more as it did before. I was thinking about some dual-core system having several L2ARC devices, but in that case the default throttling would be too low, so all hope is still on the scheduler.
L2ARC devices of several terabytes filled with 4KB blocks may take 15 minutes to rebuild. Due to the way L2ARC log reading is implemented it is quite likely that for all that time rebuild thread will never sleep. At least on FreeBSD kernel threads have absolute priority and can not be preempted by threads with lower priorities. If some thread is also bound to that specific CPU it may not get any CPU time for all the 15 minutes. Reviewed-by: Cedric Berger <cedric@precidata.com> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes openzfs#11116
Motivation and Context
L2ARC devices of several terabytes filled with 4KB blocks may take 15 minutes to rebuild. Due to the way L2ARC log reading is implemented, it is quite likely that the rebuild thread will never sleep during all that time. At least on FreeBSD, kernel threads have absolute priority and cannot be preempted by threads with lower priorities. If some other thread is bound to that specific CPU, it may not get any CPU time for the whole 15 minutes.
Description
This patch solves the issue by adding a cond_resched() call after processing each log block.
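The shape of the change can be sketched in userspace as a loop that offers the CPU to other threads after every block. This is not the actual L2ARC rebuild code: `rebuild_sketch`, `process_log_block`, `yield_point` and the `yield_calls` counter are hypothetical names, the per-block work is a dummy computation, and POSIX `sched_yield()` is assumed as a stand-in for cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Counts how many times the loop offered the CPU (for illustration only). */
static unsigned long yield_calls;

/* Hypothetical stand-in for cond_resched(): let other runnable threads in. */
static void
yield_point(void)
{
	yield_calls++;
	(void) sched_yield();
}

/* Dummy CPU-bound work standing in for restoring one log block. */
static unsigned long
process_log_block(size_t idx)
{
	unsigned long sum = 0;

	for (unsigned long i = 0; i < 1000; i++)
		sum += (idx + 1) * i;
	return (sum);
}

/*
 * Process nblocks log blocks, yielding after each one so that a long
 * rebuild never monopolizes the CPU. Returns the number of blocks done.
 */
static size_t
rebuild_sketch(size_t nblocks, unsigned long *checksum)
{
	size_t done = 0;

	assert(checksum != NULL);
	*checksum = 0;
	for (size_t i = 0; i < nblocks; i++) {
		*checksum += process_log_block(i);
		done++;
		yield_point();	/* one yield per processed block */
	}
	return (done);
}
```

Yielding once per block keeps the bookkeeping trivial; the trade-off is one scheduler call per block, which the thread above considers acceptable given that blocks are processed roughly once per millisecond.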
How Has This Been Tested?
The patch has been tested on FreeBSD by filling a large L2ARC with 4KB blocks, re-importing the pool to start the rebuild, and attempting to read the dev.cpu sysctl tree. Without the patch, some of the sysctls there got stuck trying to bind to their respective CPU cores. With the patch, the sysctl delays are barely noticeable.