Revise ARC shrinker algorithm #10600
Conversation
Force-pushed from 7fdee8d to e69daa0
It looks interesting. I just worry that with threads not getting blocked by constantly entering and exiting sleep, there will be huge congestion on arc_evict_lock. Practically all I/O, plus the reclamation process itself, will require acquisition of the global lock per block.
module/os/freebsd/zfs/arc_os.c
Outdated
```c
/*
 * It is unsafe to block here in arbitrary threads, because we can come
 * here from ARC itself and may hold ARC locks and thus risk a deadlock
 * with ARC reclaim thread.
 */
if (curproc == pageproc)
	(void) cv_wait(&arc_adjust_waiters_cv, &arc_adjust_lock);
mutex_exit(&arc_adjust_lock);
```
You've removed mutex_exit(), while keeping mutex_enter() above. It seems unfinished.
Whoops, yes I meant to remove the mutex_enter() and associated code above. Fixed now.
Codecov Report
@@ Coverage Diff @@
## master #10600 +/- ##
==========================================
- Coverage 79.72% 79.70% -0.03%
==========================================
Files 393 393
Lines 124627 124627
==========================================
- Hits 99365 99334 -31
- Misses 25262 25293 +31
I'm not seeing that change, perhaps a failure of my imagination. Which additional locking are you referring to? The reclamation process already grabs the global lock frequently.
There is no additional locking, only the same lock being taken more often, since workload threads will actually wake up and go back to sleep. I am not saying it was good before. Just thinking whether we could batch reclamation somehow.
@amotin I see. All of this only matters when the ARC is shrinking, due to memory pressure from outside ZFS. Normally the amount of time spent in this state is small, so the impact on average performance should not be very significant one way or another. But this change smooths out performance by reducing very-high-latency events while the ARC is shrinking. That said, I went ahead and measured the lock contention (via a CPU flame graph) while the ARC is shrinking. Locking of the arc_evict_lock was not a significant contributor.
I'd make sure to reduce the recordsize of the dataset to some 8/16K so as not to measure memcpy(), and run a couple dozen `dd` threads. You may be right that it is not the biggest problem, but it looks weird to have a multilist with separate locks to avoid contention, and after taking a random one of them still take a global lock. Sure, there are other places where the separate locks are taken and not this one, but still.
@amotin My test was with recordsize=4k and a single `dd` process.
Codecov Report
@@ Coverage Diff @@
## master #10600 +/- ##
==========================================
+ Coverage 79.65% 79.79% +0.13%
==========================================
Files 394 394
Lines 124631 124644 +13
==========================================
+ Hits 99278 99460 +182
+ Misses 25353 25184 -169
module/os/linux/zfs/arc_os.c
Outdated
```c
 * The default limit of 10,000 (in practice, 160MB per allocation attempt)
 * limits the amount of time spent attempting to reclaim ARC memory to
 * around 100ms per allocation attempt, even with a small average
 * compressed block size of ~8KB.
```
Mentioning that this math assumes 4k pages would make this clearer, particularly these days when larger page sizes are no longer that exotic. 10,000 pages * 4K page * 4 attempts = 160M. Correct?
It's actually x = 10,000 pages * 4KB/page * 2 (at priority=0, the shrinker logic tries to evict double what we have), plus x/2 + x/4 + x/8 + x/16 + ... (for priorities 1, 2, 3, ... that it normally runs through before getting to priority=0).
@ahrens, do you see a real reason to wake up sleepers for every arc_evict_hdr() call? Since there is already a batching mechanism, wouldn't it be more efficient to do it out of the loop? It would give more priority to reclamation, while still running I/Os at least once for every 10 (zfs_arc_evict_batch_limit) reclaimed blocks.
@amotin I see, you're suggesting that we move that code out of the eviction loop. I think that would work. How would that give more priority to reclamation? Presumably this wouldn't delay the waiters significantly. And we don't spend significant time waiting on the arc_evict_lock here (<2% of the CPU in arc_evict).
Right.
By not distracting it too often from its main duties.
The lock is one thing. It may not be big now, but that could change dramatically if you run 20 or 50 dd threads. There is also cv_broadcast(), which means CPU scheduling, etc.; I guess that may be the "__cv_broadca.." you see on the left side of the picture, which visually takes much more time than the locks selected in blue. And BTW, moving it down would reduce the multilist sublist lock hold time, which is also not great, even though there are many of them.
I see, so the goal would be to reduce the total amount of time that the evict thread spends dealing with waking waiters.
That makes sense. Have you seen that problem? It should be the same with or without the changes in this PR.
Yes, but I don't see how the change you're proposing would have any impact on that. Either way, each waiter gets woken once.
Without the changes in this PR, sleeping threads sit flat waiting until arc_is_overflowing() returns false. What I have seen many times, though, is that single global locks do not scale, literally exploding when the number of threads sleeping and waking grows to dozens. That is why I try hard to avoid such scenarios in busy paths.
Once per block evicted, or once per 10. My proposed change trades back some I/O latency for more efficient eviction.
@amotin I'm trying to understand how much efficiency we would gain by batching the lock. I increased the number of `dd` threads. I'll go ahead and make the change.
Thanks.
Needs to be rebased after #10609
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths:

1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, its measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages).

Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts:

1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0d ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`.

To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of `arc_is_overflowing()` on ZFS read/write operations. With this commit:

1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC.
4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free`, is increased.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Totally not suggesting that this PR be held up on it, but: does ZFS have objective benchmarks for ARC efficiency under various use-cases? Ideally automated somewhere? |
Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the arc to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks.
@adamdmoss It does not, which is how the poor ARC behavior has gone unnoticed for so long. It would be wonderful to develop a suite of tests along those lines! |
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10600
Motivation and Context
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths:

1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, its measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages).

Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts:

1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0d ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`.

Description
To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of `arc_is_overflowing()` on ZFS read/write operations.

With this commit:

1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC.
4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free`, is increased.

Note: this PR depends on and contains the commit from #10592
How Has This Been Tested?
Manual testing, applying memory pressure and observing `arcstat` and calls to the shrinker.
ZTS.