
Revise ARC shrinker algorithm #10600

Merged: 16 commits into openzfs:master on Aug 1, 2020

Conversation

@ahrens ahrens commented Jul 20, 2020

Motivation and Context

The ARC shrinker callback arc_shrinker_count/_scan() is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:

  1. "direct reclaim": The system is attempting to allocate a page, but we
    are low on memory. The ARC shrinker callback is invoked from the
    page-allocation code path.

  2. "indirect reclaim": kswapd notices that there aren't many free pages,
    so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, its measurement of released pages does not
include pages that are freed via __free_pages(), which is how the ARC
releases memory (via abd_free_chunks()). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
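
To make the mismatch concrete, here is a minimal userspace simulation of the loop described above. It is not kernel code; all identifiers (`scan_cb`, `cache_size`, `cache_min`, the retry loop) are illustrative stand-ins for the shrinker callback, the ARC, `arc_c_min`, and the kernel's reclaim logic:

```c
#include <stdio.h>

static long cache_size = 1000;     /* ARC-like cache, in pages */
static const long cache_min = 100; /* analogue of arc_c_min */

/*
 * Shrinker-style callback: pages go straight back to the allocator
 * (as __free_pages() does), so none of them ever land on the caller's
 * reclaimable lists -- the only thing the caller measures.
 */
static long
scan_cb(long nr_to_scan)
{
	long freed = (cache_size - cache_min < nr_to_scan) ?
	    cache_size - cache_min : nr_to_scan;
	cache_size -= freed;
	return (0);		/* caller-visible "reclaimed" pages */
}

int
main(void)
{
	long reclaimable = 0;
	int calls = 0;

	/* Kernel-style loop: keep retrying while no progress is visible. */
	while (reclaimable == 0 && cache_size > cache_min) {
		reclaimable += scan_cb(128);
		calls++;
	}
	(void) printf("callback invoked %d times; cache collapsed to "
	    "%ld pages\n", calls, cache_size);
	return (0);
}
```

Running it shows the callback being invoked over and over until the cache collapses to its floor, which is exactly the behavior described below.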

Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to arc_c_min. This has several
negative impacts:

  1. ZFS doesn't use RAM to cache data effectively.

  2. In the direct reclaim case, a single page allocation may wait a long
    time (e.g. more than a minute) while we evict the entire ARC.

  3. Even with the improvements made in 67c0f0d ("ARC shrinking blocks
    reads/writes"), occasionally arc_size may stay above arc_c for the
    entire time of the ARC collapse, thus blocking ZFS read/write operations
    in arc_get_data_impl().

Description

To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.

With this commit:

  1. We limit the amount of data that can be reclaimed from the ARC via
    the "direct reclaim" shrinker. This limits the amount of time it takes
    to allocate a single page.

  2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
    Instead we rely on arc_evict_zthr to monitor free memory and reduce
    the ARC target size to keep sufficient free memory in the system. Note
    that we can't simply rely on limiting the amount that we reclaim at once
    (as for the direct reclaim case), because kswapd's "boosted" logic can
    invoke the callback an unlimited number of times (see
    balance_pgdat()).

  3. When arc_is_overflowing() and we want to allocate memory,
    arc_get_data_impl() will wait only for a multiple of the requested
    amount of data to be evicted, rather than waiting for the ARC to no
    longer be overflowing (a minimal sketch of this wait logic follows
    this list). This allows ZFS reads/writes to make progress even while
    the ARC is overflowing, while also ensuring that the eviction thread
    makes progress towards reducing the total amount of memory used by
    the ARC.

  4. The amount of memory that the ARC always tries to keep free for the
    rest of the system, arc_sys_free, is increased.
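
Item 3 is the subtle one, so here is a minimal pthread sketch of that wait logic. This is an assumed shape, not the actual arc_get_data_impl()/arc_wait_for_eviction() code; wait_for_eviction() and note_evicted() are hypothetical helpers, and the 2x multiple follows the discussion in the review thread below:

```c
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t evict_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t evict_cv = PTHREAD_COND_INITIALIZER;
static uint64_t evict_count;	/* running total of bytes evicted */

/* Called by a reader/writer that needs memory while the ARC overflows. */
void
wait_for_eviction(uint64_t bytes)
{
	(void) pthread_mutex_lock(&evict_lock);
	/*
	 * Wait only until a multiple (here 2x) of our own request has
	 * been evicted, not until the overflow has fully cleared.
	 */
	uint64_t target = evict_count + 2 * bytes;
	while (evict_count < target)
		(void) pthread_cond_wait(&evict_cv, &evict_lock);
	(void) pthread_mutex_unlock(&evict_lock);
}

/* Called by the eviction thread as it frees buffers. */
void
note_evicted(uint64_t bytes)
{
	(void) pthread_mutex_lock(&evict_lock);
	evict_count += bytes;
	/* Wake everyone; threads whose targets aren't met re-sleep. */
	(void) pthread_cond_broadcast(&evict_cv);
	(void) pthread_mutex_unlock(&evict_lock);
}
```

Each waiter makes progress as soon as the eviction thread has done a bounded amount of work on its behalf, instead of every waiter sleeping until arc_is_overflowing() clears.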

Note: this PR depends on and contains the commit from #10592

How Has This Been Tested?

Manual testing: applied memory pressure and observed arcstat output and calls to the shrinker.
ZTS.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@ahrens ahrens requested review from behlendorf, a user, grwilson and pzakha July 20, 2020 16:55
@ahrens ahrens force-pushed the arc branch 2 times, most recently from 7fdee8d to e69daa0 Compare July 20, 2020 17:08

@amotin amotin left a comment

It looks interesting. I just worry that, with threads not getting blocked but instead constantly entering and exiting sleep, there will be huge contention on arc_evict_lock. Practically all I/O, plus the reclamation process itself, will require acquiring a global lock per block.

```c
	/*
	 * It is unsafe to block here in arbitrary threads, because we can come
	 * here from ARC itself and may hold ARC locks and thus risk a deadlock
	 * with ARC reclaim thread.
	 */
	if (curproc == pageproc)
		(void) cv_wait(&arc_adjust_waiters_cv, &arc_adjust_lock);
	mutex_exit(&arc_adjust_lock);
```

You've removed mutex_exit(), while keeping mutex_enter() above. It seems unfinished.

ahrens (author) replied:

Whoops, yes I meant to remove the mutex_enter() and associated code above. Fixed now.


codecov bot commented Jul 21, 2020

Codecov Report

Merging #10600 into master will decrease coverage by 0.02%.
The diff coverage is 78.12%.


@@            Coverage Diff             @@
##           master   #10600      +/-   ##
==========================================
- Coverage   79.72%   79.70%   -0.03%     
==========================================
  Files         393      393              
  Lines      124627   124627              
==========================================
- Hits        99365    99334      -31     
- Misses      25262    25293      +31     
Flag Coverage Δ
#kernel 80.17% <27.08%> (-0.04%) ⬇️
#user 66.06% <84.14%> (+0.06%) ⬆️


Impacted Files Coverage Δ
include/os/linux/zfs/sys/trace_arc.h 93.33% <0.00%> (-6.67%) ⬇️
module/os/linux/zfs/arc_os.c 36.73% <50.00%> (-22.77%) ⬇️
module/zfs/arc.c 89.05% <83.13%> (-0.53%) ⬇️
cmd/zvol_id/zvol_id_main.c 76.31% <0.00%> (-5.27%) ⬇️
module/zfs/vdev_indirect.c 77.16% <0.00%> (-4.17%) ⬇️
lib/libzfs/os/linux/libzfs_mount_os.c 68.86% <0.00%> (-2.84%) ⬇️
module/os/linux/spl/spl-kmem-cache.c 82.02% <0.00%> (-2.81%) ⬇️
module/zcommon/zfs_fletcher.c 75.65% <0.00%> (-2.64%) ⬇️
module/zfs/dsl_synctask.c 92.40% <0.00%> (-2.54%) ⬇️
... and 58 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


ahrens commented Jul 21, 2020

@amotin

I just worry that, with threads not getting blocked but instead constantly entering and exiting sleep, there will be huge contention on arc_evict_lock. Practically all I/O, plus the reclamation process itself, will require acquiring a global lock per block.

I'm not seeing that change, perhaps a failure of my imagination. Which additional locking are you referring to?

The reclamation process already grabs the global lock for every block, in arc_evict_state_impl().

The reclamation process also already grabs the global lock frequently, in arc_shrinker_scan() on Linux; I assume the equivalent on FreeBSD is arc_lowmem(), which also grabs the lock.


amotin commented Jul 21, 2020

There is no additional locking, only the same lock being taken more often, since workload threads will actually wake up and go back to sleep. I am not saying it was good before. Just thinking whether we could batch reclamation somehow.


ahrens commented Jul 21, 2020

@amotin I see: in arc_get_data_impl(), we previously (essentially) waited for !arc_is_overflowing(), which could be a long time, but now it's only waiting for 2x the requested space to be evicted, so it will wake up sooner, which is my goal. The end result is a higher throughput of user requests. But each user request might also need to grab the global lock, and this lock contention could limit the performance increase that this PR delivers.

All of this only matters when the ARC is shrinking, due to memory pressure from outside ZFS. Normally the amount of time spent in this state is small, so the impact to average performance should not be very significant one way or another. But this change smooths out performance by reducing very-high latency events while the ARC is shrinking.

That said, I went ahead and measured the lock contention (via a CPU flame graph) while the ARC is shrinking. Locking of the arc_evict_lock via the user workload (in this case, dd) is barely visible, and much smaller than other lock contention:

[CPU flame graph while the ARC is shrinking]

The highlighted purple frames are those containing `mutex`. The lock in question is the purple frame above `arc_wait_...`, above `arc_get_data_impl`. `arc_wait_for_eviction()` is about 1% of the CPU time in `zfs_read()` while the ARC is shrinking.


amotin commented Jul 21, 2020

I'd make sure to reduce the recordsize of the dataset to some 8/16K so as not to measure memcpy(), and to run a couple dozen dd instances to increase concurrency.

You may be right that it is not the biggest problem, but it looks weird to have a multilist with separate locks to avoid contention, and then, after taking a random one of them, still take a global lock. Sure, there are other places where the separate locks are taken without this one, but still.


ahrens commented Jul 21, 2020

@amotin My test was with recordsize=4k and a single dd process.


codecov bot commented Jul 21, 2020

Codecov Report

Merging #10600 into master will increase coverage by 0.13%.
The diff coverage is 98.41%.


@@            Coverage Diff             @@
##           master   #10600      +/-   ##
==========================================
+ Coverage   79.65%   79.79%   +0.13%     
==========================================
  Files         394      394              
  Lines      124631   124644      +13     
==========================================
+ Hits        99278    99460     +182     
+ Misses      25353    25184     -169     
Flag Coverage Δ
#kernel 80.26% <96.82%> (+0.07%) ⬆️
#user 66.23% <97.72%> (+0.44%) ⬆️


Impacted Files Coverage Δ
module/zfs/arc.c 89.88% <97.77%> (+0.88%) ⬆️
include/os/linux/zfs/sys/trace_arc.h 100.00% <100.00%> (ø)
module/os/linux/zfs/arc_os.c 62.50% <100.00%> (+2.99%) ⬆️
cmd/zvol_id/zvol_id_main.c 76.31% <0.00%> (-5.27%) ⬇️
module/zfs/bpobj.c 86.86% <0.00%> (-4.29%) ⬇️
module/lua/lmem.c 83.33% <0.00%> (-4.17%) ⬇️
module/os/linux/spl/spl-kmem-cache.c 75.74% <0.00%> (-2.39%) ⬇️
module/zfs/zil.c 91.33% <0.00%> (-1.79%) ⬇️
module/zfs/vdev_mirror.c 94.79% <0.00%> (-1.49%) ⬇️
module/zfs/space_map.c 92.92% <0.00%> (-1.22%) ⬇️
... and 59 more


@behlendorf behlendorf added the Status: Code Review Needed (ready for review and testing) and Component: Memory Management (kernel memory management) labels Jul 22, 2020
```c
 * The default limit of 10,000 (in practice, 160MB per allocation attempt)
 * limits the amount of time spent attempting to reclaim ARC memory to
 * around 100ms per allocation attempt, even with a small average compressed
 * block size of ~8KB.
```
A contributor commented:

Mentioning that this math assumes 4K pages would make this clearer, particularly these days when larger page sizes are no longer that exotic. 10,000 pages * 4K/page * 4 attempts = 160M. Correct?

ahrens (author) replied:

It's actually x = 10,000 pages * 4KB/page * 2 = 80MB (at priority=0, the shrinker logic tries to evict double what we have), plus x/2 + x/4 + x/8 + x/16 + ... (for the priorities 1, 2, 3, ... that it normally does before getting to priority=0), which sums to about 2x = 160MB.
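
Spelling that sum out (assuming 4K pages, per the question above):

```
x      = 10,000 pages * 4 KiB/page * 2 =  80 MiB   (priority 0 alone)
x/2 + x/4 + x/8 + ...                 ~=  80 MiB   (priorities 1, 2, 3, ...)
total  = x + x/2 + x/4 + ...          ~= 2x = 160 MiB per allocation attempt
```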


amotin commented Jul 23, 2020

@ahrens, do you see a real reason to wake up sleepers for every arc_evict_hdr() call? Since there is already a batching mechanism, wouldn't it be more efficient to do it outside the loop? It would give more priority to reclamation, while still running I/Os at least once per 10 (zfs_arc_evict_batch_limit) reclaimed blocks.


ahrens commented Jul 23, 2020

@amotin I see, you're suggesting that we move the code in arc_evict_state_impl() that's under the arc_evict_lock to outside the loop, and increment the arc_evict_count based on the accumulated bytes_evicted rather than evicted.

I think that would work. How would that give more priority to reclamation? Presumably this wouldn't delay the waiters significantly. And we don't spend significant time waiting on the arc_evict_lock here (<2% of the CPU in arc_evict, mutex_enter is highlighted):

[CPU flame graph of arc_evict()]


amotin commented Jul 23, 2020

@amotin I see, you're suggesting that we move the code in arc_evict_state_impl() that's under the arc_evict_lock to outside the loop, and increment the arc_evict_count based on the accumulated bytes_evicted rather than evicted.

Right.

I think that would work. How would that give more priority to reclamation?

By not distracting it too often from its main duties.

Presumably this wouldn't delay the waiters significantly. And we don't spend significant time waiting on the arc_evict_lock here (<2% of the CPU in arc_evict, mutex_enter is highlighted):

The lock is one thing. It may not be big now, but that could change dramatically if you run 20 or 50 dd threads. There is also cv_broadcast(), which means CPU scheduling, etc.; I guess that may be the "__cv_broadca.." you see on the left side of the picture, which visually takes much more time than the locks selected in blue.

And BTW, moving it down would also reduce the multilist sublist lock hold time, which is also not great, even though there are many of them.


ahrens commented Jul 23, 2020

By not distracting it too often from its main duties.

I see, so the goal would be to reduce the total amount of time that the evict thread spends dealing with waking waiters.

The lock is one thing. It may not be big now, but that could change dramatically if you run 20 or 50 dd threads.

That makes sense. Have you seen that problem? It should be the same with or without the changes in this PR.

There is also cv_broadcast(), which means CPU scheduling, etc.; I guess that may be the "__cv_broadca.." you see on the left side of the picture, which visually takes much more time than the locks selected in blue.

Yes, but I don't see how the change you're proposing would have any impact on that. Either way, each waiter gets woken once.


amotin commented Jul 23, 2020

The lock is one thing. It may not be big now, but that could change dramatically if you run 20 or 50 dd threads.

That makes sense. Have you seen that problem? It should be the same with or without the changes in this PR.

Without the changes in this PR, sleeping threads sit flat waiting until arc_is_overflowing() returns false. What I have seen many times, though, is that single global locks do not scale, literally exploding when the number of threads sleeping and waking grows to dozens. That is why I try hard to avoid such scenarios in busy paths.

There is also cv_broadcast(), which means CPU scheduling, etc.; I guess that may be the "__cv_broadca.." you see on the left side of the picture, which visually takes much more time than the locks selected in blue.

Yes, but I don't see how the change you're proposing would have any impact on that. Either way, each waiter gets woken once.

Once per block evicted, or once per 10. My proposed change trades back some I/O latency for more efficient eviction.


ahrens commented Jul 23, 2020

@amotin I'm trying to understand how much efficiency we would gain by batching the lock. I increased the number of dd threads to match the number of CPUs (8 in my case) and made the change you're suggesting. I can see that the time in mutex_lock goes from 6% to 1% of the time in arc_evict(), which is nice. However, the rate at which arc_evict() is able to evict data is the same in either case (1.0GB/sec, with recordsize=4k).

I'll go ahead and make the change.
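
For reference, a sketch of what that batching change looks like (illustrative only; the real change is in `arc_evict_state_impl()`). `EVICT_BATCH_LIMIT` and `evict_one_header()` are hypothetical stand-ins, and `note_evicted()` is the helper from the earlier wait-for-eviction sketch:

```c
#include <stdint.h>

#define	EVICT_BATCH_LIMIT	10	/* cf. zfs_arc_evict_batch_limit */

extern void note_evicted(uint64_t bytes);	/* bumps counter, broadcasts */

/* Hypothetical: evict one header under the sublist lock, return bytes. */
static uint64_t
evict_one_header(void)
{
	return (4096);
}

static uint64_t
evict_sublist_batch(void)
{
	uint64_t bytes_evicted = 0;

	/*
	 * Accumulate locally; do not touch the global lock or wake
	 * waiters once per header.
	 */
	for (int i = 0; i < EVICT_BATCH_LIMIT; i++)
		bytes_evicted += evict_one_header();

	/* One global-lock acquisition and one broadcast per batch. */
	note_evicted(bytes_evicted);
	return (bytes_evicted);
}
```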

@amotin amotin left a comment


Thanks.

@ghost ghost left a comment


Needs to be rebased after #10609

ahrens added 15 commits July 29, 2020 20:22
@behlendorf behlendorf added the Status: Accepted (ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed label Jul 30, 2020
@adamdmoss

Totally not suggesting that this PR be held up on it, but: does ZFS have objective benchmarks for ARC efficiency under various use-cases? Ideally automated somewhere?

Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
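
As a companion to the simulation in the Motivation section, here is the same userspace loop once the callback reports the pages it actually freed (again purely illustrative, with hypothetical names): the caller now sees progress and stops after a bounded number of calls, which is what makes the kswapd hook safe to re-enable.

```c
#include <stdio.h>

static long cache_size = 1000;     /* ARC-like cache, in pages */
static const long cache_min = 100; /* analogue of arc_c_min */

/* Shrinker-style callback that now reports its progress. */
static long
scan_cb(long nr_to_scan)
{
	long freed = (cache_size - cache_min < nr_to_scan) ?
	    cache_size - cache_min : nr_to_scan;
	cache_size -= freed;
	return (freed);		/* progress feedback to the caller */
}

int
main(void)
{
	long goal = 256, reclaimed = 0;
	int calls = 0;

	while (reclaimed < goal && cache_size > cache_min) {
		reclaimed += scan_cb(128);
		calls++;
	}
	(void) printf("%d calls, %ld pages reclaimed, cache at %ld pages\n",
	    calls, reclaimed, cache_size);
	return (0);
}
```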

ahrens commented Jul 31, 2020

@adamdmoss It does not, which is how the poor ARC behavior has gone unnoticed for so long. It would be wonderful to develop a suite of tests along those lines!

@behlendorf behlendorf merged commit 3442c2a into openzfs:master Aug 1, 2020
@youzhongyang youzhongyang mentioned this pull request Aug 19, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10600
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
Labels
Component: Memory Management (kernel memory management); Status: Accepted (ready to integrate: reviewed, tested)
5 participants