
Linux: Set spl_kmem_cache_slab_limit when page size !4K #12152

Merged
tonynguien merged 1 commit into openzfs:master on Jun 3, 2021

Conversation

behlendorf
Contributor

Motivation and Context

It was observed on systems using a 64k page size that running
zpool scrub could exhaust memory and trigger the OOM killer.
This should never be possible.

Issues #11429 #11574 #12150

Description

For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take into account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to PAGE_SIZE on systems using larger pages. On 4K page systems,
16,384 bytes was experimentally determined to yield the best
performance, so that value remains the cutoff there and there is
no functional change on those systems.

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.
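
As a rough illustration of the cutoff described above, the selection
boils down to something like the sketch below. This is only a sketch,
not the actual SPL code; spl_cache_alloc() stands in for the SPL kmem
cache path and is a hypothetical name.

    /*
     * Illustrative sketch only -- not the SPL implementation itself.
     * Objects at or below the limit use the kernel's slab allocator;
     * larger objects fall back to the SPL kmem cache (represented by
     * the hypothetical spl_cache_alloc()).
     */
    #include <linux/slab.h>

    extern unsigned int spl_kmem_cache_slab_limit;   /* cutoff in bytes */
    void *spl_cache_alloc(size_t size, gfp_t flags); /* hypothetical */

    static void *
    demo_cache_alloc(struct kmem_cache *kernel_cache, size_t obj_size,
        gfp_t flags)
    {
        if (spl_kmem_cache_slab_limit != 0 &&
            obj_size <= spl_kmem_cache_slab_limit)
            return (kmem_cache_alloc(kernel_cache, flags));

        return (spl_cache_alloc(obj_size, flags));
    }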

How Has This Been Tested?

Locally compiled but it has not yet been tested on a system
using a large page size.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Type: Architecture and Status: Code Review Needed labels May 29, 2021
@AttilaFueloep
Contributor

AttilaFueloep commented May 29, 2021

Wouldn't it be better to make this a runtime (module load time) selection rather than a compile-time one? AFAIK AArch64 Linux supports 4K and 16K pages as well.

@behlendorf
Contributor Author

@AttilaFueloep 4k, 16k, and 64k pages are all supported for aarch64, but a specific page size needs to be selected when compiling the kernel. So there shouldn't be an issue setting this at compile time.

    CONFIG_<ARCH>_4K_PAGES
    CONFIG_<ARCH>_16K_PAGES
    CONFIG_<ARCH>_64K_PAGES
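
If it helps to picture it, a compile-time default of the sort described
in the PR text would look roughly like the following. This is just a
sketch of the mechanism (SPL_SLAB_LIMIT_DEFAULT is a made-up name),
not the actual patch.

    /* Sketch only: pick the default cutoff from the kernel's page size. */
    #include <linux/mm.h>   /* PAGE_SIZE, PAGE_SHIFT */

    #if PAGE_SHIFT == 12                           /* 4K pages */
    #define SPL_SLAB_LIMIT_DEFAULT  16384
    #else                                          /* 16K, 64K, ... pages */
    #define SPL_SLAB_LIMIT_DEFAULT  PAGE_SIZE
    #endif

    unsigned int spl_kmem_cache_slab_limit = SPL_SLAB_LIMIT_DEFAULT;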

@AttilaFueloep
Contributor

Yes, but what happens if you compile the module on a box with a 4K page size kernel and then load it on a box with a 64K kernel? Then there is a discrepancy between the compile-time and runtime page sizes. Or am I missing something here?

@AttilaFueloep
Contributor

I think we should at least detect page size mismatches and refuse to load the module if so.

@behlendorf
Contributor Author

I think we should at least detect page size mismatches and refuse to load the module if so.

I believe this should already be the case even without additional changes. Though I'm not having much luck finding any documentation to confirm this.

@behlendorf
Contributor Author

@omarkilani thanks for the testing. I opted for setting the cutoff at the page size because the kernel's slab implementation(s) should allocate a single page per slab. As long as the object size is less than the page size I'd expect the kernel's slab to be both faster and more space efficient since it's highly optimized. Your testing seems to back that up. Where that story changes a bit is when allocations start requiring multiple pages.

@AttilaFueloep
Contributor

I believe this should already be the case even without additional changes. Though I'm not having much luck finding any documentation to confirm this.

Yeah, I couldn't find any documentation either, and even grepping the Linux and kmod sources didn't bring up anything revealing. Unfortunately I don't have a box supporting multiple page sizes to test this on, so I'll take this for granted; it seems to be quite a corner case anyway.

As a side note, not directly related to your change, the mixed usage of PAGESIZE and PAGE_SIZE I see in the sources puzzles me a bit. Skimming the sources, my understanding is that PAGESIZE is defined as PAGE_SIZE in kernel space and as sysconf(_SC_PAGESIZE) in user space (ztest). If that's true, wouldn't using PAGE_SIZE outside of #ifdef _KERNEL blocks defeat that approach, and shouldn't PAGESIZE always be preferred over PAGE_SIZE? Further, I don't see the same handling for PAGE_SHIFT in user space outside of abd_os.c. I'm not sure whether all of this could lead to some obscure ztest failures in the page size mismatch case.
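
For reference, the shim being described is roughly of the following
shape (a hedged sketch of how such a wrapper is usually done, not a
verbatim copy of the header in the tree):

    /*
     * Sketch of a PAGESIZE portability shim: in kernel context it can
     * simply alias PAGE_SIZE; in user space (ztest and friends) the
     * page size has to be queried at runtime via sysconf(3).
     */
    #ifdef _KERNEL
    #include <linux/mm.h>
    #define PAGESIZE    PAGE_SIZE
    #else
    #include <unistd.h>
    #define PAGESIZE    ((size_t)sysconf(_SC_PAGESIZE))
    #endif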

@behlendorf
Contributor Author

the mixed usage of PAGESIZE and PAGE_SIZE I see in the sources puzzles me a bit.

The mixed usage comes from the fact that PAGESIZE is what was used on illumos, so you're right that it's the portable interface which should be used in all the common code. We could add the same wrappers for PAGESHIFT, but it turns out there were never any consumers in the common code, so they never got added.

In the platform-specific code (module/os/<platform>/) there really isn't any issue with using the native interfaces in the right contexts. In general that's what we've tried to do, but we don't have a hard-and-fast rule about it. In this case I opted for PAGE_SIZE since this is platform-specific code which is only used by the kernel.

@AttilaFueloep
Contributor

I see, thanks for the thorough explanation. No reason to change anything then.

@ahrens
Member

ahrens commented Jun 1, 2021

@behlendorf

I opted for setting the cutoff at the page size because the kernel's slab implementation(s) should allocate a single page per slab. As long as the object size is less than the page size I'd expect the kernel's slab to be both faster and more space efficient since it's highly optimized.

FYI that isn't true for caches with entries >=256 bytes, at least on PAGESIZE=4K systems:

sdb> slabs -o name,entry_size,entries_per_slab
name                        entry_size entries_per_slab
--------------------------- ---------- ----------------
kmalloc-96                          96               42 (1 page per slab)
kmalloc-128                        128               32 (1 page per slab)
kmalloc-192                        192               21 (1 page per slab)
kmalloc-256                        256               32 (2 pages per slab)
kmalloc-512                        512               32 (4 pages per slab)
kmalloc-1k                        1024               32 (8 pages per slab)
kmalloc-2k                        2048               16 (8 pages per slab)
kmalloc-4k                        4096                8 (8 pages per slab)
kmalloc-8k                        8192                4 (8 pages per slab)

Given the possibility of fragmentation, this aspect of the kernel's slab implementation seems like a bad design decision to me.
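
The "(N pages per slab)" annotations above follow directly from
rounding entry_size * entries_per_slab up to whole 4K pages. A tiny
standalone illustration of that arithmetic (ordinary user-space C, not
sdb or kernel code):

    /* Recompute a few of the per-slab page counts from the table above. */
    #include <stdio.h>

    int
    main(void)
    {
        const struct { const char *name; unsigned size, per_slab; } c[] = {
            { "kmalloc-96",    96, 42 },   /* 4032 B  -> 1 page  */
            { "kmalloc-256",  256, 32 },   /* 8192 B  -> 2 pages */
            { "kmalloc-1k",  1024, 32 },   /* 32768 B -> 8 pages */
        };
        const unsigned page = 4096;

        for (unsigned i = 0; i < sizeof (c) / sizeof (c[0]); i++) {
            unsigned bytes = c[i].size * c[i].per_slab;
            printf("%-12s %6u bytes -> %u page(s) per slab\n",
                c[i].name, bytes, (bytes + page - 1) / page);
        }
        return (0);
    }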

@behlendorf
Contributor Author

@ahrens yes, upon re-reading what I wrote I clearly left out more than a few important qualifiers. Looking at this again, I wonder if it wouldn't be preferable to simply set the cutoff at 16k for all architectures. Based on @omarkilani's testing any performance impact seems negligible. Thoughts?

@omarkilani

This is probably a dumb question but how does one run the slabs -o name,entry_size,entries_per_slab command?

Googling sdb is about as helpful as you'd expect.

@ahrens
Member

ahrens commented Jun 2, 2021

@omarkilani

I spent most of the day rebuilding kernels and dependencies to get drgn working and then I realised... I could just cat /proc/slabinfo. lol.

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
kmalloc-128k          28     28 131072    4    8
kmalloc-64k          144    144  65536    8    8
kmalloc-32k          272    272  32768   16    8
kmalloc-16k         2944   2944  16384   32    8
kmalloc-8k           608    608   8192   32    4
kmalloc-4k          3072   3072   4096   32    2
kmalloc-2k          3072   3072   2048   32    1
kmalloc-1k         25490  25536   1024   64    1
kmalloc-512        19741  23424    512  128    1
kmalloc-256        20647  21248    256  256    1
kmalloc-128       613203 613376    128  512    1

I did some more tests with pgbench and zpool scrub while monitoring /proc/slabinfo at 16k and 64k.

In both cases nothing above kmalloc-4k gets used. And that only gets used during a zpool scrub. During pgbench nothing above kmalloc-1k gets used.

pgbench averages ~15,500 tps at 64k and 15,000 tps at 16k.

I ran pgbench 10 times under each limit setting and the better performance using 64k is small but reproducible.

IMHO, 16k is probably fine for a default setting if you're worried about fragmentation. I'm not sure it's the "optimal" setting for all workloads, but nothing bad's going to happen and people can just up the limit for Postgres or whatever.

Anything's better than 0 and certain OOM/crash.

Contributor

@tonynguien tonynguien left a comment

Change looks good. Thanks for fixing the issue.

@behlendorf
Contributor Author

I've gone ahead and updated the PR to always use a 16K cutoff. This keeps the behavior the same across platforms and should help mitigate any issues we might otherwise end up seeing with fragmentation.
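
For anyone following along, the end state is simply the module
parameter defaulting to 16K on every architecture, while remaining
tunable at module load time. Roughly (a sketch of the behavior, not
the literal diff):

    /* Sketch: one 16K default for all page sizes, still a tunable. */
    #include <linux/module.h>

    unsigned int spl_kmem_cache_slab_limit = 16384;
    module_param(spl_kmem_cache_slab_limit, uint, 0644);
    MODULE_PARM_DESC(spl_kmem_cache_slab_limit,
        "Objects of this size or smaller use the Linux slab allocator");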

@tonynguien tonynguien added the Status: Accepted label and removed the Status: Code Review Needed label Jun 3, 2021
@tonynguien tonynguien merged commit 7837845 into openzfs:master Jun 3, 2021
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 3, 2021
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take into account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to 16K for all architectures. 

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#12152
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Jun 4, 2021
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 8, 2021
behlendorf added a commit to behlendorf/zfs that referenced this pull request Jun 9, 2021
tonyhutter pushed a commit that referenced this pull request Jun 23, 2021