Scrubbing exhausts all available memory #11574

arthurfabre · 2021-02-06T18:15:53Z

System information

Type	Version/Name
Distribution Name	Debian
Distribution Version	Sid
Linux Kernel	5.10.13
Architecture	ppc64le
ZFS Version	2.0.2
SPL Version	2.0.2

I can also reproduce this using:

kernel 5.10.13 with ZFS 2.0.1
kernel 5.9.11 with ZFS 2.0.1, 2.0.2

But not with ZFS 0.8.6. I can't reproduce it at all on a similar x86 system.

Describe the problem you're observing

When scrubbing a dataset (4 drive raidz2) memory usage rises until all system memory is exhausted, and the kernel panics.
If the scrub is stopped before the kernel panics (zpool scrub -s), memory usage drops back to the same level as before the scrub was started.

Describe how to reproduce the problem

This script reproduces the problem:

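#!/bin/bash

# Snapshot memory and slab statistics into files tagged with "$1".
function dump {
    free -m > free."$1".txt
    cat /proc/spl/kmem/slab > spl-slab."$1".txt
    sudo slabtop -o > slabtop."$1".txt
    sudo cat /proc/slabinfo > slabinfo."$1".txt
    cat /proc/meminfo > meminfo."$1".txt
}

dump before
sudo zpool scrub data
sleep 30                    # let the scrub run long enough for memory usage to grow
dump during
sudo zpool scrub -s data    # stop the scrub before memory is exhausted
dump after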

The used memory increases from 8GB to 72GB in 30 seconds, and returns to 8GB after the scrub is stopped. vmalloc seems responsible for the majority of this:

	VmallocUsed
Before	2.4 GB (2389248 KB)
During	68 GB (68183296 KB)
After	2.4 GB (2408192 KB)
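For anyone who wants to pull the same figure out of their own dumps, here is a minimal sketch. It assumes the meminfo.before.txt, meminfo.during.txt and meminfo.after.txt files written by the script above are in the current directory; the function names are made up for illustration. Note that it reports binary GiB, so the numbers will look slightly smaller than the rounded GB figures in the table.

```bash
#!/bin/bash
# Print the VmallocUsed value (in kB) from a saved copy of /proc/meminfo.
vmalloc_used_kb() {
    awk '$1 == "VmallocUsed:" { print $2 }' "$1"
}

# Convert kB to GiB with one decimal place.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1024 / 1024 }'
}

# Report VmallocUsed for each phase dumped by the reproduction script.
report_vmalloc() {
    local phase kb
    for phase in before during after; do
        kb=$(vmalloc_used_kb "meminfo.$phase.txt")
        printf '%s\t%s kB (%s GiB)\n' "$phase" "$kb" "$(kb_to_gib "$kb")"
    done
}
```

Calling report_vmalloc in the directory holding the dumps prints one line per phase.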

meminfo.before.txt
slabinfo.before.txt
slabtop.before.txt
spl-slab.before.txt
free.before.txt

meminfo.during.txt
slabinfo.during.txt
slabtop.during.txt
spl-slab.during.txt
free.during.txt

meminfo.after.txt
slabinfo.after.txt
slabtop.after.txt
spl-slab.after.txt
free.after.txt

Include any warning/errors/backtraces from the system logs

Last kernel logs (including OOM killer running) before kernel panics (unfortunately the panic does not get logged to disk):
oom.txt

arthurfabre · 2021-02-07T09:35:50Z

Updated the script to collect more info:

    cat /proc/spl/kstat/zfs/dbufstats > dbufstats."$1".txt
    sudo arc_summary > arc_summary."$1".txt
    sudo arcstat > arcstat."$1".txt
    sudo dbufstat > dbufstat."$1".txt

arcstat.before.txt
arc_summary.before.txt
dbufstats.before.txt
dbufstat.before.txt

arcstat.during.txt
arc_summary.during.txt
dbufstats.during.txt
dbufstat.during.txt

arcstat.after.txt
arc_summary.after.txt
dbufstats.after.txt
dbufstat.after.txt

Nothing particularly stands out to me aside from a few tunables being set to 2^64-1:

        dbuf_cache_max_bytes                        18446744073709551615
        dbuf_metadata_cache_max_bytes               18446744073709551615
        zfs_async_block_max_blocks                  18446744073709551615

But the x86 system I can't reproduce the problem on has the same values.

matclayton · 2021-02-07T17:28:01Z

This seems linked to a bug we filed, #11429, in case we can help at all.

arthurfabre · 2021-02-07T22:06:59Z

#11429 reports an increase in SUnreclaim while scrubbing, but I don't see that in this case:

	SUnreclaim
Before	1.3 GB (1302208 KB)
During	1.4 GB (1405184 KB)
After	1.3 GB (1323456 KB)

Compared with VmallocUsed, which shows the problem:

	VmallocUsed
Before	2.4 GB (2389248 KB)
During	68 GB (68183296 KB)
After	2.4 GB (2408192 KB)

I also can't reproduce this issue with ZFS & SPL 0.8.6, only on 2.0.1 / 2.0.2 (I haven't tested 2.0.0). Since #11429 is against 0.8.3, I think it's a different issue.

anssite · 2021-03-16T08:34:20Z

Same problem with Debian Buster and zfs 2.0.3-1~bpo10+1

Tested with the latest buster 4.19 kernel and also with the latest buster backports 5.10 kernel.

	Before	During	After
SUnreclaim	0.9 GB	1.8 GB	0.9 GB
VmallocUsed	4.09 GB	41.7 GB	4.4 GB

At first the used memory increases fast, then settles at about those numbers, but keeps growing slowly. Scrub speed seems to be a lot faster than with Debian Stretch, the 4.9 kernel and zfs 0.7.12.

With zfs 2.0.3 and D10 the scrub speed is reported at about 5G/s with HDD SAS disks, and the scrub seems to finish in a few hours. With Debian Stretch, the 4.9 kernel, zfs 0.7.12 and the same kind of hardware, the scrub speed is about 2.9M/s and the scrub takes days to finish, but there are no problems with exhausted memory.

arthurfabre · 2021-03-16T09:14:15Z

@anssite What architecture are you seeing this on?

I imported my pool on an x86 system (same zfs and kernel version) over the weekend and was unable to reproduce the problem. But it happens without fail on ppc64.

/proc/vmallocinfo seems to suggest the culprit is spl_cache_grow_work.
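A rough way to check which caller owns most of the vmalloc space is to sum the sizes in /proc/vmallocinfo per calling function. The sketch below assumes the usual line layout (address range, size in bytes, then caller+offset) and works on a saved copy of the file, e.g. one taken with `sudo cat /proc/vmallocinfo > vmallocinfo.during.txt`; that file name and the function name are only illustrative.

```bash
#!/bin/bash
# Sum vmalloc'd bytes per calling function from a /proc/vmallocinfo dump.
# Assumed layout per line: <range> <size-in-bytes> <caller+offset/length> ...
vmalloc_by_caller() {
    awk '
        NF >= 3 {
            caller = $3
            sub(/\+.*/, "", caller)   # drop "+0x1c/0x40" so all call sites merge
            bytes[caller] += $2
        }
        END {
            for (c in bytes) printf "%.0f %s\n", bytes[c], c
        }
    ' "$1" | sort -rn
}
```

Running `vmalloc_by_caller vmallocinfo.during.txt | head` lists the largest consumers first; in the failing case I would expect spl_cache_grow_work near the top.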

One related setting that differs between ppc64 and x86 is spl_kmem_cache_slab_limit:

zfs/module/os/linux/spl/spl-kmem-cache.c, lines 105 to 109 in d0249a4:

    #if PAGE_SIZE == 4096
    unsigned int spl_kmem_cache_slab_limit = 16384;
    #else
    unsigned int spl_kmem_cache_slab_limit = 0;
    #endif

Setting it to 16384 on ppc64 helps a lot: I can now run a full scrub (~7 hours). Without it set, I run out of memory in less than a minute. But I don't know if it's the actual culprit or just a red herring that causes most of the memory to be allocated differently, masking the real issue.
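For reference, one way to apply this setting persistently is a modprobe.d option for the spl module. This is only a sketch: the helper names are made up, and the target file name /etc/modprobe.d/spl.conf used below is an assumption (any .conf file in that directory should do).

```bash
#!/bin/bash
# Emit a modprobe.d line that sets spl_kmem_cache_slab_limit at module load.
# Defaults to 16384, the value used on 4K-page systems.
spl_slab_limit_option() {
    local limit=${1:-16384}
    printf 'options spl spl_kmem_cache_slab_limit=%s\n' "$limit"
}

# Print the value the loaded spl module currently reports, if it is readable.
# The path argument exists only so the function can be pointed at a test file.
current_spl_slab_limit() {
    local param=${1:-/sys/module/spl/parameters/spl_kmem_cache_slab_limit}
    if [ -r "$param" ]; then
        cat "$param"
    else
        return 1
    fi
}
```

Usage would be along the lines of `spl_slab_limit_option 16384 | sudo tee /etc/modprobe.d/spl.conf`, followed by reloading the modules or rebooting. If spl is loaded from the initramfs, the initramfs may need regenerating as well (update-initramfs -u on Debian).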

anssite · 2021-03-16T09:37:34Z

@arthurfabre I'm running x86.

This pool previously ran on D10 with zfs 0.8.6; zfs was then upgraded to 2.0.3, and the pool was imported and upgraded with zpool upgrade. The same problem already existed before the zpool upgrade, though.

Edit: I don't have hardware to test this on a freshly created pool, so it might be that this is relevant only to pools that were upgraded from 0.8.6 to 2.0.x.

sideeffect42 · 2021-03-16T20:18:07Z

@arthurfabre Does setting zfs_scan_legacy=1 help in your case?

anssite · 2021-03-17T06:25:14Z

@sideeffect42

It seems that zfs_scan_legacy=1 helps: scrub speed drops to about 10M/s and memory consumption is minimal.

VmallocUsed value during scrub 1.8GB
SUnreclaim value during scrub 281MB

So I guess that the new scan mode is really heavy on resources. Too bad, as it seems to be a lot faster than the old mode.
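For anyone else wanting to try this, zfs module parameters such as zfs_scan_legacy can be changed at runtime through /sys/module/zfs/parameters. A small sketch follows; the helper name is made up, and the optional third argument exists only so it can be exercised against a scratch directory.

```bash
#!/bin/bash
# Write a value to a zfs module parameter at runtime.
# On a real system this needs root; $3 overrides the parameter directory.
set_zfs_param() {
    local name=$1 value=$2
    local dir=${3:-/sys/module/zfs/parameters}
    if [ ! -w "$dir/$name" ]; then
        echo "cannot write $dir/$name" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$dir/$name"
}
```

As root, `set_zfs_param zfs_scan_legacy 1` switches to the legacy scan mode and `set_zfs_param zfs_scan_legacy 0` switches back. I would expect the change to apply to scrubs started afterwards, so stop and restart a running scrub to be sure.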

sideeffect42 · 2021-03-17T09:40:52Z

@anssite I just tested scrubbing with spl_kmem_cache_slab_limit=16384. It scrubs my modest pool of 160GB with ca. 4 GB of additional memory usage.
The new scan mode is faster, undoubtedly (0:22:48 vs. 0:34:43).

4 GB of memory should not be an issue on a modern server (but could be annoying for people using ZFS on their laptops).

To me this looks like a bug in SPL's slab allocator.
As a workaround, one could try to determine a proper spl_kmem_cache_slab_limit value for 64k pages.

Maybe @behlendorf could help here.

arthurfabre · 2021-03-17T10:20:28Z

Thanks for the suggestion @sideeffect42 , I'll check with zfs_scan_legacy=1.

To avoid any confusion, I think there are two different issues here:

1. Higher memory usage during scrubbing, reported by @anssite. This doesn't use all memory or trigger OOM / kernel panics, and seems to be explained by the new scan mode.
2. Scrubbing using up all available memory (128GB total in my case), triggering the OOM killer and causing a kernel panic ~1 minute after scrubbing starts. I can reproduce this on ppc64, but not x86 (using the same pool and zfs + kernel versions). spl_kmem_cache_slab_limit=16384 seems to work around the issue, but it may just make it less noticeable.

@sideeffect42 it sounds like you don't see the issue with spl_kmem_cache_slab_limit=0, what architecture are you using?

sideeffect42 · 2021-03-17T13:12:41Z

@arthurfabre I do see the issue (memory completely exhausted) with spl_kmem_cache_slab_limit=0 (the default on ppc64el with 64k pages).

If I manually set spl_kmem_cache_slab_limit=16384 the "new" scrub uses ~4GB memory and completes (the machine has 64GB in total).

delroth · 2021-03-21T03:56:31Z

I've been hitting this issue as well on a system that has > 4K page sizes (64K, ARMv8, in my specific case). Can confirm setting spl_kmem_cache_slab_limit=16384 fixed the problem. Sounds like the #if PAGE_SIZE == 4096 logic there might need some rethinking?

eglaysher · 2021-05-07T21:53:39Z

I've hit this same issue on ppc64le, and can confirm that setting spl_kmem_cache_slab_limit=16384 does work around the system crash.

sideeffect42 · 2021-05-10T11:31:02Z

What makes me curious is the commit message of f2297b5:

A cutoff of 16K was determined to be optimal for architectures using 4K pages.

I wonder how this was determined, so that the values for other common page sizes could also be added to the code.
I tried to ping @behlendorf above, but he didn't seem to notice.

arthurfabre · 2021-05-10T11:40:37Z

I think setting spl_kmem_cache_slab_limit to 16384 just masks the actual bug. On x86, spl_kmem_cache_slab_limit=0 works fine (I tested with the same pool, kernel and zfs versions). There's something else going on, but I haven't had time to look further.

For small objects the kernel's slab implementation is very fast and space efficient. However, as the allocation size increases to require multiple pages performance suffers. The SPL kmem cache allocator was designed to better handle these large allocation sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers prefer to use the kernel's slab allocator for small objects and the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using a non-4K page size which caused all kmem caches to only use the SPL implementation. Functionally this is fine, but the SPL code which calculates the target number of objects per-slab does not take into account that __vmalloc() always returns page-aligned memory. This can result in a massive amount of wasted space when allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff to PAGE_SIZE on systems using larger pages. Since 16,384 bytes was experimentally determined to yield the best performance on 4K page systems this is used as the cutoff. This means on 4K page systems there is no functional change.

This particular change does not attempt to update the logic used to calculate the optimal number of pages per slab. This remains an issue which should be addressed in a future change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150

For small objects the kernel's slab implementation is very fast and space efficient. However, as the allocation size increases to require multiple pages performance suffers. The SPL kmem cache allocator was designed to better handle these large allocation sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers prefer to use the kernel's slab allocator for small objects and the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using a non-4K page size which caused all kmem caches to only use the SPL implementation. Functionally this is fine, but the SPL code which calculates the target number of objects per-slab does not take in to account that __vmalloc() always returns page-aligned memory. This can result in a massive amount of wasted space when allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff to 16K for all architectures.

This particular change does not attempt to update the logic used to calculate the optimal number of pages per slab. This remains an issue which should be addressed in a future change.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#12152
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
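To get a feel for the page-alignment effect the commit message describes, here is a back-of-the-envelope sketch. It does not reproduce SPL's actual objects-per-slab calculation; it only rounds a hypothetical request up to a whole number of pages, which is the granularity of page-aligned memory. The function names are made up for illustration.

```bash
#!/bin/bash
# Round a byte count up to the next multiple of the page size.
page_round_up() {
    local bytes=$1 page=$2
    echo $(( (bytes + page - 1) / page * page ))
}

# Bytes lost to page alignment for a single request of the given size.
alignment_waste() {
    local bytes=$1 page=$2
    echo $(( $(page_round_up "$bytes" "$page") - bytes ))
}

# Compare the same hypothetical request on 4K and 64K pages.
compare_pages() {
    local bytes=$1 page
    for page in 4096 65536; do
        printf 'page=%s request=%s allocated=%s wasted=%s\n' \
            "$page" "$bytes" "$(page_round_up "$bytes" "$page")" \
            "$(alignment_waste "$bytes" "$page")"
    done
}
```

For example, `compare_pages 512` shows a 512-byte request occupying one 4096-byte page on a 4K system but a full 65536-byte page on a 64K system, so the same small request strands sixteen times as much memory.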

openzfs/zfs#11574

arthurfabre added the Status: Triage Needed and Type: Defect labels Feb 6, 2021

sideeffect42 mentioned this issue Mar 16, 2021

zpool scrub leaks memory, leading to OOM within seconds #11756

Closed

behlendorf mentioned this issue May 28, 2021

[arm64] scrub uses all machine memory and locks up the machine #12150

Closed

behlendorf mentioned this issue May 29, 2021

Linux: Set spl_kmem_cache_slab_limit when page size !4K #12152

Merged

tonynguien closed this as completed in 7837845 Jun 3, 2021

dglidden mentioned this issue Jun 10, 2022

OOM on zfs scrub #13546

Closed

Rid mentioned this issue Jan 8, 2023

Random kernel BUG at mm/usercopy.c:99 from SLUB object 'zio_buf_comb_16384' #12543

Open

sideeffect42 added a commit to riiengineering/skonfig-extra that referenced this issue Mar 19, 2023

[type/__ssrq_zfs] Add workaround for bug 11574

fd43113

openzfs/zfs#11574

sideeffect42 added a commit to riiengineering/skonfig-extra that referenced this issue Mar 20, 2023

[type/__ssrq_zfs] Add workaround for bug 11574

ab62cd9

openzfs/zfs#11574

sideeffect42 added a commit to riiengineering/skonfig-extra that referenced this issue Mar 21, 2023

[type/__ssrq_zfs] Add workaround for bug 11574

58216f5

openzfs/zfs#11574

arthurfabre mentioned this issue Apr 18, 2024

kernel NULL pointer dereference spl_kmem_cache_alloc+0x2c #16109

Open
