Scrubbing exhausts all available memory #11574

Closed
arthurfabre opened this issue Feb 6, 2021 · 15 comments
Labels
Status: Triage Needed (new issue which needs to be triaged); Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@arthurfabre

arthurfabre commented Feb 6, 2021

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | Sid |
| Linux Kernel | 5.10.13 |
| Architecture | ppc64le |
| ZFS Version | 2.0.2 |
| SPL Version | 2.0.2 |

I can also reproduce this using:

  • kernel 5.10.13 with ZFS 2.0.1
  • kernel 5.9.11 with ZFS 2.0.1, 2.0.2

But not with ZFS 0.8.6. I can't reproduce it at all on a similar x86 system.

Describe the problem you're observing

When scrubbing a pool (4-drive raidz2), memory usage rises until all system memory is exhausted and the kernel panics.
If the scrub is stopped before the kernel panics (zpool scrub -s), memory usage drops back to the same level as before the scrub was started.

Describe how to reproduce the problem

This script reproduces the problem:

#!/bin/bash

function dump {
    free -m > free."$1".txt
    cat  /proc/spl/kmem/slab > spl-slab."$1".txt
    sudo slabtop -o > slabtop."$1".txt
    sudo cat /proc/slabinfo > slabinfo."$1".txt
    cat /proc/meminfo > meminfo."$1".txt
}

dump before
sudo zpool scrub data
sleep 30
dump during
sudo zpool scrub -s data
dump after

The used memory increases from 8GB to 72GB in 30 seconds, and returns to 8GB after the scrub is stopped. vmalloc seems responsible for the majority of this:

VmallocUsed
Before 2.4 GB (2389248 KB)
During 68 GB (68183296 KB)
After 2.4 GB (2408192 KB)
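
The comparison above can be pulled straight out of the snapshot files written by `dump`. This is a minimal sketch, assuming the `meminfo.<phase>.txt` files are in the current directory; the helper names `meminfo_field` and `summarise_field` are made up for illustration and are not part of any ZFS tooling.

```shell
# Print one /proc/meminfo field (value in kB) from a saved snapshot file.
# Example: meminfo_field VmallocUsed meminfo.before.txt
meminfo_field() {
    awk -v key="$1:" '$1 == key { print $2 }' "$2"
}

# Print the same field for each snapshot phase that exists on disk.
summarise_field() {
    local field="$1" phase kb
    for phase in before during after; do
        [ -r "meminfo.$phase.txt" ] || continue
        kb=$(meminfo_field "$field" "meminfo.$phase.txt")
        printf '%-7s %s kB\n' "$phase" "$kb"
    done
}
```

Running `summarise_field VmallocUsed` and then `summarise_field SUnreclaim` gives a quick side-by-side of which counter actually grows during the scrub.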

meminfo.before.txt
slabinfo.before.txt
slabtop.before.txt
spl-slab.before.txt
free.before.txt

meminfo.during.txt
slabinfo.during.txt
slabtop.during.txt
spl-slab.during.txt
free.during.txt

meminfo.after.txt
slabinfo.after.txt
slabtop.after.txt
spl-slab.after.txt
free.after.txt

Include any warning/errors/backtraces from the system logs

Last kernel logs (including OOM killer running) before kernel panics (unfortunately the panic does not get logged to disk):
oom.txt

arthurfabre added the "Status: Triage Needed" and "Type: Defect" labels Feb 6, 2021
@arthurfabre
Author

arthurfabre commented Feb 7, 2021

Updated the script to collect more info:

    cat /proc/spl/kstat/zfs/dbufstats > dbufstats."$1".txt
    sudo arc_summary > arc_summary."$1".txt
    sudo arcstat > arcstat."$1".txt
    sudo dbufstat > dbufstat."$1".txt

arcstat.before.txt
arc_summary.before.txt
dbufstats.before.txt
dbufstat.before.txt

arcstat.during.txt
arc_summary.during.txt
dbufstats.during.txt
dbufstat.during.txt

arcstat.after.txt
arc_summary.after.txt
dbufstats.after.txt
dbufstat.after.txt

Nothing particularly stands out to me aside from a few tunables being set to 2^64-1:

        dbuf_cache_max_bytes                        18446744073709551615
        dbuf_metadata_cache_max_bytes               18446744073709551615
        zfs_async_block_max_blocks                  18446744073709551615

But the x86 system I can't reproduce the problem on has the same values.

@matclayton

This seems linked to a bug we filed, #11429, in case we can help at all.

@arthurfabre
Author

#11429 reports an increase in SUnreclaim while scrubbing, but I don't see that in this case:

SUnreclaim
Before 1.3 GB (1302208 KB)
During 1.4 GB (1405184 KB)
After 1.3 GB (1323456 KB)

Compared with VmallocUsed, which shows the problem:

VmallocUsed
Before 2.4 GB (2389248 KB)
During 68 GB (68183296 KB)
After 2.4 GB (2408192 KB)

I also can't reproduce this issue with ZFS & SPL 0.8.6, only on 2.0.1 / 2.0.2 (I haven't tested 2.0.0). Since #11429 is against 0.8.3, I think it's a different issue.

@anssite

anssite commented Mar 16, 2021

Same problem with Debian Buster and zfs 2.0.3-1~bpo10+1

Tested with the latest buster 4.19 kernel and also with the latest buster backports 5.10 kernel.

| | Before | During | After |
| --- | --- | --- | --- |
| SUnreclaim | 0.9 GB | 1.8 GB | 0.9 GB |
| VmallocUsed | 4.09 GB | 41.7 GB | 4.4 GB |

At first the used memory increases fast, then settles at about those numbers, but keeps growing slowly. Scrub speed seems to be a lot faster than on Debian Stretch with the 4.9 kernel and zfs 0.7.12.

With zfs 2.0.3 on Debian 10 the scrub speed is reported at about 5G/s with SAS HDDs, and the scrub seems to finish in a few hours. With Debian Stretch, the 4.9 kernel, zfs 0.7.12 and the same kind of hardware, the scrub speed is about 2.9M/s and the scrub takes days to finish, but there are no problems with exhausted memory.

@arthurfabre
Author

@anssite What architecture are you seeing this on?

I imported my pool on an x86 system (same zfs and kernel version) over the weekend and was unable to reproduce the problem. But it happens without fail on ppc64.

/proc/vmallocinfo seems to suggest the culprit is spl_cache_grow_work.
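
One way to arrive at that kind of conclusion is to total the vmalloc areas per calling function. The sketch below assumes the usual `/proc/vmallocinfo` line layout (address range, size in bytes, then `caller+0xoffset/0xlength`); `vmalloc_by_caller` is a hypothetical helper name, not an existing tool.

```shell
# Sum vmalloc bytes per calling function from a vmallocinfo dump
# (file arguments or stdin) and list the largest consumers first.
vmalloc_by_caller() {
    awk '{
        caller = $3
        sub(/\+0x.*/, "", caller)   # strip the +0xoffset/0xlength suffix
        bytes[caller] += $2         # second column is the area size in bytes
    }
    END {
        for (c in bytes) printf "%.0f %s\n", bytes[c], c
    }' "$@" | sort -rn
}
```

For example, `sudo cat /proc/vmallocinfo | vmalloc_by_caller | head` run during a scrub should show which caller owns most of the vmalloc space.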

One related setting that differs between ppc64 and x86 is spl_kmem_cache_slab_limit:

#if PAGE_SIZE == 4096
unsigned int spl_kmem_cache_slab_limit = 16384;
#else
unsigned int spl_kmem_cache_slab_limit = 0;
#endif

Setting it to 16384 on ppc64 helps a lot: I can now run a full scrub (~7 hours). Without it set, I run out of memory in less than a minute. But I don't know if it's the actual culprit or just a red herring that causes most of the memory to be allocated differently, masking the real issue.
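
The thread does not say how the value was set. A minimal sketch of one common approach, assuming the parameter is applied as an `spl` module option at load time; the helper names and the `/etc/modprobe.d/spl.conf` path in the usage note are illustrative assumptions.

```shell
# Mirror of the #if above: the compiled-in default for a given page size.
default_slab_limit() {
    if [ "$1" -eq 4096 ]; then echo 16384; else echo 0; fi
}

# Emit a modprobe.d line that applies the limit when the spl module loads.
slab_limit_option() {
    printf 'options spl spl_kmem_cache_slab_limit=%d\n' "$1"
}
```

`default_slab_limit "$(getconf PAGESIZE)"` shows whether a machine falls into the `0` branch. `slab_limit_option 16384 | sudo tee /etc/modprobe.d/spl.conf` would persist the workaround; the module then has to be reloaded (or the system rebooted, with the initramfs regenerated if `spl` is loaded from it) before the new value applies.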

@anssite

anssite commented Mar 16, 2021

@arthurfabre I'm running x86.

This pool was previously running on Debian 10 with zfs 0.8.6. ZFS was then upgraded to 2.0.3, and the pool was imported and upgraded with zpool upgrade. The same problem already existed before the zpool upgrade, though.

Edit: I don't have the hardware to test this on a freshly created pool, so it might be that this is relevant only to pools that were upgraded from 0.8.6 to 2.0.x.

@sideeffect42

@arthurfabre Does setting zfs_scan_legacy=1 help in your case?

@anssite

anssite commented Mar 17, 2021

@sideeffect42

It seems that zfs_scan_legacy=1 helps. Scrub speed is reduced to about 10M/s and memory consumption is minimal.

VmallocUsed value during scrub 1.8GB
SUnreclaim value during scrub 281MB

So I guess that the new scan mode is really heavy on resources. Too bad, as it seems to be a lot faster than the old mode.
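
For anyone wanting to repeat this test, a minimal sketch of toggling the setting at runtime, assuming ZFS module parameters are exposed under `/sys/module/zfs/parameters`; `set_zfs_param` is a hypothetical helper and the optional third argument exists only so the function can be tried against a scratch directory.

```shell
# Write a value to a ZFS module parameter file under sysfs.
# Usage: set_zfs_param NAME VALUE [PARAM_DIR]
set_zfs_param() {
    local name="$1" value="$2" dir="${3:-/sys/module/zfs/parameters}"
    if [ ! -w "$dir/$name" ]; then
        echo "cannot write $dir/$name (module not loaded, or not root?)" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$dir/$name"
}
```

Run as root, `set_zfs_param zfs_scan_legacy 1` followed by `zpool scrub data` should start the scrub in the legacy mode; setting it before the scrub starts is the safer assumption, since an already running scrub may keep its current mode.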

@sideeffect42

@anssite I just tested scrubbing with spl_kmem_cache_slab_limit=16384. It scrubs my modest pool of 160GB with ca. 4 GB of additional memory usage.
The new scan mode is faster, undoubtedly (0:22:48 vs. 0:34:43).

4 GB of memory should not be an issue on a modern server (but could be annoying for people using ZFS on their laptops).

To me this looks like a bug in SPL's slab allocator.
As a workaround, one could try to determine a proper spl_kmem_cache_slab_limit value for 64k pages.

Maybe @behlendorf could help here.

@arthurfabre
Author

Thanks for the suggestion @sideeffect42 , I'll check with zfs_scan_legacy=1.

To avoid any confusion, I think there are two different issues here:

  • Higher memory usage during scrubbing reported by @anssite. This doesn't use all memory or trigger OOM / kernel panics. Seems to be explained by a new scan mode.

  • Scrubbing using up all available memory (128GB total in my case), triggering the OOM killer and causing a kernel panic ~1 minute after scrubbing starts. I can reproduce this on ppc64, but not x86 (using the same pool and zfs + kernel versions).
    spl_kmem_cache_slab_limit=16384 seems to work around the issue, but it may just be less noticeable.

    @sideeffect42 it sounds like you don't see the issue with spl_kmem_cache_slab_limit=0, what architecture are you using?

@sideeffect42

sideeffect42 commented Mar 17, 2021

@arthurfabre I do see the issue (memory completely exhausted) with spl_kmem_cache_slab_limit=0 (the default on ppc64el with 64k pages).

If I manually set spl_kmem_cache_slab_limit=16384 the "new" scrub uses ~4GB memory and completes (the machine has 64GB in total).

@delroth

delroth commented Mar 21, 2021

I've been hitting this issue as well on a system that has > 4K page sizes (64K, ARMv8, in my specific case). Can confirm setting spl_kmem_cache_slab_limit=16384 fixed the problem. Sounds like the #if PAGE_SIZE == 4096 logic there might need some rethinking?

@eglaysher

I've hit this same issue on ppc64le, and can confirm that setting spl_kmem_cache_slab_limit=16384 does work around the system crash.

@sideeffect42

What makes me curious is the commit message of f2297b5:

A cutoff of 16K was determined to be optimal for architectures using 4K pages.

I wonder how this was determined, so that the values for other common page sizes could also be added to the code.
I tried to ping @behlendorf above, but he didn't seem to notice.

@arthurfabre
Author

I think setting spl_kmem_cache_slab_limit to 16384 just masks the actual bug. On x86, spl_kmem_cache_slab_limit=0 works fine (I tested with the same pool, kernel and zfs versions). There's something else going on, but I haven't had time to look further.

behlendorf added a commit to behlendorf/zfs that referenced this issue May 29, 2021
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take in to account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to PAGE_SIZE on systems using larger pages. Since 16,384 bytes
was experimentally determined to yield the best performance on
4K page systems this is used as the cutoff. This means on 4K
page systems there is no functional change.

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
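
The page-alignment waste described in the commit message can be illustrated with plain arithmetic. This is a sketch only: the object size and object count below are hypothetical inputs, and the functions do not reproduce how SPL actually sizes its slabs.

```shell
# Round a byte count up to a whole number of pages.
round_up_pages() {
    local bytes="$1" page="$2"
    echo $(( (bytes + page - 1) / page * page ))
}

# Bytes lost to page rounding when one allocation holds
# `count` objects of `size` bytes each.
rounding_waste() {
    local size="$1" count="$2" page="$3"
    local used=$(( size * count ))
    echo $(( $(round_up_pages "$used" "$page") - used ))
}
```

With 16 objects of 512 bytes (8192 bytes of payload), `rounding_waste 512 16 4096` prints `0`, while `rounding_waste 512 16 65536` prints `57344`: the same request occupies a full 64K page and leaves seven eighths of it unused.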
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 2, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 3, 2021
For small objects the kernel's slab implementation is very fast and
space efficient. However, as the allocation size increases to
require multiple pages performance suffers. The SPL kmem cache
allocator was designed to better handle these large allocation
sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers
prefer to use the kernel's slab allocator for small objects and
the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using
a non-4K page size which caused all kmem caches to only use the
SPL implementation. Functionally this is fine, but the SPL code
which calculates the target number of objects per-slab does not
take in to account that __vmalloc() always returns page-aligned
memory. This can result in a massive amount of wasted space when
allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff
to 16K for all architectures. 

This particular change does not attempt to update the logic used
to calculate the optimal number of pages per slab. This remains
an issue which should be addressed in a future change.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#12152
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jun 4, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 8, 2021
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 9, 2021
tonyhutter pushed a commit that referenced this issue Jun 23, 2021
sideeffect42 added a commit to riiengineering/skonfig-extra that referenced this issue Mar 19, 2023
sideeffect42 added a commit to riiengineering/skonfig-extra that referenced this issue Mar 20, 2023
sideeffect42 added a commit to riiengineering/skonfig-extra that referenced this issue Mar 21, 2023