Scrubbing exhausts all available memory #11574
Updated the script to collect more info:
arcstat.before.txt
arcstat.during.txt
arcstat.after.txt
Nothing particularly stands out to me aside from a few tunables being set to
But the x86 system I can't reproduce the problem on has the same values. |
This seems linked to a bug we filed, #11429, in case we can help at all. |
#11429 reports an increase in
Compared with
I also can't reproduce this issue with ZFS & SPL 0.8.6, only on 2.0.1 / 2.0.2 (I haven't tested 2.0.0). But since #11429 is against 0.8.3, I think it's a different issue. |
Same problem with Debian Buster and zfs 2.0.3-1~bpo10+1. Tested with the latest buster 4.19 kernel and also with the latest buster-backports 5.10 kernel.
At first the used memory increases fast, then settles at about those numbers, but keeps growing slowly. Scrub speed seems to be a lot faster compared to Debian Stretch, a 4.9 kernel and zfs 0.7.12. With zfs 2.0.3 and D10 the scrub speed is reported at about 5G/s with SAS HDDs and the scrub seems to finish in a few hours. With Debian Stretch, a 4.9 kernel, zfs 0.7.12 and the same kind of hardware, the scrub speed is about 2.9M/s and it takes days to finish, but there are no problems with exhausted memory. |
@anssite What architecture are you seeing this on? I imported my pool on an x86 system (same zfs and kernel version) over the weekend and was unable to reproduce the problem. But it happens without fail on ppc64.
One related setting that differs between ppc64 and x86 is spl_kmem_cache_slab_limit (zfs/module/os/linux/spl/spl-kmem-cache.c, lines 105 to 109 in d0249a4).
Setting it to 16384 on ppc64 helps a lot: I can now run a full scrub (~7 hours). Without it set, I run out of memory in less than a minute. But I don't know if it's the actual culprit or just a red herring that causes most of the memory to be allocated differently, masking the real issue. |
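For anyone wanting to try this workaround persistently, the cutoff can be set via a module option (a sketch, not an endorsed fix; 16384 is the value from the comment above, and the same parameter should also be writable at runtime under /sys/module/spl/parameters/):

```
# /etc/modprobe.d/spl.conf
options spl spl_kmem_cache_slab_limit=16384
```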
@arthurfabre I'm running x86. This is a pool that was previously running on D10 and zfs 0.8.6; zfs was then upgraded to 2.0.3 and the pool imported and upgraded with zpool upgrade, although the same problem existed before zpool upgrade. Edit: I don't have hardware to test this on a freshly installed pool, so it might be that this is relevant only for pools that were upgraded from 0.8.6 to 2.0.x |
@arthurfabre Does setting |
It seems that zfs_scan_legacy=1 will help. Scrub speed is reduced to about 10M/s and memory consumption is minimal (VmallocUsed during the scrub is about 1.8GB). So I guess that the new scan mode is really heavy on resources. Too bad, as it seems to be a lot faster than the old mode. |
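For reference, the legacy scan mode mentioned above can be toggled like any other zfs module parameter (a sketch; the runtime write takes effect for the next scrub):

```
# Runtime:
#   echo 1 > /sys/module/zfs/parameters/zfs_scan_legacy

# Persistent, via /etc/modprobe.d/zfs.conf:
options zfs zfs_scan_legacy=1
```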
@anssite I just tested scrubbing with 4 GB of memory should not be an issue on a modern server (but could be annoying for people using ZFS on their laptops). To me this looks like a bug in SPL's slab allocator. Maybe @behlendorf could help here. |
Thanks for the suggestion @sideeffect42, I'll check with it. To avoid any confusion, I think there are two different issues here:
|
@sideeffect42 I do see the issue (memory completely exhausted) with If I manually set |
I've been hitting this issue as well on a system that has > 4K page sizes (64K, ARMv8, in my specific case). Can confirm setting |
I've hit this same issue on ppc64le, and can confirm that setting |
What makes me curious is the commit message of f2297b5:
I wonder how this was determined, so that the values for other common page sizes could also be added to the code. |
I think setting |
For small objects the kernel's slab implementation is very fast and space efficient. However, as the allocation size increases to require multiple pages performance suffers. The SPL kmem cache allocator was designed to better handle these large allocation sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers prefer to use the kernel's slab allocator for small objects and the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using a non-4K page size which caused all kmem caches to only use the SPL implementation. Functionally this is fine, but the SPL code which calculates the target number of objects per-slab does not take into account that __vmalloc() always returns page-aligned memory. This can result in a massive amount of wasted space when allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff to PAGE_SIZE on systems using larger pages. Since 16,384 bytes was experimentally determined to yield the best performance on 4K page systems this is used as the cutoff. This means on 4K page systems there is no functional change.

This particular change does not attempt to update the logic used to calculate the optimal number of pages per slab. This remains an issue which should be addressed in a future change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
For small objects the kernel's slab implementation is very fast and space efficient. However, as the allocation size increases to require multiple pages performance suffers. The SPL kmem cache allocator was designed to better handle these large allocation sizes. Therefore, on Linux the kmem_cache_* compatibility wrappers prefer to use the kernel's slab allocator for small objects and the custom SPL kmem cache allocator for larger objects.

This logic was effectively disabled for all architectures using a non-4K page size which caused all kmem caches to only use the SPL implementation. Functionally this is fine, but the SPL code which calculates the target number of objects per-slab does not take into account that __vmalloc() always returns page-aligned memory. This can result in a massive amount of wasted space when allocating tiny objects on a platform using large pages (64k).

To resolve this issue we set the spl_kmem_cache_slab_limit cutoff to 16K for all architectures.

This particular change does not attempt to update the logic used to calculate the optimal number of pages per slab. This remains an issue which should be addressed in a future change.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#12152
Closes openzfs#11429
Closes openzfs#11574
Closes openzfs#12150
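As a back-of-the-envelope illustration of the waste described in the commit message (the object size and objects-per-slab count below are made-up numbers, not values taken from the SPL code): a slab sized for a handful of tiny objects still gets rounded up to a whole page by __vmalloc(), so on 64K pages nearly the entire allocation is wasted.

```shell
#!/bin/sh
# Hypothetical slab: 16 objects of 96 bytes each.
obj=96
objs=16
need=$((obj * objs))   # bytes the slab actually uses (1536)

for page in 4096 65536; do
    # __vmalloc() hands back whole pages, so round the slab up to a page multiple.
    alloc=$(( (need + page - 1) / page * page ))
    waste=$((alloc - need))
    pct=$((waste * 100 / alloc))
    echo "page=${page}: need=${need} alloc=${alloc} waste=${waste} (${pct}% wasted)"
done
```

On a 4K-page system the round-up wastes 2560 of 4096 bytes; on a 64K-page system it wastes 64000 of 65536 bytes, i.e. about 97% of every slab, which matches the "massive amount of wasted space" the commit describes.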
System information
I can also reproduce this using:
But not with ZFS 0.8.6. I can't reproduce it at all on a similar x86 system.
Describe the problem you're observing
When scrubbing a pool (4-drive raidz2), memory usage rises until all system memory is exhausted and the kernel panics.
If the scrub is stopped before the kernel panics (`zpool scrub -s`), memory usage drops back to the same level as before the scrub was started.
Describe how to reproduce the problem
This script reproduces the problem:
The used memory increases from 8GB to 72GB in 30 seconds, and returns to 8GB after the scrub is stopped.
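The script itself isn't attached here, but a minimal sketch of the reproducer looks like this (the pool name "tank" and the sample count are assumptions; the zpool calls are guarded so the sampling loop also runs on machines without ZFS installed):

```shell
#!/bin/sh
# Start a scrub, watch vmalloc usage grow, then cancel the scrub.
POOL="${1:-tank}"   # assumed pool name

command -v zpool >/dev/null 2>&1 && zpool scrub "$POOL"

i=0
while [ "$i" -lt 3 ]; do            # the real run sampled for ~30 seconds
    grep VmallocUsed /proc/meminfo  # watch this value climb during the scrub
    sleep 1
    i=$((i + 1))
done

command -v zpool >/dev/null 2>&1 && zpool scrub -s "$POOL"   # stop the scrub
```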
vmalloc seems responsible for the majority of this:
meminfo.before.txt
slabinfo.before.txt
slabtop.before.txt
spl-slab.before.txt
free.before.txt
meminfo.during.txt
slabinfo.during.txt
slabtop.during.txt
spl-slab.during.txt
free.during.txt
meminfo.after.txt
slabinfo.after.txt
slabtop.after.txt
spl-slab.after.txt
free.after.txt
Include any warning/errors/backtraces from the system logs
Last kernel logs (including the OOM killer running) before the kernel panics (unfortunately the panic itself does not get logged to disk):
oom.txt