Random kernel BUG at mm/usercopy.c:99 from SLUB object 'zio_buf_comb_16384' #12543
20.04 is shipping 5.4.x, and the HWE kernel is 5.11.x at this point (20.10 was 5.8.x). So you seem to be running the old 20.10 HWE kernel on 20.04, not "just" 20.04. What version were you running before? What are the dataset/pool properties on one of the systems where this happens? What kind of workload are you running?
I was originally running the 5.13.12 mainline kernel (which doesn't have a debug symbols package), and it produced the same panic. I then switched to the 5.11.x HWE kernel, which also panicked, but due to https://bugs.launchpad.net/ubuntu/+source/linux-hwe-5.11/+bug/1939287 it again has no debug symbols package. I also tried 5.10.x, which is missing its debug symbols package as well. So 5.8.x was the latest Ubuntu kernel I could find with debug symbols that still produced the same panic, which is why I'm running that instead of 5.4.x. The pools all have identical properties to this:
The workload is very heavy random reads/writes on SATA disks. It's a multi-tenant container system with a wide range of apps: DBs, file transfer, streaming data, workstations and more. Around 7% of the disks we run are at 90%+ utilisation most of the time.
Okay, that's some data. What about the enabled features on the pools? E.g. are you exposing any of them over NFS/SMB/iSCSI? (I just touched this code for an NFS issue in much, much older kernels, though that patch isn't in 2.1.0 and I don't think it was remotely related to this issue; I'm just wondering if there are more gremlins lurking...)
Here are the features:
There are 12 disks in the system, all single-disk pools, and all filesystems. The pools share the disks with RAID1 & RAID10 mdadm partitions (ext4). We're not exposing them over any interface such as NFS. We also have the following udev rules:
The only zfs parameter we have set is:
I've tried downgrading to zfs-2.0.4-1, however we're still getting panics:
They always appear to happen directly after an OOM kill:
I have core dumps of a few crashes; please let me know what information you need to help debug the issue.
Interesting. So, based on the issue you referenced this one from, you're triggering an OOM based on a program ignoring cgroup limits and getting murdered when the system is otherwise not RAM-constrained. It seems like docker being involved is probably worth mentioning in bug reports, FYI - it has a way of complicating many otherwise simple things in ways that are difficult to reproduce without it unless you know it's involved.
It looks like the PID which triggered the bug is a thread of the previously killed proc:
OOM kill was disabled, however the bug still persists, so it is not related to the OOM kill event.
If you could put together a minimal reproducer that you could share all the components of, so someone could just run it and have the bug appear with even some percentage chance, it would likely be helpful.
There was another panic this morning hitting the same bug, this time triggered by a different process:
So far I have tried to reproduce it using:
I tried this with and without the OOM killer. When the OOM killer was enabled I set the oom_score_adj of the IO threads higher so that they were killed rather than the VM thread. However, I haven't yet been able to reproduce it myself.
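For reference, a minimal sketch of how the OOM killer can be biased towards specific processes during a reproduction attempt, assuming the target PIDs are already known (the PID values and the helper below are illustrative, not our actual reproducer):

```go
// oomadj.go - sketch: raise oom_score_adj for a set of PIDs so the kernel
// OOM killer prefers them over other processes. PIDs are placeholders.
package main

import (
	"fmt"
	"os"
)

// setOOMScoreAdj writes to /proc/<pid>/oom_score_adj; valid range is
// -1000..1000, and higher values mean "kill this process first".
func setOOMScoreAdj(pid, adj int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", adj)), 0644)
}

func main() {
	ioThreadPIDs := []int{1234, 1235} // placeholder PIDs of the IO threads
	for _, pid := range ioThreadPIDs {
		if err := setOOMScoreAdj(pid, 900); err != nil {
			fmt.Fprintln(os.Stderr, "failed to adjust", pid, ":", err)
		}
	}
}
```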
We're still hitting this bug on a daily basis, here's the latest crash from this morning:
We are currently trialing our own custom OOM killer which uses SIGTERM instead of SIGKILL, however the above crash happened despite this, so either the OOM killer was not fast enough (we'll try adjusting the params), or it's not related to hitting the cgroup RAM limit. The trace looks slightly similar to the following: https://lkml.org/lkml/2020/7/26/73 - we may try using the kernel before this commit to confirm. Pinging @behlendorf for any insights.
We have implemented a custom OOM terminator in Go which uses fsnotify and containerd/cgroups to monitor Docker containers for memory events. Once a container starts emitting critical memory pressure and is over a certain threshold, we send a SIGTERM to the PID with the highest oom_score. This works for most applications which gradually use more RAM; the only situation we have at the moment is where an application spikes its RAM usage faster than we can kill it, and that appears to only happen with ffmpeg. This still results in crashing 1-2 times per week. At the moment that seems the best we can do, and it's much better than the multiple crashes per day. We would still like to debug this further if anyone is able to help.
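For anyone curious, here's a stripped-down sketch of the idea (not our production code): watch a container cgroup's memory.events file and, when it changes, SIGTERM the member process with the highest oom_score. This sketch uses only the standard library plus fsnotify (rather than containerd/cgroups), assumes cgroup v2 (where memory.events modifications are visible via inotify), and the cgroup path and threshold check are placeholders:

```go
// oomterm.go - illustrative sketch of a userspace "OOM terminator".
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"

	"github.com/fsnotify/fsnotify"
)

// victim returns the PID in the cgroup with the highest oom_score.
func victim(cgroupDir string) (int, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, "cgroup.procs"))
	if err != nil {
		return 0, err
	}
	best, bestScore := 0, -1
	for _, field := range strings.Fields(string(data)) {
		pid, err := strconv.Atoi(field)
		if err != nil {
			continue
		}
		raw, err := os.ReadFile(fmt.Sprintf("/proc/%d/oom_score", pid))
		if err != nil {
			continue // process may have already exited
		}
		score, _ := strconv.Atoi(strings.TrimSpace(string(raw)))
		if score > bestScore {
			best, bestScore = pid, score
		}
	}
	if best == 0 {
		return 0, fmt.Errorf("no candidate PID found in %s", cgroupDir)
	}
	return best, nil
}

func main() {
	// Placeholder path; the real tool resolves this per Docker container
	// and also checks memory usage against a threshold before acting.
	cgroupDir := "/sys/fs/cgroup/system.slice/docker-<container-id>.scope"

	w, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()
	if err := w.Add(filepath.Join(cgroupDir, "memory.events")); err != nil {
		log.Fatal(err)
	}

	for ev := range w.Events {
		if ev.Op&fsnotify.Write == 0 {
			continue // only react to counter updates
		}
		pid, err := victim(cgroupDir)
		if err != nil {
			log.Println(err)
			continue
		}
		log.Printf("sending SIGTERM to pid %d", pid)
		_ = syscall.Kill(pid, syscall.SIGTERM)
	}
}
```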
This one is slightly different: the SLUB object here belongs to a different container (a1b14e7ba3a2e170d9a0b4475c61c541811de9f2c19d4d68929c71916b33271c) than the proc mentioned in the bug, which belongs to container (86b70f49d51a778366e5cc853f7c9107911b551decb0164dd0de9867cef88fd9). There didn't appear to be any memory pressure on the containers. I've started running bpftrace on some of these procs in order to debug further.
So the slub object has a limit of 16K:
zio_buf_comb_16384 171771 183223 16384 2 8 : tunables 0 0 0 : slabdata 92951 92951 0
and it looks like we're attempting to write objects into it over 20K. I can see this code was changed recently in 7837845#diff-8bd8fe4c03392e93d0678c0b0f2437252d2dcd1073772c60358cbb2f384bc8de, however as we're on amd64 Linux that shouldn't have changed anything, as afaik our page size has always been 4K. We're going to try setting
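For context, the 16384-byte object size above is the objsize column of /proc/slabinfo. A small sketch of pulling that field out programmatically, assuming the field layout shown in the quoted line (name, active_objs, num_objs, objsize, ...):

```go
// slabsize.go - read the per-object size of a slab cache from /proc/slabinfo.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func slabObjSize(cache string) (int, error) {
	f, err := os.Open("/proc/slabinfo") // usually requires root
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 3 && fields[0] == cache {
			return strconv.Atoi(fields[3]) // objsize in bytes
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("cache %q not found", cache)
}

func main() {
	size, err := slabObjSize("zio_buf_comb_16384")
	if err != nil {
		log.Fatal(err)
	}
	// Copies spanning past this size are what the usercopy hardening check rejects.
	fmt.Printf("zio_buf_comb_16384 object size: %d bytes\n", size)
}
```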
For anybody else hitting this bug in the future, it was resolved by setting
Here's the trace of the bug through the stack:
https://github.com/torvalds/linux/blob/923dcc5eb0c111eccd51cc7ce1658537e3c38b25/mm/slub.c#L4524
zfs/module/os/linux/zfs/zfs_uio.c line 172 in 60ffc1c
zfs/module/os/linux/zfs/zfs_uio.c line 205 in 60ffc1c
I think the bug can be fixed by properly checking the object size against the slab limit, but I would guess that an object larger than the limit should already not be possible? If so, could someone point me to the code which checks the object size? cc @rincebrain
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Unmark stale
After upgrading to 2.1.6 last night we hit #11574 on 50+ servers. All servers locked up and required a reboot into rescue mode, as normal reboots were causing infinite lockups due to OOM. We're currently investigating how to get the servers back to a stable state.
The two easy ways I can think of to make scrubbing OOM a system are having lots of pools (since currently the new scrub caps each scrub at 10% of total system RAM, with no global maximum usage), or memory fragmentation. Does either of those sound plausible? Either way, you could try
@rincebrain we have 12 pools on each server, and the RAM is always under pressure, so 10% usage would add up to 120% RAM without any other processes. We're testing
There's an internal variable for this threshold, so that'd be 60% of system RAM for you, I imagine. It's not exposed as a tunable, but it would be simple to do so. You could also just import one pool, pause its scrub, repeat, and then stagger running them, though I would agree that shouldn't be necessary.
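A rough sketch of the staggering idea, shelling out to zpool so that only one scrub's memory cap is in play at a time. The pool names are placeholders, and `zpool wait -t scrub` assumes OpenZFS 2.x:

```go
// stagger_scrub.go - sketch: run scrubs one pool at a time instead of all at once.
package main

import (
	"log"
	"os/exec"
)

// run executes "zpool <args...>" and logs any output.
func run(args ...string) error {
	out, err := exec.Command("zpool", args...).CombinedOutput()
	if len(out) > 0 {
		log.Printf("zpool %v: %s", args, out)
	}
	return err
}

func main() {
	pools := []string{"tank01", "tank02", "tank03"} // placeholders for the 12 pools

	for _, p := range pools {
		if err := run("scrub", p); err != nil {
			log.Printf("starting scrub on %s failed: %v", p, err)
			continue
		}
		// Block until this pool's scrub finishes before starting the next one.
		if err := run("wait", "-t", "scrub", p); err != nil {
			log.Printf("waiting on %s failed: %v", p, err)
		}
	}
}
```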
System information
Describe the problem you're observing
Kernel randomly panics, our fleet has experienced 13 panics over 2 weeks since upgrading to zfs 2.1.0.
Describe how to reproduce the problem
Install zfs-2.1.0-0york2~20.04 on Ubuntu 20.04.3, run a high load.
Include any warning/errors/backtraces from the system logs