ZFS 0.8: kworker sucking up 100% CPU, various messages in logs mentioning ZFS, system eventually hangs. #9430
Check SMART info for all disks to see if any of them are about to fail; also watch out for SCSI/disk errors in dmesg, as this could be an error occurring after some disk hiccup. Also make sure there is enough free space on both pools.
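For example, something along these lines (device names are just examples, and this assumes smartmontools is installed):

```sh
# Quick SMART health check on every disk (device names are examples)
for d in /dev/sd?; do
    smartctl -H "$d"
done

# Look for SCSI / I/O errors that might point at a disk hiccup
dmesg | grep -iE 'i/o error|blk_update_request|scsi'

# Make sure neither pool is close to full
zpool list -o name,size,alloc,free,cap,health
```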
Both pools are nowhere near full. SMART data on all disks is fine. If the problem was caused by the disks, I would expect to see high iowait. However, there is no iowait.
I would suggest trying v0.8.2 since there was a kworker/rollback deadlock fixed in that release.
Alright, I'll give that a try. I see #7038 was referenced there, and it looks to me like this may be the same issue as that.
I've updated the system to 0.8.2. I think I've pinpointed the thing that's triggering the problem (snapfuse in an LXC container spinning, doing lots of I/O, sucking up an entire core) since that seems to have popped up around the same time, but I'll forgo fixing that for a few days to see if I can confirm whether 0.8.2 fixes the problem.
Alright. Prior to updating to 0.8.2, it took 12-24 hours for this issue to crop up, and it's been 2 days with no sign of it, so I'm inclined to believe that this is now fixed. Thanks!
I've run into this with 0.8.2.
Actually, yep, looks like last night it occurred again. Based on my sample size of 1, 0.8.2 seems to have improved it, since it took 4 days to occur rather than less than one day, but it has occurred again. The system is currently still responsive. Before it completely locks up and forces me to reboot, and before I just disable the VM that I am fairly sure is causing the problem, is there any debugging info I can gather to assist in determining the root cause of the problem?
Perhaps attach zpool status and zfs get all output?
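For example, redirecting to files for attaching (pool names below are placeholders; substitute your own):

```sh
zpool status -v > zpool-status.txt
zfs get all rpool > zfs-get-all-rpool.txt
zfs get all storage > zfs-get-all-storage.txt
```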
lz4 compression is enabled, defaults otherwise.
This has just happened again:
While attempting to compile libcap.
I'm tempted to go back to using git master. When I was on zfs-9999 just prior to the 0.8.2 release I didn't see this (I was using a git build as 0.8.x didn't build against 5.2.x kernels at the time), and I did note that when I switched to stable 0.8.2, I had unsupported "future" features on my pool (I rebuilt it so I had 0.8.2 clean).
Might it be worth comparing some system configurations?
The disks are connected to a generic LSI SAS2 HBA. The rpool is a single Crucial MX500 SSD connected to SAS port 0 on the HBA. The entire storage pool is encrypted. The disks in the storage pool are Hitachi Ultrastar 7K4000 SAS HDDs, connected to ports 4-7. The system is a Dell PowerEdge R415, with 2x AMD Opteron(tm) 4365 EE processors and 80 GB of RAM (4x4GB + 4x16GB).

I've narrowed down what I believe to be the trigger (a broken snap package inside of an LXC container; the container is on the storage pool, and snapfuse is sucking up an entire core and reading from disk at a constant roughly 6 MB/sec), but I have avoided fixing it, both to verify whether the issue was fixed (it's not) and to assist in diagnosing the root cause. The next time this system completely locks up and I have to hard reboot it, I'm fixing the problem that I believe is triggering it, so if there are any commands I can run or logs I can fetch that would assist the ZFS developers in determining the root cause, I need to know what those are sooner rather than later.
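Not an authoritative list, but the things that tend to be useful in reports like this are the kernel stack of the spinning kworker, the ZFS debug log, and a CPU profile. A rough sketch (the PID is a placeholder, and the debug log may need the zfs_dbgmsg_enable module parameter set to be fully populated):

```sh
# Kernel stack of the spinning kworker (run a few times; needs root)
cat /proc/<kworker-pid>/stack

# ZFS internal debug log, exposed through the SPL kstat interface
cat /proc/spl/kstat/zfs/dbgmsg > zfs-dbgmsg.txt

# Where the kworker is burning CPU (needs perf installed)
perf top -p <kworker-pid>
```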
What brought you to the snap package? I ask because I have one on my system that, as it happens, is active when the issue occurs. At this point I'm sure it's coincidence, buuut...
System is a Threadripper 1950X, 64GB RAM, ASRock mobo. Samsung NVMe disks in rpool, a ten-year-old WD Black and a couple of SSDs in misc. Kernel 5.2.16. It's rpool that's active and hangs when the issue occurs.
Could you describe what snapfuse is/does? There doesn't seem to be much information about it on the net... I can't find a manpage or a git repo...
snap packages are distributed as squashfs images. Normally, those are just mounted using the native kernel support as loopback devices, but in my case, I have snap running inside of an LXC container. You can't use loop devices inside of LXC. As a workaround for this, snap comes with snapfuse, which is an implementation of squashfs in userspace using FUSE, which is supported inside of LXC containers.
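For illustration, the two ways the same snap image can end up mounted (paths and the revision number are made up; as far as I understand, snapfuse is essentially a build of squashfuse shipped with snapd):

```sh
# On the host: the kernel mounts the snap's squashfs image over a loop device
mount -t squashfs -o ro,loop /var/lib/snapd/snaps/nextcloud_123.snap /snap/nextcloud/123

# Inside an LXC container, loop devices aren't available, so the image is
# mounted through a userspace (FUSE) squashfs implementation instead
squashfuse -o ro /var/lib/snapd/snaps/nextcloud_123.snap /snap/nextcloud/123
```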
Hmm, fuse you say...
It looks to me like the only things actually hanging here are
I'm going to replace
I'm unsure, I only saw emerge and gzip "hang" last time but forgot to grab a full process list before rebooting (needed the system back ASAP). When it happens again I'll check. Does it matter re: sync if it's just replaced with true or something when it comes to ZFS?
I'm not sure. It's certainly not optimal, but I would assume it can't really get any worse than the process just hanging in uninterruptible sleep.
Mind that franz has severe memory leaks.
I think this is normal: with every hanging sync there is one more process waiting for I/O, and every process in the run queue waiting for completion adds to the system load (it does not need to burn CPU). You can easily demonstrate this with stale NFS mounts.
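You can see this directly: tasks in uninterruptible sleep count toward the load average even though they use no CPU. For example:

```sh
# List tasks currently in uninterruptible sleep (state D) and what they
# are waiting on in the kernel
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# Compare with the load average
uptime
```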
Instead of rebooting, please try stopping all LXC stuff and unmounting all FUSE mounts; if that doesn't work normally, try -f (force). Then check whether the sync processes go away, or try to kill them. Maybe the sync is hanging because of FUSE...
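Something along these lines (the mount path is a placeholder):

```sh
# Find the FUSE mounts (snapfuse mounts show up with a fuse.* filesystem type)
grep fuse /proc/mounts

# Try a normal unmount first, then force, then lazy as a last resort
umount /path/to/fuse/mount
umount -f /path/to/fuse/mount
umount -l /path/to/fuse/mount
```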
When sync hangs in D state inside the container, you cannot stop the LXC container. Any attempt to do so will just freeze. Anyway, I will say that 0.8.2 has certainly, at the very least, made this problem slightly better. On 0.8.1 it was consistently occurring on a daily basis; on 0.8.2 it went about 4 days the first time, and I'm at almost 5 days since then now and it's still working fine. I'll update the issue when it next occurs.
Nope. By definition, when a task is in D state (uninterruptible sleep), absolutely nothing (short of a hard reboot) will end it. Because restarting an LXC container requires killing all processes, and you cannot kill a process in D state, you cannot restart an LXC container with any child processes in D state.
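For what it's worth, the most useful thing you can usually grab from a task stuck in D state, short of rebooting, is its kernel stack (the PID below is a placeholder):

```sh
# Kernel stack of the stuck task (needs root)
cat /proc/<pid>/stack

# Or dump the stacks of all blocked (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
```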
This has hit me again. Anyone experiencing this on kernels <5.0.0?
I'm experiencing this on
We were hit by the problem on 4.19. The relevant thread shows, after sysrq: [6756902.750815] CPU: 12 PID: 37541 Comm: kworker/u49:4 Tainted: P W O 4.19.86-gentoo #1
Looks like the kworker thread is spinning and reading metadata, while not committing to the ZIL. The ZIL commit count has been slightly reduced since the event at 5/3, 2:20 A.M. After a while, more and more programs hang behind this kworker thread, as one can see in the load graph. The gap in the graphs is not correlated with this particular system; there was a reconfiguration of the monitoring database.
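If anyone else wants to watch the same counter without a full monitoring stack, the ZIL statistics (including zil_commit_count) are exposed through the SPL kstat interface, at least on 0.8.x as far as I know:

```sh
# Global ZIL statistics; watch zil_commit_count over time
cat /proc/spl/kstat/zfs/zil
```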
Mhm, maybe I should've posted here instead of #11754 (comment). Did everyone else see this resolved somehow? @alexanderhaensch, your reported setup was very similar to mine.
Encountered this today. Very similar traceback to the one reported above.
log: https://gist.github.com/WhittlesJr/e15487bf5b2a9835249c655ca424faf0 top:
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
This does not look like it has been resolved. Please reopen.
System information
Describe the problem you're observing
After some hours of system uptime, a kworker process will suck up 100% of a core. The system load averages will climb constantly. This kworker process appears to be ZFS related (as of this writing, the current CPU hog is named `kworker/u33:3+flush-zfs-19`).

This system was originally on Ubuntu Server 18.04 with ZFS 0.8.1 that I had compiled some time before. It was at several months of uptime with no issues. Out of nowhere, this issue came up.
Things that have been tried:
I've tried everything I can think of, and since various ZFS-related terms come up in the kworker name and messages printed to dmesg, I am left to assume it is a problem in ZFS.
Describe how to reproduce the problem
Unknown. The system this is occurring on has been stable, with minimal changes, for about a year. The system is a host to various LXC containers. The root filesystem is ZFS, on a single 500GB Crucial SSD. All data for the LXC containers is stored on an encrypted storage pool. Output of `zpool status`:
Include any warning/errors/backtraces from the system logs
Not sure what would be helpful for debugging, but here's some stuff that was printed to dmesg:
I can grab any other debugging information that would be helpful, just let me know what commands to run to grab it.
EDIT: Oh, and I suspect it may have SOMETHING to do with one particular LXC container. The container is stored in a ZFS dataset. Inside the container, the only thing running is Nextcloud, installed from Snapcraft.