Deadlock when writing to ZFS 2.0.1 filesystem from Docker container #11476
@Do-while-Bacon: when you say "zfs volumes" do you mean ZVOLs or volumes as files in the ZFS filesystem?
@sempervictus Sorry for the confusion. I just meant passing through directories on a ZFS filesystem to a docker container.
This may be the same issue that I, and some others, have been experiencing since upgrading to ZFS 2.0.1 on Unraid. We've even gone to the extent of building kernels with 2.0.0 and 2.0.1 separately and swapping them out. The issue is repeatable: on 2.0.1 it always occurs, and on 2.0.0 it never does. This prevents us from using 2.0.1. I can post logs if need be. The Unraid thread starts here, but the basics are that the Docker tab stops being responsive, and in one case Docker exhibits 100% CPU usage.
@marshalleq This may be asking a lot, but would you be willing to bisect the responsible commit?
Hey, I would be willing to do most things, but in this case I'm not actually sure what you're asking. :)
Ah! This is performing a binary search through the commit history. You've established that 2.0.0 is "good" and 2.0.1 is "bad"; you then compile the commit halfway between them and test again, which tells you whether the culprit commit is before or after that midpoint. This will take about six recompile/retest cycles, because 2^6 is roughly the number of commits between 2.0.0 and 2.0.1. Here's a tutorial (it's for Wine, but you can skip the Wine-specific parts, since you already know how to compile ZFS).
Oh, I see. OK, I'll see if I can do that, no problem. I may have to wait until the weekend, as I'm away for work, but one of the other Unraid guys may also be able to help. Thanks.
I am having the same problem with Unraid. I can only tell it's happening when running some Docker containers (like amp-dockerized or jdownloaded) and not others. The problem does not appear in 2.0.0 and started with 2.0.1. CPU usage is pinned at 100%, making the server basically unusable and forcing a hard reboot to use it again.
@aerusso I tested a bit and built some custom kernels based on a few commits, and I figured out that the issue started with commit 1c2358c. I really hope this helps identify the problem and get it fixed.
PR #11484 contains the proposed fix which applies cleanly to OpenZFS 2.0.1. Any help verifying the fix would be welcome so we can get it applied for 2.0.2. |
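For anyone wanting to help verify, a typical way to apply a GitHub PR on top of a release tag and rebuild looks roughly like this (a generic sketch; the `.patch` URL form is GitHub's standard endpoint, but configure flags and packaging steps vary by distro):

```shell
# Sketch: apply the PR 11484 patch on top of the zfs-2.0.1 tag and rebuild.
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.0.1
curl -L https://github.com/openzfs/zfs/pull/11484.patch | git am
sh autogen.sh
./configure
make -j"$(nproc)"
sudo make install
```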
Alright, going to build a custom kernel based on that commit. This might take 30 minutes to an hour; I'll give a result ASAP.
Ok, I might have made a mistake building the kernel somewhere, but it got built. I uploaded it to my server and, no, the problem is not fixed for me. I will try to build the kernel again in case something went wrong, and I'll come back here with a new result.
@Joly0 what kernel version are you testing with?
5.10.1
Ok, I rebuilt the kernel and made sure it's using the right commit, but my CPU still gets pinned at 100% when starting amp-dockerized (I didn't check with any other container). Something I noticed is that zfs --version returns "zfs-2.0.0-rc1 zfs-kmod-2.0.0-rc1". Is that correct, or did something fail with my build again?
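As an aside, the reported version string is generated at build time from the source tree, so a stale or mismatched checkout can report an unexpected version. A generic way to check which module is actually loaded (illustrative commands, not from this thread):

```shell
zfs --version                  # userland tools plus kmod version, as reported by zfs
cat /sys/module/zfs/version    # version string of the currently loaded kernel module
modinfo zfs | grep -i version  # version of the module file on disk
```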
I cannot easily test this on my NAS because it is, well, let's just call it "production." I tried to reproduce the issue in VMware by creating a fresh CentOS 7.9 VM, installing kmod ZFS 2.0.1, creating a RAID-Z2 out of six virtual disks, adding two more virtual drives as the L2ARC and SLOG, and setting options to match my NAS (xattr=sa, compression=lz4, atime=off). I started up the Sabnzbd container in Docker CE 20.10 and downloaded several NZB binaries. Unfortunately, all the downloads completed successfully and I was not able to reproduce the deadlock. I'm not sure what's missing in my sandbox. Perhaps there's a hardware factor that VMware cannot replicate?
Ok, someone built the kernel for me again and it should have worked, but some or all of my CPU cores still get pinned at 100%, and my server is basically unusable when running certain Docker containers. I can't stop those containers, stop the array, or even restart the server unless I hard-reboot it.
@Joly0 @Do-while-Bacon thanks. I should have mentioned that the fix is only relevant for kernels older than 4.9, like the CentOS 7.9 3.10.0-1160.11.1 kernel. Newer kernels shouldn't ever have been affected by this particular issue. If you're able to reproduce a similar issue with a 5.10 kernel, then it sounds like a different, but possibly related, issue to be investigated.
That's definitely possible in this case. If someone can recommend an easy way to reproduce the issue with the patch applied, that would be helpful.
@behlendorf Is there anything I can help with in investigating my issue? Should I open a new bug report, or is it ok to stick with this one, as it might be related?
@Joly0 I think we can track it here for now. I haven't been able to reproduce the issue you're seeing. If you could distill it down to a minimal reproducer using Docker, or something else, on a stock kernel from a major distribution (Ubuntu, Debian, Fedora, CentOS, etc.), that would be very helpful.
I've applied the proposed fix as a patch to the 2.0.1 tag and compiled a set of kmod packages. Will report back as soon as I can find a maintenance window to update my system. Thanks!
I might be running into a similar issue. I've been having problems with ZVOLs being used by VMs locking up my system ever since the last kernel upgrade. I'm using Arch Linux kernel version 5.10.7 with zfs-2.0.1-1, and I could reliably reproduce the complete system freezes within minutes. Switching to the LTS kernel, which is 5.4.90, solved the issue, and it's been running stable since then. If you think this might be a similar bug, I can try to help figure this out; if you think it's something else, I will open a separate issue. Here's an excerpt of the kernel errors I was getting; the results varied from lockups that took minutes to resolve to complete kernel panics: EDIT: It appears the issue is not resolved by 5.4 after all. It might be related to #9130, as I am experiencing similar hung-task issues.
I was able to test PR #11484 today and it resolved the problem for me. Thank you @behlendorf. Perhaps this issue should be closed and the kernel 5.10 users should open another?
@Do-while-Bacon thanks for confirming the fix. We'll get it in zfs-2.0.2. To be clear, this fix is only relevant for users with 4.9 and older kernels.
Since it seems like there are two issues going on here, this makes sense to me in order to avoid confusion. @Joly0, would you mind opening a new issue for the problem you're seeing?
System information
Describe the problem you're observing
After updating my NAS from ZFS 0.8.5 to 2.0.1 and performing a zpool upgrade, I started encountering an issue where my Docker containers (running locally on the NAS) that mounted ZFS volumes would deadlock at 100% CPU as soon as they tried to perform any heavy write operations to the pool. I observed this behavior with both the linuxserver.io Sabnzbd and qBittorrent containers. The containers would appear to function normally until I tried to download a Linux ISO; then the download would get stuck, the container would lock at 100% CPU, and nothing would kill or stop the container until I rebooted.
I was able to work around this issue by downgrading ZFS packages to 2.0.0. Everything is working correctly again.
Describe how to reproduce the problem
Create a RAID-Z2 pool using OpenZFS 2.0.1 on CentOS 7.9 (my pool has both an L2ARC and a SLOG device)
Install Docker CE 20.10 (problem occurs with 19.03 too)
Launch a linuxserver.io Sabnzbd container, passing a ZFS volume to /config and /downloads
Attempt to download a NZB
Download will begin and then immediately deadlock
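The steps above might look roughly like this as a shell session. Device names, dataset names, ports, and the container image tag are illustrative; the dataset options are the ones the reporter says match the affected NAS (xattr=sa, compression=lz4, atime=off):

```shell
# 1. Pool layout matching the report: RAID-Z2 plus L2ARC and SLOG devices.
zpool create tank raidz2 sda sdb sdc sdd sde sdf \
    cache sdg \
    log sdh
zfs create tank/config
zfs create tank/downloads
zfs set xattr=sa compression=lz4 atime=off tank

# 2. Docker CE 20.10 running the linuxserver.io Sabnzbd image, with the
#    ZFS datasets bind-mounted into the container.
docker run -d --name sabnzbd \
    -v /tank/config:/config \
    -v /tank/downloads:/downloads \
    -p 8080:8080 \
    linuxserver/sabnzbd

# 3. Queue an NZB in the web UI; on affected 2.0.1 systems the download
#    reportedly starts and then deadlocks at 100% CPU.
```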
Include any warning/errors/backtraces from the system logs
There was no relevant log output from the Docker application or in syslog; however, I did strace the process while it was locked at 100% CPU, and it was repeating this system call over and over:
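A generic way to capture such a trace from a wedged container process (the container name is illustrative):

```shell
# Find the container's main PID on the host, then attach strace to it.
pid=$(docker inspect --format '{{.State.Pid}}' sabnzbd)
sudo strace -f -p "$pid" -o /tmp/sab.strace   # -f follows threads; raw syscall trace
sudo strace -f -p "$pid" -c                   # or: per-syscall counts (Ctrl-C to stop)
```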