Deadlock when writing to ZFS 2.0.1 filesystem from Docker container (Kernel 5.10) #11523
Comments
Just leaving a comment here that I too am having this issue: the whole Docker screen becomes unresponsive until I downgrade to ZFS 2.0.0, which I am now running. I have a spare machine I can test with, but note that this occurs on both machines (one AMD Threadripper 1950X and one dual-CPU Intel Xeon machine). Same details as Joly0 has posted regarding what I'm running it on.
When the system gets stuck in this state, is anything logged to `dmesg`?

[Edit] I've done some manual testing on Fedora 33 with the distro 5.10.9-201 kernel running Docker on top of ZFS. Unfortunately, so far I've been unable to reproduce the issue with any of the common containers I've tried.
Hm, dmesg doesn't tell me anything; ps aux only shows a process called loop2 sitting at 40% CPU while the server is doing nothing. I will try to investigate further, but as I am no Linux expert and don't know all the commands or exactly where to look, it's hard for me.
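For anyone stuck in the same state, a few generic Linux diagnostics (run as root) can capture more than `dmesg` alone. This is only a sketch; nothing here is Unraid-specific, and the `loop2` lookup simply reuses the process name reported above:

```sh
# Dump kernel stacks of blocked (D-state) tasks into the kernel log
# (sysrq may need enabling first)
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# List tasks stuck in uninterruptible sleep and the kernel symbol they wait on
ps -eo pid,stat,wchan:40,comm | awk 'NR==1 || $2 ~ /^D/'

# Kernel stack of the busy loop worker (the name "loop2" is taken from ps above)
cat /proc/"$(pgrep -x loop2 | head -n 1)"/stack
```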
I think I just found out something very, very interesting: it looks like (although I only tested it for 5 minutes) this issue happens only if the docker.img file that Docker creates is on the ZFS volume. When I copy it over to my btrfs volume, I have no such problems: CPU usage is normal, and I can start, stop, and restart the containers without a problem. But as I said, I have only tested this for a few minutes and only with one container. @marshalleq Can you test and confirm that please? @behlendorf I hope this is helpful.
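For anyone trying the same experiment outside Unraid (where there is no docker.img), the rough equivalent is moving Docker's data directory off the ZFS dataset. This is only a sketch; the destination path is an example, and `data-root` is Docker's standard daemon.json key for relocating its storage:

```sh
# Stop Docker before touching its data directory
systemctl stop docker

# Copy the existing Docker data to a non-ZFS filesystem (example destination)
rsync -aHAX /var/lib/docker/ /mnt/btrfs-pool/docker/

# Point Docker at the new location (overwrites daemon.json in this sketch)
echo '{ "data-root": "/mnt/btrfs-pool/docker" }' > /etc/docker/daemon.json

systemctl start docker
```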
For others watching: Unraid keeps the Docker core files either in an optional directory or, more typically, inside an image, which is probably the loop device mentioned above. The config, database, etc. that typically goes with each Docker image is not stored in this, however. In Unraid it is known that you can't put this core Docker config into a directory if that directory is on ZFS (apparently there's some ZFS driver that needs to be compiled in somewhere; I could dig up that info if necessary). However, Docker has always worked on ZFS in the past if it was kept within the image, so this ZFS version has definitely got something different about it. Also, that Docker image can now be formatted XFS or btrfs, and I haven't known either to cause issues when placed on ZFS in the past; perhaps we were lucky. I'll give it a go soon to confirm your findings on my system @Joly0 and report back.
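To confirm whether the busy loop device really is backed by a file sitting on ZFS, something along these lines should work on most systems. The device name and path below are examples, not values from the report:

```sh
# Which file backs each loop device?
losetup -a

# Which filesystem holds that backing file? (path is an example)
df -hT /mnt/zfs/docker/docker.img

# How is the image itself formatted, and where is it mounted?
blkid /dev/loop2
findmnt --source /dev/loop2
```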
@Joly0 please bear with me. I actually have no non-ZFS volumes within Unraid now, only a dummy USB stick to convince Unraid I have an array, and my docker.img is bigger than that. But I have a plan which will bear fruit shortly.
OK, now done. Confirming that so far, with Unraid 6.9.0-rc2 and ZFS 2.0.1-1 installed and the Docker image on a non-ZFS drive, the Docker tab is no longer unresponsive.
@behlendorf are you able to explain in simpler language what the commit @Joly0 mentioned above does? I tried searching for it for a while but didn't find anything. I'm trying to understand the relationship between having a loop device/image on ZFS and the commit above; then, at least in theory, we can try to find out whether this is ZFS 2.0.1-1 triggering something Unraid-specific or something entirely within ZFS. Thanks.
@marshalleq sure, let me try and summarize. The referenced commit was needed to resolve a regression.

Why this would only cause an issue for some Docker containers on Unraid isn't at all clear. To my knowledge this is the only environment where we've seen these problems. It's possible those containers are doing something slightly non-standard which is causing the issue. If so, we'll need to somehow identify what that is. But the updated code has passed every functional test we've thrown at it, so what that might be isn't clear.
I tested with ZFS 2.0.2 and I can tell that the problem still persists. As soon as the docker.img file is on the ZFS array and some Docker containers are started, the whole system is basically unusable.
By the way, I can tell this is still a problem in the latest OpenZFS version, 2.1.0, using the latest Unraid version, 6.9.2.
I have this problem also on 5.10, ZFS 2.0.5, and mirrored Toshiba KXG60ZNV1T02 NVMe drives, but it's much harder to pinpoint. After days or even weeks of running normally it will happen suddenly. Processes can't be killed, and some things won't be able to write to disk while others still work for some reason. I found that one of the disks in the pool becomes unresponsive to scheduler changes or attempts to reset it. After a reboot, everything is fine.
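When the pool gets into that state, it may help to check whether one side of the mirror is reporting slow or outstanding I/O. These are standard ZFS and sysstat commands offered only as a sketch; the device names are examples:

```sh
# Per-vdev error counters, including slow I/Os on recent ZFS releases
zpool status -s

# Live per-device latency and queue statistics for the pool
zpool iostat -vly 1 5

# Kernel-level view of the same devices (names are examples)
iostat -x 1 5 nvme0n1 nvme1n1
```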
Continuation of this bug report #11476 for Kernel 5.10
| Type | Version/Name |
| --- | --- |
| Distribution Name | Unraid / Slackware |
| Distribution Version | 6.9 RC2 |
| Linux Kernel | 5.10.1 |
| Architecture | amd64 |
| ZFS Version | 2.0.1 |
| SPL Version | |
Describe the problem you're observing
When upgrading ZFS to 2.0.1 with kernel 5.10.1, some Docker containers (in my case amp-dockerized and jdownloader; I haven't tested any others) pin the CPU on some or all cores at 100%, making the system basically unusable. I can't restart the server, can't stop the containers (can't even kill them), or do anything else; I am forced to hard-reboot the server.

When downgrading to ZFS 2.0.0 or kernel 5.9.13 this problem does not occur, but on the other hand, when using 2.0.0 with kernel 5.9.13, my lancache can't write to the ZFS volume because some syscalls are missing (at least that's what some people on their Discord told me).
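If the missing-syscall theory needs checking, one way is to trace the failing writer from the host and see which call actually returns an error. This is only a sketch: the PID is a placeholder and the syscall list is just a guess at likely candidates:

```sh
# Attach to the writing process on the host (PID 12345 is only an example)
# and watch write-related syscalls for errors such as ENOSYS or EOPNOTSUPP
strace -f -p 12345 -e trace=write,pwrite64,copy_file_range,fallocate
```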
This bug started with commit 1c2358c.
Describe how to reproduce the problem
Update Unraid to 6.9 RC2, install ZFS 2.0.1 with the ZFS plugin, create a ZFS array, and start one of the containers mentioned above (amp-dockerized needs a licence).
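The report is Unraid-specific, but a rough approximation of the same layout can be put together by hand on any distribution. This is only a sketch: the pool name, image sizes, paths, and the busybox write loop are stand-ins, not the containers from the report:

```sh
# File-backed test pool (name and size are made up)
truncate -s 20G /var/tmp/tank.img
zpool create tank /var/tmp/tank.img

# Emulate Unraid's docker.img: a loop-mounted btrfs image stored on ZFS
truncate -s 10G /tank/docker.img
mkfs.btrfs /tank/docker.img
systemctl stop docker
mount -o loop /tank/docker.img /var/lib/docker
systemctl start docker

# Start a container that writes continuously, then watch for hung tasks
docker run -d --name writer busybox sh -c 'while true; do dd if=/dev/zero of=/tmp/x bs=1M count=256; done'
dmesg -w
```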