live-lock in arc_reclaim, blocking any pool IO #4166
Comments
@BjoKaSH thanks for posting this. This does look like the second issue currently being discussed in #4106. The stacks are slightly different but show the deadlock more clearly. Here's the critical bit from the stacks you posted, with annotations for clarity.
I think what has happened is clear; the mystery still lies in exactly how it happened. If you're familiar with the kgdb or crash utilities, it would be helpful if you could dump the contents of both the hash_lock mutex and the b_l1hdr.b_cv as a starting point.
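A rough sketch of how such a dump can be taken with the crash utility (the addresses are placeholders to be read off the stack traces, not values from this report, and the kernel and module debuginfo packages must be installed):

    crash /usr/lib/debug/boot/vmlinux-$(uname -r)     # live-system session
    crash> mod -S                          # load debuginfo for the loaded modules (spl, zfs)
    crash> bt <pid>                        # back trace of a stuck task
    crash> struct mutex <hash_lock-addr>   # owner, wait_list, ...
    crash> rd <b_cv-addr> 16               # raw words of the kcondvar_t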
@behlendorf Sooo, I am probably too late, but anyway, here comes the data (dumps of the hash_lock mutex, the hdr, and the b_l1hdr.b_cv):
It took a bit longer to locate the data. It turned out kgdb doesn't work with that box, since it has no serial port. So I had to use plain GDB against /proc/kcore and "unwind" the stacks manually (by looking at "disass ..." output and memory dumps of the stack pages), which is a bit messy. The box is still alive, so I can try to retrieve more data if still needed.
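For the record, a minimal sketch of that /proc/kcore approach (the paths and placeholder addresses are assumptions, not values from this box; the kernel debug symbols package has to be installed):

    sudo gdb /usr/lib/debug/boot/vmlinux-$(uname -r) /proc/kcore
    # map in the module symbols at the address where zfs.ko is loaded
    # (readable from /sys/module/zfs/sections/.text as root)
    (gdb) add-symbol-file /path/to/zfs.ko 0x<text-addr>
    (gdb) print *(struct mutex *) 0x<hash_lock-addr>
    (gdb) x/16gx 0x<b_cv-addr>           # raw words of the condition variable
    (gdb) disassemble arc_read           # combined with stack-page dumps for manual unwinding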
If a thread is holding a mutex when doing cv_destroy, it might end up waiting for a thread in cv_wait. The waiter would wake up trying to acquire the same mutex and cause a deadlock. We solve this by moving the mutex_enter to the bottom of cv_wait, so that the waiter releases the cv first, allowing cv_destroy to succeed and have a chance to free the mutex. This creates a race condition on the cv_mutex; we use xchg to set and check it to ensure we won't be harmed by the race, which means cv_mutex debugging becomes best-effort. The change also reveals a race, previously unlikely, where mutex_destroy is called while test threads are still holding the mutex; we use kthread_stop to make sure the threads have exited before mutex_destroy.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs/zfs#4166
Issue openzfs/zfs#4106
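To make the ordering concrete, here is a heavily simplified sketch of the idea in that commit; the types and helper names below (sketch_*, the refs counter) are stand-ins, not the actual SPL kcondvar_t implementation:

    /*
     * Stand-in types and helpers; only the ORDER of operations matters.
     * cv_destroy() is assumed to spin until refs drops to zero.
     */
    typedef struct { int refs;  } sketch_cv_t;
    typedef struct { int owner; } sketch_mutex_t;

    void sketch_mutex_enter(sketch_mutex_t *mp);
    void sketch_mutex_exit(sketch_mutex_t *mp);
    void sketch_block_until_signalled(sketch_cv_t *cvp);
    void sketch_ref(sketch_cv_t *cvp);
    void sketch_unref(sketch_cv_t *cvp);

    /*
     * Old ordering (deadlock-prone): the waiter retakes the mutex before
     * dropping its cv reference.  If the signaller calls cv_destroy() while
     * still holding the mutex, it spins on refs > 0, the waiter blocks on
     * the mutex, and neither side makes progress.
     */
    void sketch_cv_wait_old(sketch_cv_t *cvp, sketch_mutex_t *mp)
    {
            sketch_ref(cvp);
            sketch_mutex_exit(mp);
            sketch_block_until_signalled(cvp);
            sketch_mutex_enter(mp);    /* blocks while the destroyer holds mp */
            sketch_unref(cvp);         /* never reached -> cv_destroy() spins */
    }

    /*
     * New ordering from the commit: release the cv first, then take the
     * mutex, so a cv_destroy() running under the mutex can complete and
     * eventually drop the mutex for the waiter.
     */
    void sketch_cv_wait_new(sketch_cv_t *cvp, sketch_mutex_t *mp)
    {
            sketch_ref(cvp);
            sketch_mutex_exit(mp);
            sketch_block_until_signalled(cvp);
            sketch_unref(cvp);         /* cv_destroy() can finish now */
            sketch_mutex_enter(mp);    /* moved to the very bottom    */
    }

The xchg on cv_mutex mentioned in the commit message is what keeps the now-racy cv_mutex debugging check from misfiring; it is not modelled in this sketch.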
@kernelOfTruth I've tried once in a while, but Debian packages are largely black magic to me, and anything dkms-related is deep black magic. Nevertheless I'll try again in the next few days. Right now I wonder if there is a way to untangle the two threads so that I can get a clean shutdown instead of just power-cycling the box, or whether any attempt to modify the state would always cause disaster. @behlendorf were the contents of the hash_lock mutex and the b_l1hdr.b_cv above still useful, or do you need anything else?
@BjoKaSH thanks for the debugging, we got what we needed from the back traces. The fix for this has been merged to the zfs master branch. You can roll custom Debian packages if you need to: http://zfsonlinux.org/generic-deb.html
Referenced commit: openzfs/spl@e843553 "Don't hold mutex until release cv in cv_wait"
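A rough outline of that generic-deb procedure, from memory (the dependency list and build steps may differ per release, so treat this as a sketch and follow the linked page for the authoritative instructions):

    # approximate build prerequisites
    sudo apt-get install build-essential autoconf libtool gawk alien fakeroot \
        rpm linux-headers-$(uname -r) zlib1g-dev uuid-dev libblkid-dev

    # SPL first, then ZFS, each packaged via the "make deb" target
    git clone https://github.com/zfsonlinux/spl.git
    cd spl && sh autogen.sh && ./configure && make deb && sudo dpkg -i *.deb && cd ..

    git clone https://github.com/zfsonlinux/zfs.git
    cd zfs && sh autogen.sh && ./configure && make deb && sudo dpkg -i *.deb && cd ..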
@BjoKaSH it looks like there are weekly builds available at https://launchpad.net/~zfs-native/+archive/ubuntu/daily, so I assume the next build should include these fixes and the latest master changes.
After upgrading to 0.6.5.3 (Ubuntu 14.04 with ZFS 0.6.5.3-1~trusty), pool IO hangs under moderate load, with arc_reclaim in state "D" (but slowly spinning).
This may be related to issue #4106, although the workload and stack traces are different. In my case arc_reclaim is using very little CPU (~1%) and no NFS is involved. The stack traces of processes accessing the pool contain arc_read and, further down, buf_hash_find.
The box is still running and at least partially responsive. If helpful, I can (try to) attach (k)gdb and try to get some more insights on the internal state of ZFS. But I would need some guidance as I am unfamiliar with the ZoL code base.
Workload causing hang:
The box was doing an "rsync ... --link-dest /some/zfs/dir /some/zfs/otherdir", receiving a file tree of some 600,000 files and directories from a remote box. The involved ZFS dataset is deduped, with a dedup ratio of 2.1.
System state:
The ZFS pool is a 2-way mirror:
The box has a second pool, but that was not used at the time. I have not tried to touch the second pool in any way in order to prevent further lock-up. Its state is unknown.
The box has 16GB of RAM, of which 5GB are currently free
Kernel is 3.13.0-71-generic #114-Ubuntu
The box has one CPU with 2 cores. One is idle, the other is sitting in "wait" state (atop output):
The load of "7.03" corresponds to the seven tasks in state "D" (excerpt from "top"):
The hanging zpool command was trying to start a scrub of the pool (run by cron).
The find was started manually to look for some files; its failure to come back led to the discovery of the hang.
The rsync (run by cron) got stuck some days earlier, but went unnoticed until the interactive find didn't respond.
Interestingly, I could "ls" the base of the tree searched by find right before starting find. Maybe the data was already in the cache.
There are 935 ZFS-related tasks:
I have no idea where the z_unmount tasks come from; there should have been no mount/umount activity going on.
Relevant stack traces according to sysrq-t
arc_reclaim : stack trace (1)
(Yes, it has been hanging for at least two weeks - I was away and couldn't investigate until now.)
z_wr_iss : stack trace (2)
txg_sync -> stack trace (3, the short running)
txg_sync -> stack trace (4, the long running)
rsync
find
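The traces above came from the kernel's SysRq task dump; assuming the magic SysRq key is enabled, such a dump can be triggered like this (as root):

    echo 1 > /proc/sys/kernel/sysrq    # enable SysRq functions if needed
    echo t > /proc/sysrq-trigger       # dump state and stack of every task to the kernel log
    dmesg                              # or check /var/log/kern.log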