dedup and large unlink()s or 'zfs destroy' can cause stalls, deadlocks, and other badness #3725
Hi, deduplication uses a LOT of memory. I would say that even 8 GB for a 500 GB dataset is too small. It's definitely too small for a 15 TB dataset. So I'd say: not a bug - insufficient memory.
Please re-read the report. This is not an issue of the system being merely slow. It deadlocks, it is unable to complete transactions, etc. The DDT is not magic, it's a tree that happens to reside on disk. The destroy operations at least already know how to span transactions ("background freeing"), in theory, so there's nothing fundamentally incorrect about wanting them to span more transactions so that each transaction completes faster and in less memory.
The DDT is NOT supposed to reside on disk! It is supposed to reside in memory entirely. If it spills to disk, everything almost stops. From your report I'd say that's exactly what happened. Also, the fact that the issue was solved after moving the disks to a system with more memory points to a memory issue... Your `zpool status -D` output says: "DDT entries 57963728, size 1193 on disk, 168 in core". Translated: 57963728 × 1193 bytes ≈ 64 GB of DDT on disk. See issue #2414 for the same problem and how to calculate it. As you can read all over the internet, the dedup feature must be used with care and only if the DDT fits entirely in memory. In your case it doesn't. It is no wonder your system became unresponsive with many small I/Os during the deletion.
... what?! The DDT is part of the on-disk data structure. It's cached in ARC/L2ARC like everything else, but it must be persisted to disk. I know, obviously, that having more of it in RAM will make operations faster. But I am emphatically not asking for this operation to go faster. I am asking for it not to CRASH THE MACHINE and make the pool unimportable.
Yes, the DDT is of course stored on disk. But in your case (see my updated post) it doesn't fit into memory... your memory is WAY too small. When you delete data, the system must iterate over the whole DDT and delete the references of every data block that is supposed to be deleted. If the DDT is in memory this is still relatively slow, but it will work. If it is on disk and does not fit into memory (like in your case), the pool has a hard time finding the entries and deleting them... that's where your timeouts come from. A DDT that does not fit into memory is a NO GO! You can use Google to find out what happens with dedup, ZFS, and insufficient memory. You will find posts about the same behaviour you see all over the place.

Edit: Your pool was most likely not really unimportable. It was just performing the destroy you asked for. And as it had to fetch almost all the data from disk, this took quite a while. You said you've seen a lot of small I/Os... that was the lookup of the DDT entries.
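For anyone following along, the arithmetic behind the "64 GB" figure above is easy to reproduce. This is only a throwaway sketch; the entry count and per-entry sizes are the ones quoted from `zpool status -D` in this thread:

```c
#include <stdio.h>

int main(void)
{
	/* Figures quoted from the `zpool status -D` output in this thread. */
	const unsigned long long entries      = 57963728ULL;
	const unsigned long long bytes_ondisk = 1193ULL;  /* per-entry size on disk */
	const unsigned long long bytes_incore = 168ULL;   /* per-entry size in core */

	printf("DDT on disk: %.1f GiB\n",
	    (double)(entries * bytes_ondisk) / (1ULL << 30));
	printf("DDT in core: %.1f GiB\n",
	    (double)(entries * bytes_incore) / (1ULL << 30));
	return 0;
}
```

Note that even the in-core representation (roughly 9 GiB at 168 bytes per entry) already exceeds the 8GB of RAM in the machine described in this report.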
I understand that the DDT needs to have its reference counts decremented. Do you understand that I would be absolutely OK with this operation taking weeks and lingering in the ZIL as a background thread? I am fine with a solution that decrements one block's reference count per ZFS transaction and keeps the transaction system flowing. The only thing, in fact, that I am requesting is that it not try to do so much work at once that it runs the system out of memory and ends up with gigabytes of dirty data in memory that get spilled back to disk very slowly, stalling the ZFS I/O pipeline. I could give up, yes. But, instead, I was asked to file a bug by one of the authors of this software. I believe we are at an impasse and will wait for someone else to weigh in.
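To make the request concrete, here is a minimal sketch of the bounded-work behaviour being asked for. Every name in it is invented for illustration (it is not the ZoL free path); the only point is that the background-free traversal can stop after a capped amount of work per transaction and resume in the next txg, so no single transaction accumulates gigabytes of dirty DDT updates:

```c
#include <stdint.h>

#define	FREE_BLOCKS_PER_TXG	100000	/* hypothetical per-txg work budget */

typedef struct free_list free_list_t;	/* queue of blocks pending free */
typedef struct blkptr blkptr_t;		/* stand-in for the real blkptr_t */
typedef struct dmu_tx dmu_tx_t;		/* stand-in for the real dmu_tx_t */

extern blkptr_t *free_list_next(free_list_t *fl);		/* hypothetical */
extern void ddt_decrement_ref(blkptr_t *bp, dmu_tx_t *tx);	/* hypothetical */
extern void free_block(blkptr_t *bp, dmu_tx_t *tx);		/* hypothetical */

/* Called once per txg sync; leftover work waits for the next txg. */
static void
background_free_sync(free_list_t *fl, dmu_tx_t *tx)
{
	uint64_t done = 0;
	blkptr_t *bp;

	while (done < FREE_BLOCKS_PER_TXG &&
	    (bp = free_list_next(fl)) != NULL) {
		ddt_decrement_ref(bp, tx);	/* drop the dedup reference */
		free_block(bp, tx);		/* release the block itself */
		done++;
	}
}
```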
@ronnyegner I disagree. Deduplication should never result in the system deadlocking. If it deadlocks, there is a bug.

@nwf This needs more analysis before we run down the issue, but I am willing to take a stab in the dark. There is a class of issues involving direct reclaim that can deadlock the codebase. A modern example in the current codebase is that direct reclaim can deadlock the spacemap loading code if it triggers writes to zvols. A historical example involves atime updates, which the kmem rework in 0.6.4 fixed. Issues in this class are known to affect swap on zvols, but there is no reason why they cannot affect other things too.

Last week, I succeeded in identifying a workaround that eliminates the entire class on recent kernels. It is a blanket application of the kmem rework to disable reclaim on taskq threads in the SPL. If I recall correctly, this relies on kernel functionality introduced around Linux 3.17 and is less reliable on older kernels. This will likely not be merged as-is without additional work because @behlendorf is in favor of attempting a more surgical approach than the blanket one I took. However, if you do not mind being a guinea pig, the patch is here: https://bpaste.net/show/839d038d372c

Keep in mind that this is a stab in the dark. It works around the only class of issues that I know to exist in the code that might cause your problem. There is no guarantee that it applies to your situation because there could be unknown issue classes.
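For readers who want a feel for the general idea without fetching the paste, here is a rough sketch. It is not the actual patch: the `taskq_work` struct and the helper are made up for illustration, and the real SPL change may use a different mechanism than the kernel's memalloc_noio flags shown here:

```c
/*
 * Illustration of the "don't let taskq threads recurse into reclaim" idea.
 * memalloc_noio_save()/memalloc_noio_restore() are stock Linux kernel APIs
 * (declared in <linux/sched.h> on ~3.16-era kernels, <linux/sched/mm.h> on
 * newer ones).  While the flag is set, any allocation this thread performs is
 * treated as GFP_NOIO, so direct reclaim cannot issue filesystem/zvol I/O
 * from this context and deadlock against it.
 */
#include <linux/sched.h>

struct taskq_work {			/* hypothetical stand-in for taskq_ent_t */
	void (*tw_func)(void *);
	void *tw_arg;
};

static void
taskq_thread_run_one(struct taskq_work *tw)
{
	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();	/* suppress reclaim-driven I/O */
	tw->tw_func(tw->tw_arg);		/* run the queued work item */
	memalloc_noio_restore(noio_flags);	/* restore previous allocation mode */
}
```

A more surgical version of the same idea, as suggested above, would wrap only the specific allocation sites that can recurse rather than every taskq thread.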
Encouragingly, I have been beating the snot out of this machine, removing large files and so on... and the deadlocks are gone. The I/O stalls are still there, but I haven't had it lock up on me yet!

To help with the I/O stall issue, after beating on the machine for a while, I've decided to set [...]. Continuing my earlier hypothesis, if it happens that there's a bunch of other write traffic, which there has been in general on this machine, then my artificially low [...]
I'm going to try testing with nwf/zfs@66dbeba and a larger [...].

I don't think this is the right approach -- I'd rather hook the [...]
@nwf Interesting change! I wonder how that would affect latency (e.g. for desktop usage) during big data transfers (or deletes).
They are triggering possible recursive locking warnings from lockdep on Linux.
See #5449 and openzfs/openzfs#214 for upstream's similar work!
... and #5706. Given the dramatic improvement I saw with my hack above and the similar approach taken here (with much less of a hackish feel), I think this can be closed.
How does ZFS decide how much of the DDT to store in core, and how much of it on disk? Based on [...]?
@rihadik Please don't commit thread necromancy. This is also not the correct venue for that kind of question.
Salutations all.
I've been running ZFS and ZoL for a long while now and have been seeing this issue in several guises, but recently it came to a head and @ryao told me to file a bug, so here I am.
We have a box running ZoL zfs-0.6.4-184-g6bec435, whose `uname -a` output is `Linux chicago.acm.jhu.edu 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux`. This box has 8GB of RAM, 16GB of disk swap, and a Core2Duo E8400 @ 3GHz, root on a ZoL pool (`r`), and a separate 15TB ZoL pool (`z`) which holds most of its data. Both pools use ashift=12. There are no ZVOLs to be seen.

Parts of `z` have `sha256` checksums, dedup (`sha256,verify`), and compression (`lz4`) turned on. Recently, we had cause to destroy a snapshot of such a filesystem. This snapshot was holding on to nearly 500GB of data (the filesystem had almost entirely diverged from its snapshot, I think; dedup makes this number more complicated to actually compute). In any case, ZFS accepted the `zfs destroy` command for the snapshot and began doing its thing. After a little while, the ARC `anon_size` approached 1GB, with the disks doing very slow (256K/sec?) seeky operations (DDT?), and the machine paused -- dropped network connections and all that -- for several minutes at least. When I got back in, `anon_size` was decreasing at a rate of about 1MB/sec with the disks again appearing to be spending a lot of time seeking. Soon, though, it shot back up to 1GB or so and the machine paused again... never to return.

When cycled and told to import the `z` zpool, the machine would repeat this cycle several times; it would be fine, a large transaction (I think) would get built in memory, trip some threshold, get slowly put to disk in a way that stalled almost all other (disk? VFS? ZFS?) I/O on the machine, and then this would happen again and the machine would lock up (ETA: we waited a week or more in one case because no one was around to cycle it and it wasn't on remote power control). `zdb -l` showed that the in-label txg counters were increasing while this was happening (or at least, after imports); regretfully, I was not able to get `zdb -i` output from the machine.

Ultimately, we pulled the disks from the machine and put them into another machine with ZoL (I don't recall the exact version; almost certainly master from some time this summer) and 16GB of RAM and let it import the pool to finish the transaction(s). This finished, and we have removed the disks from the surrogate machine and put them back in the original machine, which is blissfully running along with only small, brief I/O hiccoughs at the moment, almost all triggered by large unlink() operations.

While https://www.illumos.org/issues/5911 sounds like it may help the problem, and surely describes some shared symptoms, I am not sure it is the root cause. As a tentative hypothesis, some transaction-sizing logic may not be accounting for the size of mutations of the DDT. Or perhaps the DDT ZAP itself is in some way buggy, causing size estimations to be wrong (`zpool status -D` says "DDT entries 57963728, size 1193 on disk, 168 in core", which sure sounds like there's a lot of padding in the DDT ZAP, but maybe that's intentional).

I regret not having much more concrete information to offer, and a slight unwillingness to deliberately trigger this again, as I do not have machines that I don't mind crashing like this. When (not if, I'm afraid) it happens again, what should I capture to try to shed some light on the issue?
Thanks for reading this far,
--nwf;