
dedup and large unlink()s or 'zfs destroy' can cause stalls, deadlocks, and other badness #3725

Closed
nwf opened this issue Sep 1, 2015 · 14 comments
Labels
Type: Performance Performance improvement or performance problem

Comments


nwf commented Sep 1, 2015

Salutations all.

I've been running ZFS and ZoL for a long while now and have been seeing this issue in several guises, but recently it came to a head and @ryao told me to file a bug, so here I am.

We have a box running ZoL zfs-0.6.4-184-g6bec435, whose uname -a output is Linux chicago.acm.jhu.edu 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) x86_64 GNU/Linux. This box has 8GB of RAM, 16GB of disk swap, and a Core2Duo E8400 @ 3GHz, root on a ZoL pool (r), and a separate 15TB ZoL pool (z) which holds most of its data. Both pools use ashift=12. There are no ZVOLs to be seen.

Parts of z have sha256 checksums, dedup (sha256,verify), and compression (lz4) turned on. Recently, we had cause to destroy a snapshot of such a filesystem. This snapshot was holding on to nearly 500GB of data (the filesystem had almost entirely diverged from its snapshot, I think; dedup makes this number more complicated to actually compute). In any case, ZFS accepted the zfs destroy command for the snapshot and began doing its thing. After a little while, the ARC anon_size approached 1GB, with the disks doing very slow (256K/sec?) seeky operations (DDT?), and the machine paused -- dropped network connections and all that -- for several minutes at least. When I got back in, anon_size was decreasing at a rate of about 1MB/sec with the disks again appearing to be spending a lot of time seeking. Soon, though, it shot back up to 1GB or so and the machine paused again... never to return.

When cycled and told to import the z zpool, the machine would repeat this cycle several times; it would be fine, a large transaction (I think) would get built in memory, trip some threshold, get slowly put to disk in a way that stalled almost all other (disk? VFS? ZFS?) I/O on the machine, and then this would happen again and the machine would lock up (ETA: we waited a week or more in one case because no one was around to cycle it and it wasn't on remote power control). zdb -l showed that the in-label txg counters were increasing while this was happening (or at least, after imports); regretfully, I was not able to get zdb -i output from the machine.

Ultimately, we pulled the disks from the machine and put them into another machine with ZoL (I don't recall the exact version; almost certainly master from some time this summer) and 16GB of RAM and let it import the pool to finish the transaction(s). This finished and we have removed the disks from the surrogate machine and put them back in the original machine, which is blissfully running along with only small, brief I/O hiccoughs at the moment, almost all triggered by large unlink() operations.

While https://www.illumos.org/issues/5911 sounds like it may help the problem, and surely describes some shared symptoms, I am not sure it is the root cause. As a tentative hypothesis, some transaction-sizing logic may not be accounting for the size of mutations of the DDT. Or perhaps the DDT ZAP itself is in some way buggy, causing size estimations to be wrong (zpool status -D says "DDT entries 57963728, size 1193 on disk, 168 in core" which sure sounds like there's a lot of padding in the DDT ZAP, but maybe that's intentional).

I regret not having much more concrete information to offer and a slight unwillingness to deliberately trigger this again, as I do not have machines that I don't mind crashing like this. When (not if, I'm afraid) it happens again, what should I capture to try to shed some light on the issue?

Thanks for reading this far,
--nwf;

@ronnyegner

Hi,

Deduplication uses a LOT of memory. I would say that even 8 GB for a 500 GB dataset is too small; it's definitely too small for a 15 TB dataset.
The moment you delete (or destroy) something, the dedup table must be cleaned of references to the freed blocks. This takes a lot of CPU power and is time consuming. If, in addition to that, the DDT has spilled to disk (due to insufficient memory), everything gets even slower ("crawling slow").

So I'd say: not a bug, just insufficient memory.


nwf commented Sep 1, 2015

Please re-read the report. This is not an issue of the system being merely slow. It deadlocks, it is unable to complete transactions, etc. The DDT is not magic, it's a tree that happens to reside on disk. The destroy operations at least already know how to span transactions ("background freeing"), in theory, so there's nothing fundamentally incorrect about wanting them to span more transactions so that each transaction completes faster and in less memory.

@ronnyegner

The DDT is NOT supposed to reside on disk! It is supposed to reside entirely in memory. If it spills to disk, everything almost stops. From your report I'd say that's exactly what happened. The fact that the issue went away after moving the disks to a system with more memory also points to a memory issue.

From your "zpool status -D" output:

DDT entries 57963728, size 1193 on disk, 168 in core

Translated:

57963728 × 1193 bytes ≈ 64 GiB of DDT on disk, and
57963728 × 168 bytes ≈ 9 GiB of DDT in core.

See Issue #2414 for the same and how to calculate.
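
For reference, here is the same back-of-the-envelope math as a small sketch (the entry count and per-entry sizes are simply copied from the zpool status -D line quoted above; nothing here queries the pool):

```python
# Rough DDT footprint estimate from the quoted "zpool status -D" summary.
entries = 57963728              # "DDT entries 57963728"
bytes_on_disk_per_entry = 1193  # "size 1193 on disk"
bytes_in_core_per_entry = 168   # "168 in core"

GiB = 1024 ** 3
print(f"DDT on disk: {entries * bytes_on_disk_per_entry / GiB:.1f} GiB")  # ~64.4 GiB
print(f"DDT in core: {entries * bytes_in_core_per_entry / GiB:.1f} GiB")  # ~9.1 GiB
```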

As you can read all over the internet, the dedup feature must be used with care and only if the DDT fits entirely in memory. In your case it doesn't. It is no wonder your system became unresponsive, with many small I/Os, during the deletion.


nwf commented Sep 1, 2015

... what?! The DDT is part of the on-disk data structure. It's cached in ARC/L2ARC as with everything else, but it must be persisted to disk. I know, obviously, that having more of it in RAM will make operations faster. But I am emphatically not asking for this operation to go faster. I am asking for it not to CRASH THE MACHINE and make the pool unimportable.

@ronnyegner

Yes, the DDT is of course stored on disk. But in your case (see my updated post) it doesn't fit into memory... your memory is WAY too small.

When you delete data, the system must look up and update the DDT entry for every block being freed. If the DDT is in memory, this is still relatively slow but it will work. If it is on disk and does not fit into memory (as in your case), the pool has a hard time finding the entries and updating them... that's where your timeouts come from. A DDT that does not fit into memory is a NO GO!

You can use Google to find out what happens with dedup, ZFS, and insufficient memory. You will find posts all over the place describing the same behaviour you are seeing.

Edit: Your pool was most likely not really unimportable. It was just performing the destroy you asked for, and since it had to fetch almost all of that data from disk, this took quite a while. You said you've seen a lot of small I/Os... that was the lookup of the DDT entries.

http://open-zfs.org/wiki/Performance_tuning#Deduplication


nwf commented Sep 1, 2015

I understand that the DDT needs to have its reference counts decremented.

Do you understand that I would be absolutely OK with this operation taking weeks and lingering in the ZIL as a background thread? I am fine with a solution that decrements one block's reference count per ZFS transaction and keeps the transaction system flowing. The only thing, in fact, that I am requesting is that it not try to do so much work at once that it runs the system out of memory and ends up with gigabytes of dirty data in memory that get spilled back to disk very slowly, stalling the ZFS I/O pipeline.

I could give up, yes. But, instead, I was asked to file a bug by one of the authors of this software.

I believe we are at an impasse and will wait for someone else to weigh in.


ryao commented Sep 1, 2015

@ronnyegner I disagree. Deduplication should never result in the system deadlocking. If it deadlocks, there is a bug.

@nwf This needs more analysis before we run down the issue, but I am willing to take a stab in the dark. There is a class of issues involving direct reclaim that can deadlock the codebase. A modern example of this in the current codebase is that direct reclaim can deadlock the spacemap loading code if it triggers writes to zvols. A historical example involves atime updates, which the kmem rework in 0.6.4 fixed. Issues in this class are known to affect swap on zvols, but there is no reason why they cannot affect other things too. Last week, I succeeded in identifying a workaround that eliminates the entire class on recent kernels. It is a blanket application of the kmem rework to disable reclaim on taskq threads in the SPL. If I recall correctly, this relies on kernel functionality introduced around Linux 3.17 and is less reliable on older kernels. This will likely not be merged as-is without additional work because @behlendorf is in favor of attempting a more surgical approach than the blanket approach I took. However, if you do not mind being a guinea pig, the patch is here:

https://bpaste.net/show/839d038d372c

Keep in mind that this is a stab in the dark. It works around the only class of issues that I know to exist in the code that might cause your problem. There is no guarantee that it applies to your situation because there could be unknown issue classes.


nwf commented Sep 3, 2015

Encouragingly, I have been beating the snot out of this machine, removing large files and so on... and the deadlocks are gone. The I/O stalls are still there, but I haven't had it lock up on me yet!

To help with the I/O stall issue, after beating on the machine for a while, I've decided to set zfs_dirty_data_max to 64M, which is insanely small (the default value was nearly 900M) but it does seem to be helping (not perfectly; I've seen anon_size creep upwards of 200M and I/O stall for 15 minutes or so, but it really is better than it was).
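
(For anyone wanting to try the same tuning, a minimal sketch, assuming Linux with the ZFS module loaded and root privileges; zfs_dirty_data_max is a runtime-writable module parameter and the change does not persist across reboots:)

```python
# Lower zfs_dirty_data_max at runtime through its sysfs module parameter.
from pathlib import Path

param = Path("/sys/module/zfs/parameters/zfs_dirty_data_max")
param.write_text(str(64 * 1024 * 1024))  # 64M, the value mentioned above
print("zfs_dirty_data_max =", param.read_text().strip())
```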

Continuing my earlier hypothesis, dmu_tx_count_free does not appear to have any estimation of the amount of DDT work that will be required per transaction, so a whole lot of predicted-to-be-small transactions can get dmu_tx_assign-ed to a txg and then they all have to work on the DDT. As a hack, maybe dmu_free_long_range could manually dmu_tx_wait (is it safe to do that without calling dmu_tx_assign first? I cannot quite tell from reading it) under control of a module parameter?

If it happens that there's a bunch of other write traffic, which there has been in general on this machine, then my artificially low zfs_dirty_data_max and the general disk I/O will have the limited side-effect of squeezing dmu_free_long_range out into more txgs; but if there isn't write traffic, we could get a txg with an anticipated cost as large as dsl_pool_need_dirty_delay is willing to permit, but whose actual cost is enormous and DDT-dominated.
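
To illustrate the shape of that concern, a toy model (not ZFS code; the per-free costs below are made-up illustrative numbers): if the throttle charges only a tiny estimate per freed block while each free actually dirties a chunk of DDT ZAP, a txg that looks like 64M of dirty data can balloon far past it.

```python
# Toy illustration: a txg that looks cheap to the dirty-data throttle can
# carry a much larger actual cost once DDT updates are counted.
EST_DIRTY_PER_FREE = 512               # assumed: bytes charged per freed block
ACTUAL_DDT_DIRTY_PER_FREE = 16 * 1024  # assumed: DDT bytes actually dirtied per free
DIRTY_DATA_MAX = 64 * 1024 * 1024      # the 64M zfs_dirty_data_max from above

frees = DIRTY_DATA_MAX // EST_DIRTY_PER_FREE
print(f"frees admitted into one txg:  {frees}")
print(f"estimated dirty data:         {frees * EST_DIRTY_PER_FREE >> 20} MiB")
print(f"actual dirty data (with DDT): {frees * ACTUAL_DDT_DIRTY_PER_FREE >> 20} MiB")
```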

@behlendorf added the "Type: Performance" label on Sep 3, 2015

nwf commented Sep 3, 2015

I'm going to try testing with nwf/zfs@66dbeba and a larger zfs_dirty_data_max so that we can still get large writes except when large deletes are happening.

I don't think this is the right approach -- I'd rather hook the dmu_tx_hold_free / dmu_tx_assign machinery directly, but this was an easy, non-invasive place to get my fingers in the control flow. ;)

@kernelOfTruth

@nwf Interesting change!

I wonder how that would affect latency (e.g. for desktop usage) during big data transfers (or deletes).


nwf commented Dec 4, 2016

See #5449 and openzfs/openzfs#214 for upstream's similar work!


nwf commented Feb 1, 2017

... and #5706. Given the dramatic improvement I saw with my hack above and the similar approach taken here (with much less of a hackish feel), I think this can be closed.


rihadik commented Oct 17, 2019

How does ZFS decide how much of the DDT to store in core and how much on disk? Based on vfs.zfs.arc_max, maybe?


nwf commented Oct 17, 2019

@rihadik Please don't commit thread necromancy. This is also not the correct venue for that kind of question.
