Large Deletes & Memory Consumption #6783
Deleting that much data in a single go with dedup enabled is known to cause issues like this and to be quite memory- and time-hungry. AFAIK this is not really an issue, as there are plenty of warnings in the documentation regarding deduplication, and I also recall a few instances in which pools had to be moved to a platform with a greater amount of RAM to successfully finish such an operation. I'd suggest disabling the L2ARC at least temporarily, to drop the need to keep its metadata in RAM, and possibly bumping up the amount of RAM in the system. |
Some updates. There is no L2ARC, my mistake. It used to have one but it was removed. Blocking the mount point did indeed allow the import to complete in about an hour with minimal memory usage (about 9GBs). There does not appear to be any pending IO after several hours. It is accepting zdb commands and zpool modification commands. At this point I am not sure what the deal is; this is really odd! As for memory consumption and dedup, yep, I am aware. However, I have 81.2M allocated blocks, so the entire DDT should be under 29GBs. Even if you use the referenced block count (102M) and 512 bytes rather than 380, that is still under 49GBs. So with 98GBs (88GBs free right now), that should not be an issue. I am more confused by the change in behavior based on the mount state. I am 90%+ confident that if I export, move the folder, and re-import, it will tank the box… In #3725 this was discussed and the issue was supposedly patched. He was using 0.6.4, and after more digging, it sounds like this (#5706) was fixed in zfs-0.7.0-rc4 and may be a starvation issue causing memory to not be freed. I think I am going to have to block the mounting of the drive and try upgrading… But I am still really confused by the change in behavior due to the mount point location being available... |
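As a sanity check on that estimate, here is a minimal shell sketch of the same arithmetic (the block counts and bytes-per-entry figures are the ones quoted above; treat them as rough rules of thumb, not exact in-core sizes):

```sh
# DDT RAM estimate from the numbers in the comment above
echo "81200000 * 380 / 1024^3" | bc -l    # ~28.7 GiB: 81.2M allocated blocks @ ~380 B/entry
echo "102000000 * 512 / 1024^3" | bc -l   # ~48.6 GiB: 102M referenced blocks @ 512 B/entry
```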
Alright, it looks like the pending transaction is processed on mount and not on import. So that explains that... Looking at the upgrade-from-source process. :/ |
Installed 0.7.2.3, started the mount several hours ago. Looks like we are on our way to a lockup, but I will get one last look at it tomorrow morning. Here is data from my current slabtop: slabtop -o -ss |
And from /proc/spl/kstat/zfs/arcstats:
|
No one wants to chime in on this? This has been a known bug in the past. A 1TB delete really should not be an issue with this much memory... This last run has been going for about 24 hours (the last one died around 10 hours in). It's still crunching along at the moment, with memory right where I would expect it. I am using 24GBs of ARC for 81.2M DDT entries, so that matches the math: 81.2M * 320 / 1024^3 = 24.2GBs. arc_summary:
|
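That figure is consistent with the per-entry estimate; a quick check (assuming the ~320 bytes per core DDT entry figure used in the comment above):

```sh
echo "81200000 * 320 / 1024^3" | bc -l    # ~24.2 GiB of ARC for 81.2M DDT entries
```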
The math might not be as simple as |
Yep, totally understand. From everything I have read, and what the devs have stated here, ZFS should slow to a crawl, not crash the box (since it needs to constantly go to disk for every metadata op). And it's so common for ZFS boxes to be under-built on RAM, with folks then asking why it doesn't work, that I tried to take the max possible usage and then double it (really, it just worked out that way). This is a prototyping box for other builds, so I am less interested in getting it back online than in either speccing the system properly or fixing whatever the issue is. Really, I can probably get this box back online just by rebooting it a bunch / adding a bunch of RAM (I think). Last night, shortly after making that post, consumption spiked rapidly and tanked the box. So, what I am going to do is install Splunk and see if I can come up with a script to watch all of the key details. I am still not sure exactly what is going nuts at the end, so I will try and watch the slab, arcstats, iostats, and anything else I can think of. Some of this will take some work on my part due to formatting. Any recommendations on what I should be capturing? Also, it looks like my tunables got chopped in editing on my last post! OOPS!
|
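Not a Splunk setup, but as a starting point, here is a rough capture-loop sketch along those lines (pool name, log path, and interval are placeholders to adjust):

```sh
#!/bin/sh
# Periodically snapshot slab usage, ARC stats, and pool I/O to one log file.
POOL=tank                        # placeholder: actual pool name
LOG=/var/log/zfs-capture.log     # placeholder: where to accumulate samples
while true; do
    date +%FT%T
    slabtop -o | head -n 40              # one-shot slab report, top caches only
    cat /proc/spl/kstat/zfs/arcstats     # ARC counters, including DDT-related sizes
    zpool iostat -v "$POOL"              # per-vdev I/O summary
    echo '----'
    sleep 60
done >> "$LOG" 2>&1
```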
One of the clues here is the 12537657 objects in

I suggest that anyone contemplating dedup view Matt's talk at https://www.youtube.com/watch?v=PYxFDBgxFS8 (slides at http://open-zfs.org/w/images/8/8d/ZFS_dedup.pdf). One of the key points is that "on-disk hash tables suck". The ZAPs, which are the structures in which the dedup tables are stored, are a glorified hash table, but much more complex.

Here's a flame graph that demonstrates how much time is spent spinning on locks when deleting a meager 100GiB (approx) deduped file: https://gist.github.com/dweeezil/d663bb371cb927a4f3ccc4d124effbd3

I'd also like to reference #6823 and any other issue which involves deleting large files with dedup enabled. Without turning this into a full-fledged wiki page, here are a few suggestions for anyone wanting to deploy dedup. First off, giant files can be a real problem; if your application involves huge files, consider whether dedup will buy you anything at all. Large blocksizes can help because they reduce the number of entries in the dedup ZAPs for large files. Dedup can be very useful in certain circumstances, but it requires a great deal of understanding to determine what those actually are.

Finally (almost), better dedup will probably happen (see Matt's talk). It does seem that something could be done to help the delete situation with the current dedup implementation. A new delete throttle to prevent too many zios from piling up seems like it would help quite a bit. |
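One practical addition to those suggestions: the DDT's size and reference counts can be inspected with zdb before (or while) relying on dedup. A hedged sketch (pool name is a placeholder; repeating -D increases detail):

```sh
zdb -D tank     # dedup statistics summary for the pool
zdb -DD tank    # more detail, including the dedup table histogram
```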
Sorry for the long delay.. :( Thanks for the info. I have tried to educate myself as best as possible and know of the extreme memory requirements to keep things moving even slowly, and that not having enough can drag your system to a crawl. Been there, done that: 50KBps of IO to a 9-spindle VDEV (I was intentionally trying to break it). Just to be clear, I am not seeing extremely heavy load (everything is golden until the crash). This is hard-locking the box, as in there is no disk activity (triple-checking that now), you cannot ssh into it, and the kernel goes through and kills all threads (effectively panicking the box). It is my understanding that this is not expected behavior. |
Confirmed, zero disk activity. |
Looks like I see a small amount of write IO just before the lock:
|
Makes sense, but I was under the impression that @behlendorf added a fix in #5449 and #5706 in 0.7.0-rc4 for this scenario.
But that is just my interpretation, I may be reading it wrong. :( |
Alright, some interesting developments… I was ready to give up and just move on. Before I could get some additional memory for the server, I wanted to get some files off (as it would be a while). To do so, I blocked the mount, imported the pool, and marked it as read-only. I then unblocked the mount and ran the mount command. Interestingly, it started doing HUGE amounts of writes! I am not sure exactly how much, but over the course of an hour I would guess somewhere around a GB worth of writes… After an hour, the file system mounted… Alright… wasn't what I would have expected… So I said WTH, set the pool to R/W, and waited. Zero IO… Hmmm… K… So I rebooted the box to clear any cache. I have auto-import disabled, so once the system was back, I ran the import command. It took a few minutes, but the system mounted! But… it started doing the IO churn again in the background; however, this time space is slowly being freed and I am hovering around 30% memory usage after several hours, with about 600GBs of the 1TB delete freed. There is SOMETHING up with the delete logic... |
Maybe it cleared because I upgraded but was in a bugged state until then??? IDK. |
Once upon a time, I crudely forced the system's hand by forcing a txg commit periodically during large deletes: nwf@66dbeba ; the official fix 194202e uses a percentage based threshold rather than just counting, but amounts to something similar. Clamping down on the number of frees in a txg for dedup'd data means decreasing the number of DDT lookups and mutations that must be done in sync phase, which helped me survive exactly the situation described here. I'd have been curious to know how things would have evolved if you'd set zfs_per_txg_dirty_frees_percent to "absurdly small" (e.g., 1) and clamped down on zfs_dirty_data_max (down from its current 10G). (It'd be better, of course, to have a DDT-aware estimate of the amount of work to be done per free operation, which I don't believe is present.) |
I see where to set zfs_per_txg_dirty_frees_percent, but not zfs_dirty_data_max; I will keep looking and re-compile if/when I find it. I will start with just zfs_per_txg_dirty_frees_percent for now and let you know (compiling now). More interesting notes: it eventually hard locked last night. If I block the mount, mark the pool RO, then mount (now taking just a few seconds) and then mark it R/W, everything works perfectly! I wrote a small file to the volume, exported the pool, and re-imported using this method. The file is there… If I import the volume WITHOUT blocking the mount while the volume is in R/W mode, it goes right into the "loop" and eventually locks the box. But it IS mounting now (before, it was blocked). If I put the volume into RO before exporting, it also re-imports immediately without issue. If I attempt to put the volume into RO mode while it is in this state, the command hangs and never completes (appears to, at least; not going to wait until it locks). |
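For reference, roughly the same effect as blocking the mount point directory can probably be had with documented import/mount options. A hedged sketch (pool/dataset name is a placeholder, and this assumes the dataset-level readonly property is what is being toggled here):

```sh
zpool import -N tank          # import the pool without mounting any datasets
zfs set readonly=on tank      # bring the dataset up read-only first
zfs mount tank
# ...inspect or copy data off...
zfs set readonly=off tank     # flip back to read/write when ready
```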
You shouldn't need to recompile; these are tunable at module insertion time (and perhaps more generally during system execution). |
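For example (a sketch only; the values shown are illustrative and not the ones used in this thread), the parameters can be poked at runtime through /sys or set persistently in modprobe.d:

```sh
# Runtime (takes effect immediately if the parameter is writable):
echo 1          > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max    # 4 GiB, illustrative

# Persistent, applied when the zfs module loads (/etc/modprobe.d/zfs.conf):
#   options zfs zfs_per_txg_dirty_frees_percent=1 zfs_dirty_data_max=4294967296
```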
OH! |
Changed, rebooting, will watch... /etc/modprobe.d/zfs.conf
|
We are back to the not mounting behavior. :/ So, I let it run for a while (maybe an hour) and zpool iostat reported zero writes. Should I let it continue? Anything I can be watching? |
When I tripped over this problem myself, I saw lots of random reads (for bits of the DDT) saturating the disk's IOPS. Writes will be "bursty" as the iteration through the DDT happens (in sync phase, I think) and are then all flushed to disk at once. The goal of tamping down on dirty_frees_percent is to let each of these syncs have smaller working sets. It's entirely possible, on a machine with gobs of RAM, that even 1% is too much, but I don't know. For my hacked patch I think I let through very few deletes per TXG, as proof of concept. |
I let it run, and I got zero writes until just before it hard locked, and then a sudden burst of writes. In fact, once it goes into this state, I haven't found any combination that generates any writes until the very last second. |
That matches my experience: the deletes translate into changes to the DDT, which, being a hash table, is accessed randomly, resulting in a very seek-y, read-only workload while changes accumulate in RAM and then are all flushed out at once when the transaction commits. This will happen repeatedly for every transaction group that commits. You should be able to see this effect, too, by tracking anon_size in /proc/spl/kstat/zfs/arcstats (I think). When you say "hard locked and then", I am confused; do you just mean that it experienced a (brief?) pause (of I/O?) before beginning a large series of writes? Usually one uses "locked" in this context to mean a more permanent state of affairs, a la dead- or live-locked. If the stalls are still too much for you, further tamping down on the amount of dirty data permitted per transaction may be useful, as might changing the denominator of dirty_frees_percent from 100 to 1000 or larger, to really limit the number of DDT mutations that can arise from deletions. I have found that, even ignoring the DDT issues, ZoL (at least as of the 0.6.5 series) is wildly optimistic about how many IOPS my disks can actually achieve, and so tends to be a little stop-and-go when things start to saturate. |
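A small sketch for tracking that value over time (assumes the usual three-column name/type/data layout of arcstats):

```sh
# Print a timestamped anon_size sample every 5 seconds
while sleep 5; do
    printf '%s ' "$(date +%T)"
    awk '$1 == "anon_size" { print $3 }' /proc/spl/kstat/zfs/arcstats
done
```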
Again, apologies for the extended delay. It's been a hard month... Here is what happens, as best as I can tell (when importing in R/W mode):
From stage 2 to stage 5 is less than 5 minutes, probably closer to 60 seconds. Some more information on what I am seeing right now: if I bring the pool in as R/O and switch it to R/W, I can perform deletes. These deletes trigger the normal delete process without issue. I have observed, with the changes made so far, that it will consume upwards of 20GBs of RAM purging a small batch of "normal"-sized files. This memory is returned abruptly once completed. If I stop and let everything quiet down, set the pool to R/O, export, and re-import, the freed space is kept. So, deletes are working! I am guessing there is a "stuck" transaction group that wants to roll out when I import in R/W mode. Is there any way to break up this pending commit into smaller chunks, or should it be doing this already? Or perhaps trigger it manually after bringing it online? Does the logic to process these commits differ from startup to normal run time? I am confused about zfs_per_txg_dirty_frees_percent: should I be setting it low or high? Right now I have set it very low, to:
|
Larger |
@gmelikov The desire to minimize dirty_frees_percent, and dirty data in general, is an attempt to force a DDT-modification-heavy workload across as many txgs as possible, so that the amount of DDT paged back in, and dirtied, per txg is small. The DDT mutations are not properly accounted during frees, and so will act as a large amplification factor between ZFS's estimate of disk traffic and the actual traffic. Right now, the system is running out of memory attempting to perform a transaction, and I think the culprit is all the DDT mutations; it has been, at least, in my experience with such things. In light of that aim, do you still think that maximizing dirty_data_max and frees_percent is the correct action? @BloodBlight It sounds like import might be attempting to replay the ZIL all at once; maybe dirty_frees_percent has no effect on ZIL replay, which may be why you OOM on import. I don't know if it's possible to force ZIL replay across several txgs. |
I don't have a ZIL, but I still assume there is a replay log of some sort happening here (the pending delete). Would that still be called a ZIL (for my future reference)?
I am not 100% sure how to interpret the differences in the sizes here (it probably doesn't help that I extended the volume during this), but I would assume there is somewhere between 260GB and 1.8TBs of data to process. Alright, that is what I was fearing. From one stance it makes sense to process deletes at mount time, but processing deletes before mounting the volume also seems problematic. I would understand if it was required to replay everything linearly for consistency, but because I can still mount and write to the volume, I can be fairly sure that isn't the case. Seems like this could be improved. At the very least, if it is required, it should be done in the most memory-efficient way possible (even if extremely slow), with some way to monitor progress. This box is now slated to be re-purposed and I plan to migrate the data off, but I still have some time with it (a week or so maybe). Should we continue, or just call it? |
What I think you don't have is a "separate log" device. The ZIL is intrinsic to the operation of ZFS. You might see if "zpool get freeing" says something and, notably, if it's decreasing in all the attempts at import. As a last-ditch effort, you might try cherry-picking nwf@66dbeba. I cannot guarantee that it lands cleanly these days, but the changes are pretty straightforward if not. The goal there is to really restrict how many deletions can be done in a txg: set zfs_dmu_free_long_range_yield to a tiny value (like 1000?) before importing and see if that helps? |
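A quick way to watch that property across an import attempt (pool name is a placeholder); if the deferred frees are making progress, the value should trend downward:

```sh
while sleep 60; do
    date +%T
    zpool get freeing tank    # space still queued to be freed asynchronously
done
```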
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
Looks like I never responded. Not sure if this should die or not. To the best of my knowledge, this is still an issue. |
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline. Some blocks, however, require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times on a single transaction (e.g. a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.

This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks. For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722
Closes openzfs#16697
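The sizing argument in that commit message checks out; a quick sketch of the arithmetic (object size, block size, and zio_t size are the figures given in the message):

```sh
echo "2 * 2^40 / (128 * 2^10)" | bc       # 16777216: blocks in a 2 TiB object of 128 KiB blocks
echo "16777216 * 1280 / 2^30" | bc -l     # 20 GiB of zio_t allocations if every free needs a FREE IO
```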
First time posting to GitHub, be gentle. :)
System information
Other Config Information
Trigger
Delete a large file 1TB+
Issue
System will slowly consume all memory over the course of several hours (about 12) and hard lock. This happens both after the delete and while importing zpool on reboot.
I have had this happen before, I added a 32GB swap file (on SSD) and that seemed to help. It eventually cleared up after several attempts to reboot (took about two weeks, 12 hours a pop). I made the assumption that the delete was working, but something was causing the memory to not be released. So eventually...
This time I booted off of a live boot USB, added zfs-utils and I was surprised that not only did it attempt to mount the zpool right away (while in apt), but after about an hour it succeeded!
I thought “Cool, it cleared!” and rebooted. No go, 12 hours later, out of memory and locked (still at the boot screen with an out of memory error).
Alright, booted back into the USB stick, again, hung for about an hour, then booted! “Alright, that’s odd.”
At this point I noticed that the mount point for the tank was already taken and I could not access the volume. So I exported the zpool; it took a bit but completed. I moved the folder and re-mounted, then watched the memory slowly climb and the box lock after 12 hours.
I moved the USB boot to another system and removed ZFS. I now have the box booted again, re-blocked the mount point, and have just re-installed ZFS. I am waiting for the mount to complete. I am hoping it will complete in an hour or so.
FYI, I will be on vacation for several days and unable to access the server after tomorrow.
What else should I grab, as I am limited in what I can get right now? Is this a known issue? Should I go to a newer build?
I have looked at several other open and closed issues including:
#3725
#5706
#5449
#3976
#5923