Large Deletes & Memory Consumption #6783

Closed
BloodBlight opened this issue Oct 25, 2017 · 34 comments · Fixed by #16722
Labels
Bot: Not Stale (Override for the stale bot) · Component: Memory Management (kernel memory management)

Comments

@BloodBlight

First time posting to GitHub, be gentle. :)

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 16.04 LTS / 16.10 on USB
Linux Kernel Will post soon, current in LTS / 4.8.0-22 on USB
Architecture Intel Xeon
ZFS Version Will post soon, current in LTS / zfsutils-linux 0.6.5.8 on USB
SPL Version Will post soon, current in LTS / ?
RAM: 96GBs of ECC RAM

Other Config Information

  • 2 Mirrored VDEVs for a total of 10TBs usable.
  • L2ARC, metadata only
  • Dedup and compression are enabled. (Ya I know… But I have 96GBs, that should be plenty).
  • About 11TBs of data and about 2TBs free (after dedup and compression).
  • Just a handful of apps and users. No major load on the system (it just hosts files).
  • Exact details to follow once I can get back in.
  • Will post these as soon as I can:
  • modinfo zfs | grep -iw version
  • modinfo spl | grep -iw version

Trigger

Delete a large file 1TB+

Issue

The system will slowly consume all memory over the course of several hours (about 12) and then hard lock. This happens both after the delete and while importing the zpool on reboot.

I have had this happen before; I added a 32GB swap file (on SSD) and that seemed to help. It eventually cleared up after several attempts to reboot (took about two weeks, 12 hours a pop). I made the assumption that the delete was working, but something was causing the memory to not be released. So eventually...

This time I booted off of a live boot USB, added zfs-utils and I was surprised that not only did it attempt to mount the zpool right away (while in apt), but after about an hour it succeeded!

I thought “Cool, it cleared!” and rebooted. No go, 12 hours later, out of memory and locked (still at the boot screen with an out of memory error).

Alright, booted back into the USB stick, again, hung for about an hour, then booted! “Alright, that’s odd.”

At this point I noticed that the mount point for the tank was already taken and I could not access the volume. So I exported the zpool; it took a bit but completed. I moved the folder and re-mounted, then watched the memory slowly climb and lock after 12 hours.

I moved the USB boot to another system and removed ZFS. I now have the box booted again, have re-blocked the mount point, and have just re-installed ZFS. I am waiting for the mount to complete and am hoping it will finish in an hour or so.

FYI, I will be on vacation for several days and unable to access the server after tomorrow.

What else should I grab, as I am limited in what I can get right now? Is this a known issue? Should I go to a newer build?

I have looked at several other open and closed issues including:
#3725
#5706
#5449
#3976
#5923

@MyPod-zz

Deleting that much data in a single go with dedup enabled is known to cause issues like this and to be quite memory- and time-hungry. AFAIK this is not considered a bug, as there are plenty of warnings about deduplication in the documentation, and I also recall a few instances in which pools had to be moved to a platform with more RAM to successfully finish a zfs destroy.

I'd suggest disabling the L2ARC, at least temporarily, to avoid having to hold its metadata in RAM, and possibly bumping up the amount of RAM in the system.

@BloodBlight
Author

BloodBlight commented Oct 25, 2017

Some updates.

There is no L2ARC, my mistake. It used to have one but it was removed.

Blocking the mount point did indeed allow the import to complete in about one hour with minimal memory usage (about 9GBs). There does not appear to be any pending IO after several hours. It is accepting zdb commands and zpool modification commands.

At this point I am not sure what the deal is, this is really odd!

As for memory consumption and dedup, yep, I am aware. However, I have 81.2M allocated blocks, so the entire DDT should be under 29GBs. Even if you use the referenced block count (102M) and 512 bytes per entry rather than 380, that is still under 49GBs. So with 98GBs (right now 88GBs free) that should not be an issue.
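
For reference, these block and DDT-entry counts can be pulled straight from the pool; a minimal sketch, assuming the MediaTank pool name that shows up later in the thread:

# Summarize the dedup table (entry counts, on-disk and in-core bytes per entry)
zdb -D MediaTank

# Same, plus a histogram of reference counts
zdb -DD MediaTank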

I am more confused by the change in behavior based on the mount state. I am 90%+ confident that if I export, move the folder, and re-import, it will tank the box…

In #3725 this was discussed and the issue was supposedly patched. He was using 0.6.4, and after more digging, it sounds like this ( #5706 ) was fixed in zfs-0.7.0-rc4 and may be a starvation issue causing memory to not be freed.

I think I am going to have to block the mounting of the drive and try upgrading… But I am still really confused by the change in behavior due to the mount point location being available...

@BloodBlight
Author

Alright, it looks like the pending transaction is processed on mount and not on import. So that explains that...

Looking at the upgrade-from-source process. :/

@BloodBlight
Author

BloodBlight commented Oct 26, 2017

Installed 0.7.2.3, started the mount several hours ago. Looks like we are on our way to a lockup, but I will get one last look at it tomorrow morning.

Here is data from my current slabtop:

slabtop -o -ss

Active / Total Objects (% used)    : 40680649 / 48957290 (83.1%)
Active / Total Slabs (% used)      : 1613380 / 1613380 (100.0%)
Active / Total Caches (% used)     : 93 / 140 (66.4%)
Active / Total Size (% used)       : 28517271.18K / 33163274.38K (86.0%)
Minimum / Average / Maximum Object : 0.01K / 0.68K / 18.50K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
12537657 8479926 67.00% 1.09K 432333 29 13,834,656 KB zio_cache
3165056 3165053 99.00% 4.00K 395632 8 12,660,224 KB kmalloc-4096
5293908 5293908 100.00% 0.44K 147053 36 2,352,848 KB xfrm_dst_cache
3228096 3228096 100.00% 0.32K 134504 24 1,076,032 KB taskstats
3119283 3119283 100.00% 0.29K 115529 27 924,232 KB dmu_buf_impl_t
2894976 2894970 99.00% 0.25K 90468 32 723,744 KB kmalloc-256
12547190 8490904 67.00% 0.05K 147614 85 590,456 KB ftrace_event_field

And from /proc/spl/kstat/zfs/arcstats:

name type data
hits 4 23699472
misses 4 3308407
demand_data_hits 4 0
demand_data_misses 4 0
demand_metadata_hits 4 23699472
demand_metadata_misses 4 3308398
prefetch_data_hits 4 0
prefetch_data_misses 4 0
prefetch_metadata_hits 4 0
prefetch_metadata_misses 4 9
mru_hits 4 4862763
mru_ghost_hits 4 0
mfu_hits 4 18836709
mfu_ghost_hits 4 0
deleted 4 65
mutex_miss 4 0
evict_skip 4 30
evict_not_enough 4 0
evict_l2_cached 4 0
evict_l2_eligible 4 730112
evict_l2_ineligible 4 0
evict_l2_skip 4 0
hash_elements 4 3201684
hash_elements_max 4 3201684
hash_collisions 4 323909
hash_chains 4 261002
hash_chain_max 4 4
p 4 25328530432
c 4 50657060864
c_min 4 33554432
c_max 4 50657060864
size 4 18326494616
hdr_size 4 1357514440
data_size 4 0
metadata_size 4 16052901376
other_size 4 916078800
anon_size 4 409600
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 5683290112
mru_evictable_data 4 0
mru_evictable_metadata 4 4941213696
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 10369201664
mfu_evictable_data 4 0
mfu_evictable_metadata 4 10369193472
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 18326494616
arc_meta_limit 4 37992795648
arc_meta_max 4 10889914016
arc_meta_min 4 8589934592
arc_need_free 4 0
arc_sys_free 4 1583030272

@BloodBlight
Author

Dead.
[screenshot: 20171026_080510 01]

@BloodBlight
Author

No one wants to chime in on this? This has been a known bug in the past. A 1TB delete really should not be an issue with this much memory...

I have been online for about 24 hours on this last run (last one died around 10 hours in). It's still crunching along at the moment.

Memory right where I would expect it. I am using 24GBs of ARC for 81.2M DDT Entries. So that sounds perfect for the math! 81.2M * 320 / 1024^3 = 24.2GBs
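
For reference, the arithmetic spelled out (320 bytes per in-core DDT entry is the usual rule-of-thumb figure, not something measured on this pool):

# 81.2M in-core DDT entries at ~320 bytes each
awk 'BEGIN { printf "%.1f GiB\n", 81.2e6 * 320 / 1024^3 }'
# -> 24.2 GiB, matching the ~24 GiB of ARC metadata observed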

arc_summary:

ZFS Subsystem Report			Fri Nov 03 19:44:43 2017
ARC Summary: (HEALTHY)
Memory Throttle Count:			0

ARC Misc:
Deleted:				32
Mutex Misses:				0
Evict Skips:				0

ARC Size:				60.65%	28.61	GiB
Target Size: (Adaptive)			100.00%	47.18	GiB
Min Size (Hard Limit):			0.07%	32.00	MiB
Max Size (High Water):			1509:1	47.18	GiB

ARC Size Breakdown:
Recently Used Cache Size:		50.00%	23.59	GiB
Frequently Used Cache Size:		50.00%	23.59	GiB

ARC Hash Breakdown:
Elements Max:				4.40m
Elements Current:			10.21%	449.47k
Collisions:				745.12k
Chain Max:				5
Chains:					5.55k

ARC Total accesses:			88.52m
Cache Hit Ratio:			93.36%	82.65m
Cache Miss Ratio:			6.64%	5.87m
Actual Hit Ratio:			93.36%	82.65m

Data Demand Efficiency:			100.00%	0

CACHE HITS BY CACHE LIST:
  Most Recently Used:			10.33%	8.54m
  Most Frequently Used:			89.67%	74.11m
  Most Recently Used Ghost:		0.00%	0
  Most Frequently Used Ghost:		0.00%	0

CACHE HITS BY DATA TYPE:
  Demand Data:				0.00%	0
  Prefetch Data:			0.00%	0
  Demand Metadata:			100.00%	82.65m
  Prefetch Metadata:			0.00%	0

CACHE MISSES BY DATA TYPE:
  Demand Data:				0.00%	0
  Prefetch Data:			0.00%	0
  Demand Metadata:			98.31%	5.78m
  Prefetch Metadata:			1.69%	99.21k

@MyPod-zz

MyPod-zz commented Nov 4, 2017

The math might not be as simple as 81.2M * 320 / 1024^3 = 24.2GBs. I don't recall whether this applies to the DDT or not, but certain kinds of data can only take up to a certain amount of the ARC by default - e.g. metadata by default goes up to 1/4. Your system might be capped like that, in which case you might want to find the related tunables. Sadly I can't offer help wrt dedup as it's something I've never dealt with.
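
One way to check whether such a cap is in play (a sketch; paths assume the 0.6/0.7-era ZFS on Linux layout):

# Metadata usage vs. the enforced limit and the overall ARC ceiling
grep -E '^(arc_meta_used|arc_meta_limit|arc_meta_max|c_max)' /proc/spl/kstat/zfs/arcstats

# The tunable itself; 0 means "use the built-in default fraction"
cat /sys/module/zfs/parameters/zfs_arc_meta_limit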

@BloodBlight
Author

Yep, totally understand.

From everything I have read, and what the devs have stated here, ZFS should slow to a crawl, not crash the box (as it needs to constantly go to disk for every meta op). And it's so common for ZFS boxes to be under-built on RAM and then folks ask why it doesn't work. So I tried to take the max possible usage and then double it (really, it just worked out that way). This is a prototyping box for other builds, so I am less interested in getting it back online than I am in either speccing the system properly or fixing whatever the issue is. Really, I can get this box back online by just rebooting it a bunch / adding a bunch of RAM (I think).

Last night, shortly after making that post, memory consumption spiked rapidly and tanked the box. So what I am going to do is install Splunk and see if I can come up with a script to watch all of the key details. I am still not sure exactly what is going nuts at the end.

So I will try to watch the slab, arcstats, iostats, and anything else I can think of. Some of this will take some work on my part due to formatting.

Any recommendations on what I should be capturing?
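
For anyone wanting to reproduce this, a minimal capture loop along those lines (a sketch only; the output directory and one-minute interval are arbitrary, independent of whatever Splunk ends up ingesting):

#!/bin/bash
# Snapshot the main ZFS memory counters once a minute
OUT=/var/log/zfs-capture
mkdir -p "$OUT"
while sleep 60; do
    ts=$(date +%Y%m%d-%H%M%S)
    cat /proc/spl/kstat/zfs/arcstats        > "$OUT/arcstats.$ts"
    cat /proc/spl/kmem/slab 2>/dev/null     > "$OUT/spl-slab.$ts"
    slabtop -o -s c | head -n 30            > "$OUT/slabtop.$ts"
    zpool iostat -v 2>/dev/null             > "$OUT/zpool-iostat.$ts"
    free -m                                 > "$OUT/free.$ts"
done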

Also, looks like my tunables got chopped in editing on my last post! OOPS!

ZFS Tunable:  
metaslab_debug_load 0
zfs_arc_min_prefetch_lifespan 0
zfetch_max_streams 8
zfs_nopwrite_enabled 1
zfetch_min_sec_reap 2
zfs_dbgmsg_enable 0
zfs_dirty_data_max_max_percent 25
zfs_arc_p_aggressive_disable 1
spa_load_verify_data 1
zfs_zevent_cols 80
zfs_dirty_data_max_percent 10
zfs_sync_pass_dont_compress 5
l2arc_write_max 8388608
zfs_vdev_scrub_max_active 2
zfs_vdev_sync_write_min_active 12
zvol_prefetch_bytes 131072
metaslab_aliquot 524288
zfs_no_scrub_prefetch 0
zfs_arc_shrink_shift 0
zfetch_block_cap 256
zfs_txg_history 0
zfs_delay_scale 500000
zfs_vdev_async_write_active_min_dirty_percent 30
metaslab_debug_unload 0
zfs_read_history 0
zvol_max_discard_blocks 16384
zfs_recover 0
l2arc_headroom 2
zfs_deadman_synctime_ms 1000000
zfs_scan_idle 50
zfs_free_min_time_ms 1000
zfs_dirty_data_max 10131410944
zfs_vdev_async_read_min_active 12
zfs_mg_noalloc_threshold 0
zfs_dedup_prefetch 0
zfs_vdev_max_active 1000
l2arc_write_boost 8388608
zfs_resilver_min_time_ms 3000
zfs_vdev_async_write_max_active 12
zil_slog_limit 1048576
zfs_prefetch_disable 1
zfs_resilver_delay 2
metaslab_lba_weighting_enabled 1
zfs_mg_fragmentation_threshold 85
l2arc_feed_again 1
zfs_zevent_console 0
zfs_immediate_write_sz 32768
zfs_dbgmsg_maxsize 4194304
zfs_free_leak_on_eio 0
zfs_deadman_enabled 1
metaslab_bias_enabled 1
zfs_arc_p_dampener_disable 1
zfs_object_mutex_size 64
zfs_metaslab_fragmentation_threshold 70
zfs_no_scrub_io 0
metaslabs_per_vdev 200
zfs_dbuf_state_index 0
zfs_vdev_sync_read_min_active 12
metaslab_fragmentation_factor_enabled 1
zvol_inhibit_dev 0
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_cache_size 0
zfs_vdev_mirror_switch_us 10000
zfs_dirty_data_sync 67108864
spa_config_path /etc/zfs/zpool.cache
zfs_dirty_data_max_max 25328527360
zfs_arc_lotsfree_percent 10
zfs_zevent_len_max 384
zfs_scan_min_time_ms 1000
zfs_arc_sys_free 0
zfs_arc_meta_strategy 1
zfs_vdev_cache_bshift 16
zfs_arc_meta_adjust_restarts 4096
zfs_max_recordsize 1048576
zfs_vdev_scrub_min_active 1
zfs_vdev_read_gap_limit 32768
zfs_arc_meta_limit 0
zfs_vdev_sync_write_max_active 12
l2arc_norw 0
zfs_arc_meta_prune 10000
metaslab_preload_enabled 1
l2arc_nocompress 0
zvol_major 230
zfs_vdev_aggregation_limit 131072
zfs_flags 0
spa_asize_inflation 24
zfs_admin_snapshot 0
l2arc_feed_secs 1
zio_taskq_batch_pct 75
zfs_sync_pass_deferred_free 2
zfs_disable_dup_eviction 0
zfs_arc_grow_retry 0
zfs_read_history_hits 0
zfs_vdev_async_write_min_active 12
zfs_vdev_async_read_max_active 12
zfs_scrub_delay 4
zfs_delay_min_dirty_percent 60
zfs_free_max_blocks 10000
zfs_vdev_cache_max 16384
zio_delay_max 30000
zfs_top_maxinflight 32
spa_slop_shift 5
zfs_vdev_write_gap_limit 4096
spa_load_verify_metadata 1
spa_load_verify_maxinflight 10000
l2arc_noprefetch 1
zfs_vdev_scheduler noop
zfs_expire_snapshot 300
zfs_sync_pass_rewrite 2
zil_replay_disable 0
zfs_nocacheflush 0
zfs_arc_max 0
zfs_arc_min 0
zfs_read_chunk_size 1048576
zfs_txg_timeout 5
zfs_pd_bytes_max 52428800
l2arc_headroom_boost 200
zfs_send_corrupt_data 0
l2arc_feed_min_ms 200
zfs_arc_meta_min 8589934592
zfs_arc_average_blocksize 8192
zfetch_array_rd_sz 1048576
zfs_autoimport_disable 1
zfs_arc_p_min_shift 0
zio_requeue_io_start_cut_in_line 1
zfs_vdev_sync_read_max_active 12
zfs_mdcomp_disable 0
zfs_arc_num_sublists_per_state 8

@BloodBlight
Author

Alright, capturing data every 60 seconds, let's see...

[screenshot: 2017-11-04 14-43-01]

@BloodBlight
Author

Crashed just after 3:04 AM:
[screenshot: 2017-11-05 10-48-42]
All numbers are in GBs.

Not sure that this really exposes anything though... :(

@dweeezil
Contributor

dweeezil commented Nov 6, 2017

One of the clues here is the 12537657 objects in zio_cache. A 1TiB file with a 128KiB recordsize will have 8.3M blocks, each of which needs to be freed. When dedup is enabled, ZFS launches a furious storm of ZIO free operations to release the space. However, when a blkptr has been deduped, a ZAP lookup must first be performed. It's a common misconception that simply having "enough RAM" to hold "the" dedup table will allow some form of reasonable performance. Unfortunately, a ZAP lookup is a very expensive process which involves lots of locking, etc., and when there are many to be performed simultaneously (one needs to be done for each blkptr freed), there's a crazy amount of lock contention. Of course, all those zios need to be stored somewhere, and that's in the zio_cache. I ran a few rudimentary tests on a system with 40 cores and 80 threads, and deleting a 100GiB file frequently caused spikes of 80K+ entries in zio_cache, but that system had a whole lot of horsepower to perform those ZAP lookups.
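
To put that in perspective, a rough back-of-the-envelope using the 1TiB / 128KiB example above and the ~1280-byte zio_t size quoted in the commit message at the end of this thread (order-of-magnitude only; the structure size varies by version):

# Blocks to free for a 1 TiB file at 128 KiB recordsize, and the zio_cache
# footprint if every free ends up pending at once
awk 'BEGIN { b = 2^40 / (128 * 1024);
             printf "blocks:    %.1fM\n", b / 1e6;
             printf "zio_cache: %.1f GiB\n", b * 1280 / 2^30 }'
# -> ~8.4M blocks and ~10 GiB of zio structures for a single 1 TiB file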

I suggest that anyone contemplating dedup view Matt's talk at https://www.youtube.com/watch?v=PYxFDBgxFS8 (slides at http://open-zfs.org/w/images/8/8d/ZFS_dedup.pdf). One of the key points is that "on-disk hash tables suck". The ZAPs, which are the structures in which they're stored, are glorified hash tables, but much more complex.

Here's a flame graph that demonstrates how much time is spent spinning for locking when deleting a meager 100GiB (approx) deduped file: https://gist.github.com/dweeezil/d663bb371cb927a4f3ccc4d124effbd3

I'd also like to reference #6823, and any other issue that involves deleting large files with dedup enabled.

Without turning this into a full-fledged wiki page, here are a few suggestions for anyone wanting to deploy dedup: First off, giant files can be a real problem. If your application involves huge files, consider whether dedup will buy anything at all. Large blocksizes can help because they'll reduce the number of entries in the dedup ZAPs for large files. Dedup can be very useful in certain circumstances, but it requires a great deal of understanding to determine what those actually are. Finally (almost), better dedup will probably happen (see Matt's talk).

It does seem that something could be done to help the delete situation with the current dedup implementation. A new delete throttle to prevent too many zios from piling up seems like it would help quite a bit.

@BloodBlight
Author

BloodBlight commented Nov 11, 2017

Sorry for the long delay.. :( Thanks for the info. I have tried to educate myself as best as possible, and I know of the extreme memory requirements to keep things moving even slowly, and that not having enough can drag your system to a crawl. Been there, done that: 50KBps of IO to a 9-spindle VDEV (I was intentionally trying to break it).

Just to be clear, I am not seeing extremely heavy load (everything is golden until the crash). This is hard locking the box, as in there is no disk activity (triple-checking that now), you cannot ssh into it, and the kernel goes through and kills all threads (effectively panicking the box).

It is my understanding this is not expected behavior.

@BloodBlight
Author

Confirmed, zero disk activity.

@BloodBlight
Author

Looks like I see a small amount of write IO just before the lock:

root@MediaServer:~# cat zpool.stat
[2017-11-11 20:44:13] capacity operations bandwidth
[2017-11-11 20:44:13] pool alloc free read write read write
[2017-11-11 20:44:13] ---------- ----- ----- ----- ----- ----- -----
[2017-11-11 20:44:13] MediaTank 7.71T 1.37T 218 70 1.07M 157K
[2017-11-11 20:45:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-11 20:46:13] MediaTank 7.71T 1.37T 61 0 133K 0
[2017-11-11 20:47:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-11 20:48:13] MediaTank 7.71T 1.37T 62 0 134K 0
[2017-11-11 20:49:13] MediaTank 7.71T 1.37T 62 0 132K 0
[2017-11-11 20:50:13] MediaTank 7.71T 1.37T 61 0 132K 0
--- all zeros ---
[2017-11-12 04:18:13] MediaTank 7.71T 1.37T 61 0 130K 0
[2017-11-12 04:19:13] MediaTank 7.71T 1.37T 61 0 131K 0
[2017-11-12 04:20:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-12 04:21:13] MediaTank 7.71T 1.37T 61 0 132K 0
[2017-11-12 04:22:13] MediaTank 7.71T 1.37T 62 0 132K 0
[2017-11-12 04:23:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-12 04:24:13] MediaTank 7.71T 1.37T 62 0 133K 0
[2017-11-12 04:25:13] MediaTank 7.71T 1.37T 61 0 131K 0
[2017-11-12 04:26:13] MediaTank 7.71T 1.37T 16 27.4K 36.6K 38.0M

@BloodBlight
Author

Makes sense, but I was under the impression that @behlendorf added a fix for this scenario in #5449 and #5706 in 0.7.0-rc4.

if (zfs_per_txg_dirty_frees_percent <= 100)

But that is just my interpretation, I may be reading it wrong. :(

@BloodBlight
Author

Alright, some interesting developments…

I was ready to give up and just move on. Before I could get some additional memory for the server I wanted to get some files off (as it would be a while).

To do so, I blocked the mount, imported the pool, and marked it as read only.

I then unblocked the mount and ran the mount command. Interestingly, it started doing HUGE amounts of writes! I am not sure exactly how much, but over the course of an hour I would guess somewhere around a GB worth of writes… After an hour, the file system mounted…

Alright…. Wasn’t what I would have expected…

So, I said WTH, set the pool to R/W and waited. Zero IO… Hmmm… K… So I rebooted the box to clear any cache.

I have auto import disabled, so once the system was back, I ran the import command. It took a few minutes, but the system mounted! But… it started doing the IO churn again in the background; however, this time space is slowly being freed and I am hovering around 30% memory usage after several hours with about 600GBs of the 1TB delete freed.

There is SOMETHING up with the delete logic...

@BloodBlight
Author

Maybe it cleared because I upgraded but was in a bugged state until then??? IDK.

@nwf
Contributor

nwf commented Nov 19, 2017

Once upon a time, I crudely forced the system's hand by forcing a txg commit periodically during large deletes: nwf@66dbeba ; the official fix 194202e uses a percentage based threshold rather than just counting, but amounts to something similar. Clamping down on the number of frees in a txg for dedup'd data means decreasing the number of DDT lookups and mutations that must be done in sync phase, which helped me survive exactly the situation described here.

I'd have been curious to know how things would have evolved if you'd set zfs_per_txg_dirty_frees_percent to "absurdly small" (e.g., 1) and clamped down on zfs_dirty_data_max (down from its current 10G).

(It'd be better, of course, to have a DDT-aware estimate of the amount of work to be done per free operation, which I don't believe is present.)

@BloodBlight
Author

I see where to set zfs_per_txg_dirty_frees_percent but not zfs_dirty_data_max; I will keep looking and re-compile if/when I find it. I will start with just zfs_per_txg_dirty_frees_percent for now and let you know (compiling now).

More interesting notes.

It eventually hard locked last night.

If I block the mount point and mark the pool RO, then mount (now taking just a few seconds) and then mark it R/W, everything works perfectly! I wrote a small file to the volume, exported the pool, and re-imported using this method. The file is there…

If I import the volume WITHOUT blocking the mount while the volume is in R/W mode, it goes right into the "loop" and eventually locks the box. But it IS mounting now (before, it was blocked).

If I put the volume into RO before exporting, it also re-imports immediately without issue.

If I attempt to put the volume into RO mode while it is in this state, the command hangs and never completes (it appears to, at least; I am not going to wait until it locks).

@nwf
Contributor

nwf commented Nov 20, 2017

You shouldn't need to recompile; these are tunable at module insertion time (and perhaps more generally during system execution).
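
Concretely, both parameters can be poked on a live system via /sys (a sketch using the values discussed above; most, though not all, ZFS module parameters are writable this way):

# Live, as root, without rebuilding or even reloading the module
echo 1          > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max

# Confirm what the module is actually using
grep . /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent \
       /sys/module/zfs/parameters/zfs_dirty_data_max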

@BloodBlight
Author

OH!

@BloodBlight
Author

Changed, rebooting, will watch...

/etc/modprobe.d/zfs.conf

#options zfs zfs_arc_meta_max=51539607552
options zfs zfs_arc_meta_min=8589934592
#options zfs arc_meta_max=51539607554
options zfs arc_meta_min=8589934594

#debug
options zfs zfs_arc_meta_max=8589934592
options zfs arc_meta_max=8589934592
options zfs zfs_prefetch_disable=1
options zfs zfs_dedup_prefetch=0
options zfs zfs_free_max_blocks=100
options spl spl_kmem_cache_slab_limit=16384
options zfs zfs_per_txg_dirty_frees_percent=1
options zfs zfs_dirty_data_max=1073741824
options zfs zfs_vdev_async_read_max_active=2
options zfs zfs_vdev_async_read_min_active=1
options zfs zfs_vdev_async_write_max_active=2
options zfs zfs_vdev_async_write_min_active=1
options zfs zfs_vdev_sync_read_max_active=2
options zfs zfs_vdev_sync_read_min_active=1
options zfs zfs_vdev_sync_write_max_active=2
options zfs zfs_vdev_sync_write_min_active=1
options zfs zfs_dirty_data_max_percent=10

@BloodBlight
Author

We are back to the not mounting behavior. :/

So, I let it run for a while (maybe an hour) and zpool iostat reported zero writes.

Should I let it continue? Anything I can be watching?

@nwf
Contributor

nwf commented Nov 20, 2017

When I tripped over this problem myself, I saw lots of random reads (for bits of the DDT) saturating the disk's IOPS. Writes will be "bursty" as the iteration through the DDT happens (in sync phase, I think) and are then all flushed to disk at once. The goal of tamping down on dirty_frees_percent is to let each of these syncs have smaller working sets.

It's entirely possible, on a machine with gobs of RAM, that even 1% is too much, but I don't know. For my hacked patch I think I let through very few deletes per TXG, as proof of concept.

@BloodBlight
Author

I let it run, and I got zero writes until just before it hard locked and then had a sudden burst of writes.

In fact, once it goes into this state, I haven't found any combination that generates any writes until the very last second.

@nwf
Contributor

nwf commented Nov 21, 2017

That matches my experience: the deletes translate into changes to the DDT, which, being a hash table, is accessed randomly, resulting in a very seek-y, read-only workload while changes accumulate in RAM and then are all flushed out at once when the transaction commits. This will happen repeatedly for every transaction group that commits. You should be able to see this effect, too, by tracking the anon_size in /proc/spl/kstat/zfs/arcstats (I think).
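
A simple way to watch that accumulation between commits (a sketch; the polling interval is arbitrary):

# anon_size is the dirty/anonymous ARC data; it should grow between txg
# commits and drop sharply when one lands
watch -n 10 "grep -E '^(anon_size|arc_tempreserve|size)[[:space:]]' /proc/spl/kstat/zfs/arcstats"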

When you say "hard locked and then", I am confused; do you just mean that it experienced a (brief?) pause (of I/O?) before beginning a large series of writes? Usually one uses "locked" in this context to mean a more permanent state of affairs, a la dead- or live-locked.

If the stalls are still too much for you, further tamping down on the amount of dirty data permitted per transaction may be useful, as might changing the denominator of dirty_frees_percent from 100 to 1000 or larger, to really limit the number of DDT mutations that can arise from deletions. I have found that, even ignoring the DDT issues, ZoL (at least as of the 0.6.5 series) is wildly optimistic about how many IOPS my disks can actually achieve, and so tends to be a little stop-and-go when things start to saturate.

@BloodBlight
Author

Again, apologies for the extended delay. It's been a hard month...

Here is what happens as best as I can tell (when importing in R/W mode):

  1. Prolonged reads as it figures out what it needs to do (12-36 hours with zero writes), the file system is not available during this time.
  2. The system runs out of memory and starts killing processes.
  3. At this point the system becomes non-responsive, but some tasks are still working (such as Splunk).
  4. Part of ZFS gets killed (*assumption) and now another part is free to write and I see a short burst of writes.
  5. The system completely stops responding, all disks go silent, all threads stop.

From stage 2 to stage 5 is less than 5 minutes, probably closer to 60 seconds.

Some more information on what I am seeing right now.

If I bring the pool in as R/O and switch it to R/W, I can perform deletes. These deletes trigger the normal delete process without issue. I have observed, with the changes made so far, that it will consume upwards of 20GBs of RAM purging a small batch of "normal"-sized files. This memory is returned abruptly once completed.

If I stop and let everything quiet down, set the pool to R/O, export and re-import, I am keeping freed space. So, deletes are working!

So I am guessing there is a "stuck" transaction group that wants to roll out when I import in R/W mode. Is there any way to break up this pending commit into smaller chunks, or should it be doing this already? Or perhaps trigger it manually after bringing it online? Does the logic that processes these commits differ between startup and normal runtime?

I am confused on zfs_per_txg_dirty_frees_percent, should I be setting it low or high? Right now I set it very low to:

options zfs zfs_per_txg_dirty_frees_percent=1
options zfs zfs_dirty_data_max=1073741824 #1GB

@gmelikov
Member

gmelikov commented Dec 4, 2017

zfs_per_txg_dirty_frees_percent - a value of zero will disable this throttle. 1% is the worst thing if you have starvation on delete; try 0.

Larger zfs_dirty_data_max would give benefits too (by default it's 10 percent of all memory, capped at zfs_dirty_data_max_max).
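
For this box, that default works out to roughly the value visible in the tunables dump above (a quick sanity check):

# Default zfs_dirty_data_max = 10% of physical memory (capped by zfs_dirty_data_max_max)
awk '/MemTotal/ { printf "%.1f GiB\n", $2 * 1024 * 0.10 / 2^30 }' /proc/meminfo
# -> about 9.4 GiB on this 96 GB machine, i.e. the 10131410944 shown earlier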

@nwf
Contributor

nwf commented Dec 4, 2017

@gmelikov The desire to minimize dirty_frees_percent, and dirty data in general, is an attempt to force a DDT-modification-heavy workload across as many txgs as possible, so that the amount of DDT paged back in, and dirtied, per txg is small. The DDT mutations are not properly accounted during frees, and so will act as a large amplification factor between ZFS's estimate of disk traffic and the actual traffic. Right now, the system is running out of memory attempting to perform a transaction, and I think the culprit is all the DDT mutations; it has been, at least, in my experience with such things. In light of that aim, do you still think that maximizing dirty_data_max and frees_percent is the correct action?

@BloodBlight It sounds like import might be attempting to replay the ZIL all at once; maybe dirty_frees_percent has no effect on ZIL replay, which may be why you OOM on import. I don't know if it's possible to force ZIL replay across several txgs.

@BloodBlight
Author

I don't have a ZIL, but I still assume there is a replay log of some sort happening here (the pending delete). Would that still be called a ZIL (for my future reference)?

Filesystem      Size  Used Avail Use% Mounted on
MediaTank        12T  9.0T  2.2T  82% /mnt
##################################
NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
MediaTank  9.97T  7.51T  2.46T         -    70%    75%  1.21x  ONLINE  -

I am not 100% sure how to interpret the differences in the sizes here (it probably doesn't help that I extended the volume during this), but I would assume there is somewhere between 260GB and 1.8TBs of data to process.

Alright, that is what I was fearing. From one stance it makes sense to process deletes at mount time, but processing deletes before mounting the volume also seems problematic. I would understand if it were required to replay everything linearly for consistency, but because I can still mount and write to the volume I can be fairly sure that isn't the case. It seems like this could be improved. At the very least, if it is required, it should be done in the most memory-efficient way possible (even if extremely slow), with some way to monitor progress.

This box is now slated for being re-purposed and I plan to migrate the data off. But I still have some time with it (a week or so maybe). Should we continue, or just call it?

@nwf
Contributor

nwf commented Dec 5, 2017

What I think you don't have is a "separate log" device. The ZIL is intrinsic to the operation of ZFS.

You might see if "zpool get freeing" says something and, notably, if it's decreasing in all the attempts at import.
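
Concretely (pool name taken from the iostat output earlier):

# Space still queued for background freeing; it should shrink if deletes are
# actually making progress across imports
zpool get freeing MediaTank

# Or poll it during an import attempt
watch -n 60 'zpool get -H -o value freeing MediaTank'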

As a last-ditch effort, you might try cherry-picking nwf@66dbeba. I cannot guarantee that it lands cleanly these days, but the changes are pretty straightforward if not. The goal there is to really restrict how many deletions can be done in a txg: set zfs_dmu_free_long_range_yield to a tiny value (like 1000?) before importing and see if that helps?

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale label Aug 25, 2020
@BloodBlight
Author

Looks like I never responded. Not sure if this should die or not. To the best of my knowledge, this is still an issue.

@stale stale bot removed the Status: Stale label Aug 30, 2020
@stale

stale bot commented Aug 31, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale label Aug 31, 2021
@behlendorf added the Bot: Not Stale and Component: Memory Management labels and removed the Status: Stale label Sep 2, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Nov 15, 2024
dsl_free() calls zio_free() to free the block. For most blocks, this
simply calls metaslab_free() without doing any IO or putting anything on
the IO pipeline.

Some blocks however require additional IO to free. This at least
includes gang, dedup and cloned blocks. For those, zio_free() will issue
a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible
for dsl_dataset_block_kill() to be called millions of times on a single
transaction (eg a 2T object of 128K blocks is 16M blocks). If those are
all IO-inducing frees, that then becomes 16M FREE IOs placed on the
pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T
object that requires a 20G allocation of resident memory from the
zio_cache. If that can't be satisfied by the kernel, an out-of-memory
condition is raised.

This would be better handled by improving the cases that the
dmu_tx_assign() throttle will handle, or by reducing the overheads
required by the IO pipeline, or with a better central facility for
freeing blocks.

For now, we simply check for the cases that would cause zio_free() to
create a FREE IO, and instead put the block on the pool's freelist. This
is the same place that blocks from destroyed datasets go, and the async
destroy machinery will automatically see them and trickle them out as
normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722 
Closes openzfs#16697
ixhamza pushed a commit to truenas/zfs that referenced this issue Dec 2, 2024
(Same commit message as above.)