100% CPU load from arc_prune #9966

Open
dmaziuk opened this issue Feb 7, 2020 · 83 comments
Labels
Component: Memory Management (kernel memory management), Type: Defect (Incorrect behavior, e.g. crash, hang)

dmaziuk commented Feb 7, 2020

System information

Type Version/Name
Distribution Name centos
Distribution Version CentOS release 6.10 (Final)
Linux Kernel Linux hereland.nmrfam.wisc.edu 2.6.32-754.25.1.el6.x86_64 #1 SMP Mon Dec 23 15:19:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Architecture AMD 64
ZFS Version 0.8.3-1
SPL Version 0.8.3-1

Describe the problem you're observing

While moving 14 TB of external data into a ZFS dataset, after a couple of days the CPU is at 99% system load. Clients time out trying to mount the dataset (exported via sharenfs).

Describe how to reproduce the problem

This is the first time it has happened to me; no idea whether it's reproducible.

Include any warning/errors/backtraces from the system logs

top - 09:49:18 up 3 days, 21:38,  3 users,  load average: 48.27, 47.56, 47.73
Tasks: 653 total,  31 running, 622 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us, 99.0%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32993392k total, 28381572k used,  4611820k free,    66360k buffers
Swap: 65535996k total,    57400k used, 65478596k free,  1500656k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20905 root      20   0     0    0    0 R 98.7  0.0 167:51.80 arc_prune
 7821 root      20   0     0    0    0 R 98.4  0.0 212:08.44 arc_prune
 8292 root      20   0     0    0    0 R 98.4  0.0  70:27.30 arc_prune
20589 root      20   0     0    0    0 R 98.4  0.0 168:11.97 arc_prune
22506 root      20   0     0    0    0 R 98.4  0.0 166:39.25 arc_prune
 1772 root      20   0     0    0    0 R 98.1  0.0 475:47.11 arc_prune
 8502 root      20   0     0    0    0 R 98.1  0.0 212:09.17 arc_prune
18799 root      20   0     0    0    0 R 98.1  0.0 169:55.35 arc_prune
19397 root      20   0     0    0    0 R 98.1  0.0 169:08.63 arc_prune
20312 root      20   0     0    0    0 R 98.1  0.0 168:14.17 arc_prune
23623 root      20   0     0    0    0 R 98.1  0.0 164:48.99 arc_prune
28861 root      20   0     0    0    0 R 98.1  0.0 137:44.49 arc_prune
21294 root      20   0     0    0    0 R 97.8  0.0 167:55.44 arc_prune
19192 root      20   0     0    0    0 R 96.5  0.0   4:46.02 arc_prune
23087 root      20   0     0    0    0 R 95.8  0.0 165:28.73 arc_prune
19802 root      20   0     0    0    0 R 94.9  0.0 168:58.81 arc_prune
 1773 root      39  19     0    0    0 R  6.1  0.0  84:38.77 zthr_procedure
19929 root      20   0 15432 1724  944 R  1.9  0.0   0:00.50 top

dmaziuk commented Feb 7, 2020

More info:

# zpool status
  pool: backups
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:10:52 with 0 errors on Sat Feb  1 00:10:54 2020
config:

        NAME                        STATE     READ WRITE CKSUM
        backups                     ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            scsi-35000c500860d0c4b  ONLINE       0     0     0
            scsi-35000c500860d1543  ONLINE       0     0     0
            scsi-35000c500860f9437  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0 days 21:22:47 with 0 errors on Sat Feb  1 21:22:49 2020
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              ONLINE       0     0     0
          raidz1-0                        ONLINE       0     0     0
            scsi-35000c500860ba403        ONLINE       0     0     0
            scsi-35000c500860bbb4f        ONLINE       0     0     0
            scsi-35000c500860bbb5f        ONLINE       0     0     0
          raidz1-2                        ONLINE       0     0     0
            scsi-35000c500860bcb77        ONLINE       0     0     0
            scsi-35000c500860cf9d7        ONLINE       0     0     0
            scsi-35000c500860d0b67        ONLINE       0     0     0
          raidz1-3                        ONLINE       0     0     0
            scsi-35000c500860f96db        ONLINE       0     0     0
            scsi-35000c500860fc85f        ONLINE       0     0     0
            scsi-35000c500860fd93b        ONLINE       0     0     0
        logs
          mirror-1                        ONLINE       0     0     0
            scsi-35000c500301bf47f-part1  ONLINE       0     0     0
            scsi-35000c500301bf69f-part1  ONLINE       0     0     0
        cache
          scsi-35000c500301bf6c3-part1    ONLINE       0     0     0

errors: No known data errors

# zpool iostat
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backups      402G  21.4T      1      0  6.71K   204K
tank        31.9T  33.5T    783    217  33.8M  26.7M
----------  -----  -----  -----  -----  -----  -----

zfs_arc_max is not set; it's left at the default. The cache SSD is 160 GB. All disks are dual-ported SAS drives in an external shelf.


dmaziuk commented Feb 7, 2020

Even more info:

...
Feb  7 04:55:40 hereland zed[6190]: Invoking "all-syslog.sh" eid=180 pid=16962
Feb  7 04:55:40 hereland zed: eid=180 class=history_event pool_guid=0x68C093597CA7900F
Feb  7 04:55:40 hereland zed[6190]: Finished "all-syslog.sh" eid=180 pid=16962 exit=0
Feb  7 04:55:41 hereland zed[6190]: Invoking "all-syslog.sh" eid=181 pid=17099
Feb  7 04:55:41 hereland zed: eid=181 class=history_event pool_guid=0x68C093597CA7900F
Feb  7 04:55:41 hereland zed[6190]: Finished "all-syslog.sh" eid=181 pid=17099 exit=0
Feb  7 06:14:42 hereland kernel: INFO: task nfsd:6109 blocked for more than 120 seconds.
...

There's nothing out of the ordinary in syslog until the nfsd "hung task" message.


dmaziuk commented Feb 7, 2020

Does it look like I'm getting hit by the "don't make the cache disk bigger than 5 x 1/2 RAM" problem? I thought that was no longer a thing.

behlendorf added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Feb 7, 2020
behlendorf (Contributor) commented:

@dmaziuk that's a possibility. Can you post the contents of the /proc/spl/kstat/zfs/arcstats file? It should give us a better idea of why the system is so aggressively trying to reclaim from the ARC.


dmaziuk commented Feb 7, 2020

# cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 14203091063 342692306280336
name                            type data
hits                            4    1961861302
misses                          4    214595515
demand_data_hits                4    535849124
demand_data_misses              4    21425564
demand_metadata_hits            4    1310503883
demand_metadata_misses          4    29241451
prefetch_data_hits              4    1327
prefetch_data_misses            4    146730344
prefetch_metadata_hits          4    115506968
prefetch_metadata_misses        4    17198156
mru_hits                        4    760224963
mru_ghost_hits                  4    1685841
mfu_hits                        4    1094334124
mfu_ghost_hits                  4    1919548
deleted                         4    294724177
mutex_miss                      4    24519247
access_skip                     4    701
evict_skip                      4    554617006
evict_not_enough                4    75691673
evict_l2_cached                 4    8277499373056
evict_l2_eligible               4    14072033056256
evict_l2_ineligible             4    9966615764992
evict_l2_skip                   4    4031567
hash_elements                   4    2149096
hash_elements_max               4    6808116
hash_collisions                 4    222109741
hash_chains                     4    394313
hash_chain_max                  4    11
p                               4    15524174769
c                               4    16823934976
c_min                           4    1051495936
c_max                           4    16823934976
size                            4    16951797592
compressed_size                 4    1377493504
uncompressed_size               4    4542341632
overhead_size                   4    4540340224
hdr_size                        4    101469040
data_size                       4    742869504
metadata_size                   4    5174964224
dbuf_size                       4    2256648576
dnode_size                      4    6258464128
bonus_size                      4    2234942720
anon_size                       4    705024
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    1051861504
mru_evictable_data              4    0
mru_evictable_metadata          4    8192
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    4865267200
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    3173912
l2_misses                       4    211421480
l2_feeds                        4    786569
l2_rw_clash                     4    16
l2_read_bytes                   4    11567827968
l2_write_bytes                  4    4970878013440
l2_writes_sent                  4    689715
l2_writes_done                  4    689715
l2_writes_error                 4    0
l2_writes_lock_retry            4    1425
l2_evict_lock_retry             4    1161
l2_evict_reading                4    37
l2_evict_l1cached               4    1921025
l2_free_on_write                4    385458
l2_abort_lowmem                 4    119
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    182019350528
l2_asize                        4    140228637696
l2_hdr_size                     4    182439072
memory_throttle_count           4    0
memory_direct_count             4    143
memory_indirect_count           4    1174580
memory_all_bytes                4    33785233408
memory_free_bytes               4    7054409728
memory_available_bytes          3    6528663552
arc_no_grow                     4    0
arc_tempreserve                 4    9216
arc_loaned_bytes                4    0
arc_prune                       4    13876726
arc_meta_used                   4    16208928088
arc_meta_limit                  4    12617951232
arc_dnode_limit                 4    1261795123
arc_meta_max                    4    16893738168
arc_meta_min                    4    16777216
async_upgrade_sync              4    1740388
demand_hit_predictive_prefetch  4    67124685
demand_hit_prescient_prefetch   4    83590118
arc_need_free                   4    0
arc_sys_free                    4    525747968
arc_raw_size                    4    0


dmaziuk commented Feb 7, 2020

System load is down to 95% or so now, BTW (but nfsd is still hosed).

behlendorf (Contributor) commented:

According to the arcstats the ARC is entirely filled with metadata and is over the target (arc_meta_used > arc_meta_limit). The arc_prune threads are trying to evict some of this metadata to get back down to the 75% target value.

arc_meta_used                   4    16208928088
arc_meta_limit                  4    12617951232

They appear to be unable to make any progress, which is why they're spinning. If I recall correctly, this was caused by the file handles cached by nfsd holding a reference, which prevents the ARC from freeing the buffers. I believe the nfsd behavior here is a little different, and better behaved, in newer kernels.
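
As a quick sanity check on a live system, something along these lines prints both counters and whether the ARC is currently over its metadata target (a sketch; it just reads the same kstat file quoted above):

awk '$1=="arc_meta_used"{u=$3} $1=="arc_meta_limit"{l=$3} END{print u, l, (u>l ? "over" : "under")}' /proc/spl/kstat/zfs/arcstats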

You should be able to stop the spinning by setting the limit to 100%. You can make this the default with the zfs_arc_meta_limit_percent=100 module option, but you'll need to reload the ZFS modules for it to take effect. To change the limit immediately, you can set zfs_arc_meta_limit to the arc_meta_max value.

echo 16893738168 >/sys/module/zfs/parameters/zfs_arc_meta_limit
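
To make the percentage-based setting persistent, the module option line would look something like this (any .conf file under /etc/modprobe.d/ works; zfs.conf is just the conventional name):

echo "options zfs zfs_arc_meta_limit_percent=100" >> /etc/modprobe.d/zfs.conf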

behlendorf added the Component: Memory Management (kernel memory management) label on Feb 7, 2020

dmaziuk commented Feb 7, 2020

echo 16893738168 >/sys/module/zfs/parameters/zfs_arc_meta_limit

Done, arc_meta_used is going down now, but very slowly, with occasional dips in system load down to 80%-ish.
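
A simple way to keep an eye on it is to poll the relevant counters every few seconds, e.g.:

watch -n 10 "grep -E 'arc_meta_used|arc_meta_limit|arc_prune ' /proc/spl/kstat/zfs/arcstats"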

@behlendorf if you need anything else from this system, say so, because my users are crying and I have to bounce it.

Would it be possible to make it not use all available cores? It's a 16-core system and there are 16 instances of arc_prune. I wonder if it should be limited to e.g. half the cores, the way the ARC is limited to half the RAM by default.

behlendorf (Contributor) commented:

Feel free to bounce the system; the arcstats are enough to confirm the issue. This definitely can, and should, be improved, and we have some ideas about how to do that.


dmaziuk commented Feb 7, 2020

Is this from the RAM ARC or the cache-SSD ARC? I.e., if it's the latter, can I get around it by disconnecting the cache device -- temporarily, while moving a zillion files around?


behlendorf commented Feb 7, 2020

It's the RAM ARC; disconnecting the cache device won't help. Setting zfs_arc_meta_limit_percent=100 should avoid the spinning.


dmaziuk commented Feb 7, 2020

OK, thank you. I did set it in /etc/modprobe.d/zfs.conf, though I really don't expect to hit this again -- unless another group moves in with 8 years' worth of data that they think they want on our "big" file server.

Thanks again


mailinglists35 commented Feb 12, 2020

To change the limit immediately, you can set zfs_arc_meta_limit to the arc_meta_max

@behlendorf thanks for providing the hint -- it made the spinning go away for me as well!
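
For reference, the immediate change boils down to something like this (a sketch: read arc_meta_max out of arcstats and write it into the tunable; adjust as needed):

root@zfs:~# awk '$1 == "arc_meta_max" {print $3}' /proc/spl/kstat/zfs/arcstats > /sys/module/zfs/parameters/zfs_arc_meta_limit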
Below are the parameters/statistics from my system when the spinning occurred (it happened during send/receive to/from local pools).

How do you plan to solve this by default in the future, without needing user intervention?

root@zfs:~# systool -vm spl
Module = "spl"

  Attributes:
    coresize            = "106496"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "5"
    srcversion          = "09B6CA3EE9E5F5E76F82D5C"
    taint               = "OE"
    uevent              = <store method only>
    version             = "0.8.3-1~rlaager1.18.04.1"

  Parameters:
    spl_hostid          = "0"
    spl_hostid_path     = "/etc/hostid"
    spl_kmem_alloc_max  = "1048576"
    spl_kmem_alloc_warn = "65536"
    spl_kmem_cache_expire= "2"
    spl_kmem_cache_kmem_limit= "2048"
    spl_kmem_cache_kmem_threads= "4"
    spl_kmem_cache_magazine_size= "0"
    spl_kmem_cache_max_size= "32"
    spl_kmem_cache_obj_per_slab_min= "1"
    spl_kmem_cache_obj_per_slab= "8"
    spl_kmem_cache_reclaim= "0"
    spl_kmem_cache_slab_limit= "16384"
    spl_max_show_tasks  = "512"
    spl_panic_halt      = "0"
    spl_schedule_hrtimeout_slack_us= "0"
    spl_taskq_kick      = "0"
    spl_taskq_thread_bind= "0"
    spl_taskq_thread_dynamic= "1"
    spl_taskq_thread_priority= "1"
    spl_taskq_thread_sequential= "4"

  Sections:
    .altinstr_replacement= "0xffffffffc049b795"
    .altinstructions    = "0xffffffffc049f082"
    .bss                = "0xffffffffc04a0ec0"
    .data               = "0xffffffffc04a0000"
    .exit.text          = "0xffffffffc049b762"
    .gnu.linkonce.this_module= "0xffffffffc04a0b80"
    .init.text          = "0xffffffffc04ac000"
    .note.gnu.build-id  = "0xffffffffc049c000"
    .parainstructions   = "0xffffffffc049d358"
    .rodata             = "0xffffffffc049c820"
    .rodata.str1.1      = "0xffffffffc049cdc6"
    .rodata.str1.8      = "0xffffffffc049e918"
    .smp_locks          = "0xffffffffc049d2c4"
    .strtab             = "0xffffffffc04b2160"
    .symtab             = "0xffffffffc04ad000"
    .text               = "0xffffffffc0491000"
    __bug_table         = "0xffffffffc04a0b60"
    __ksymtab           = "0xffffffffc049c030"
    __ksymtab_strings   = "0xffffffffc049d7d4"
    __mcount_loc        = "0xffffffffc049e238"
    __param             = "0xffffffffc049def0"

root@zfs:~# systool -vm zfs
Module = "zfs"

  Attributes:
    coresize            = "3854336"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "38"
    srcversion          = "C6177AA5049CC30B672B1CA"
    taint               = "POE"
    uevent              = <store method only>
    version             = "0.8.3-1~rlaager1.18.04.1"

  Parameters:
    dbuf_cache_hiwater_pct= "10"
    dbuf_cache_lowater_pct= "10"
    dbuf_cache_max_bytes= "33554432"
    dbuf_cache_shift    = "5"
    dbuf_metadata_cache_max_bytes= "16777216"
    dbuf_metadata_cache_shift= "6"
    dmu_object_alloc_chunk_shift= "7"
    dmu_prefetch_max    = "134217728"
    ignore_hole_birth   = "1"
    l2arc_feed_again    = "1"
    l2arc_feed_min_ms   = "200"
    l2arc_feed_secs     = "1"
    l2arc_headroom      = "2"
    l2arc_headroom_boost= "200"
    l2arc_noprefetch    = "1"
    l2arc_norw          = "0"
    l2arc_write_boost   = "8388608"
    l2arc_write_max     = "8388608"
    metaslab_aliquot    = "524288"
    metaslab_bias_enabled= "1"
    metaslab_debug_load = "0"
    metaslab_debug_unload= "0"
    metaslab_df_max_search= "16777216"
    metaslab_df_use_largest_segment= "0"
    metaslab_force_ganging= "16777217"
    metaslab_fragmentation_factor_enabled= "1"
    metaslab_lba_weighting_enabled= "1"
    metaslab_preload_enabled= "1"
    send_holes_without_birth_time= "1"
    spa_asize_inflation = "24"
    spa_config_path     = "/etc/zfs/zpool.cache"
    spa_load_print_vdev_tree= "0"
    spa_load_verify_data= "1"
    spa_load_verify_metadata= "1"
    spa_load_verify_shift= "4"
    spa_slop_shift      = "5"
    vdev_removal_max_span= "32768"
    vdev_validate_skip  = "0"
    zap_iterate_prefetch= "1"
    zfetch_array_rd_sz  = "1048576"
    zfetch_max_distance = "8388608"
    zfetch_max_streams  = "8"
    zfetch_min_sec_reap = "2"
    zfs_abd_scatter_enabled= "1"
    zfs_abd_scatter_max_order= "10"
    zfs_abd_scatter_min_size= "1536"
    zfs_admin_snapshot  = "0"
    zfs_arc_average_blocksize= "8192"
    zfs_arc_dnode_limit = "0"
    zfs_arc_dnode_limit_percent= "10"
    zfs_arc_dnode_reduce_percent= "10"
    zfs_arc_grow_retry  = "0"
    zfs_arc_lotsfree_percent= "10"
    zfs_arc_max         = "1073741824"
    zfs_arc_meta_adjust_restarts= "4096"
    zfs_arc_meta_limit_percent= "75"
    zfs_arc_meta_limit  = "0"
    zfs_arc_meta_min    = "0"
    zfs_arc_meta_prune  = "10000"
    zfs_arc_meta_strategy= "1"
    zfs_arc_min_prefetch_ms= "0"
    zfs_arc_min_prescient_prefetch_ms= "0"
    zfs_arc_min         = "0"
    zfs_arc_p_dampener_disable= "1"
    zfs_arc_p_min_shift = "0"
    zfs_arc_pc_percent  = "0"
    zfs_arc_shrink_shift= "0"
    zfs_arc_sys_free    = "0"
    zfs_async_block_max_blocks= "100000"
    zfs_autoimport_disable= "1"
    zfs_checksum_events_per_second= "20"
    zfs_commit_timeout_pct= "5"
    zfs_compressed_arc_enabled= "1"
    zfs_condense_indirect_commit_entry_delay_ms= "0"
    zfs_condense_indirect_vdevs_enable= "1"
    zfs_condense_max_obsolete_bytes= "1073741824"
    zfs_condense_min_mapping_bytes= "131072"
    zfs_dbgmsg_enable   = "1"
    zfs_dbgmsg_maxsize  = "4194304"
    zfs_dbuf_state_index= "0"
    zfs_ddt_data_is_special= "1"
    zfs_deadman_checktime_ms= "60000"
    zfs_deadman_enabled = "1"
    zfs_deadman_failmode= "wait"
    zfs_deadman_synctime_ms= "600000"
    zfs_deadman_ziotime_ms= "300000"
    zfs_dedup_prefetch  = "0"
    zfs_delay_min_dirty_percent= "60"
    zfs_delay_scale     = "500000"
    zfs_delete_blocks   = "20480"
    zfs_dirty_data_max_max= "2020201472"
    zfs_dirty_data_max_max_percent= "25"
    zfs_dirty_data_max_percent= "10"
    zfs_dirty_data_max  = "808080588"
    zfs_dirty_data_sync_percent= "20"
    zfs_disable_ivset_guid_check= "0"
    zfs_dmu_offset_next_sync= "0"
    zfs_expire_snapshot = "300"
    zfs_flags           = "0"
    zfs_free_bpobj_enabled= "1"
    zfs_free_leak_on_eio= "0"
    zfs_free_min_time_ms= "1000"
    zfs_immediate_write_sz= "32768"
    zfs_initialize_value= "16045690984833335022"
    zfs_key_max_salt_uses= "400000000"
    zfs_lua_max_instrlimit= "100000000"
    zfs_lua_max_memlimit= "104857600"
    zfs_max_missing_tvds= "0"
    zfs_max_recordsize  = "1048576"
    zfs_metaslab_fragmentation_threshold= "70"
    zfs_metaslab_segment_weight_enabled= "1"
    zfs_metaslab_switch_threshold= "2"
    zfs_mg_fragmentation_threshold= "95"
    zfs_mg_noalloc_threshold= "0"
    zfs_multihost_fail_intervals= "10"
    zfs_multihost_history= "0"
    zfs_multihost_import_intervals= "20"
    zfs_multihost_interval= "1000"
    zfs_multilist_num_sublists= "0"
    zfs_no_scrub_io     = "0"
    zfs_no_scrub_prefetch= "0"
    zfs_nocacheflush    = "0"
    zfs_nopwrite_enabled= "1"
    zfs_object_mutex_size= "64"
    zfs_obsolete_min_time_ms= "500"
    zfs_override_estimate_recordsize= "0"
    zfs_pd_bytes_max    = "52428800"
    zfs_per_txg_dirty_frees_percent= "5"
    zfs_prefetch_disable= "0"
    zfs_read_chunk_size = "1048576"
    zfs_read_history    = "0"
    zfs_read_history_hits= "0"
    zfs_reconstruct_indirect_combinations_max= "4096"
    zfs_recover         = "0"
    zfs_recv_queue_length= "16777216"
    zfs_removal_ignore_errors= "0"
    zfs_removal_suspend_progress= "0"
    zfs_remove_max_segment= "16777216"
    zfs_resilver_disable_defer= "0"
    zfs_resilver_min_time_ms= "3000"
    zfs_scan_checkpoint_intval= "7200"
    zfs_scan_fill_weight= "3"
    zfs_scan_ignore_errors= "0"
    zfs_scan_issue_strategy= "0"
    zfs_scan_legacy     = "0"
    zfs_scan_max_ext_gap= "2097152"
    zfs_scan_mem_lim_fact= "20"
    zfs_scan_mem_lim_soft_fact= "20"
    zfs_scan_strict_mem_lim= "0"
    zfs_scan_suspend_progress= "0"
    zfs_scan_vdev_limit = "4194304"
    zfs_scrub_min_time_ms= "1000"
    zfs_send_corrupt_data= "0"
    zfs_send_queue_length= "16777216"
    zfs_send_unmodified_spill_blocks= "1"
    zfs_slow_io_events_per_second= "20"
    zfs_spa_discard_memory_limit= "16777216"
    zfs_special_class_metadata_reserve_pct= "25"
    zfs_sync_pass_deferred_free= "2"
    zfs_sync_pass_dont_compress= "8"
    zfs_sync_pass_rewrite= "2"
    zfs_sync_taskq_batch_pct= "75"
    zfs_trim_extent_bytes_max= "134217728"
    zfs_trim_extent_bytes_min= "32768"
    zfs_trim_metaslab_skip= "0"
    zfs_trim_queue_limit= "10"
    zfs_trim_txg_batch  = "32"
    zfs_txg_history     = "100"
    zfs_txg_timeout     = "5"
    zfs_unlink_suspend_progress= "0"
    zfs_user_indirect_is_special= "1"
    zfs_vdev_aggregate_trim= "0"
    zfs_vdev_aggregation_limit_non_rotating= "131072"
    zfs_vdev_aggregation_limit= "1048576"
    zfs_vdev_async_read_max_active= "3"
    zfs_vdev_async_read_min_active= "1"
    zfs_vdev_async_write_active_max_dirty_percent= "60"
    zfs_vdev_async_write_active_min_dirty_percent= "30"
    zfs_vdev_async_write_max_active= "10"
    zfs_vdev_async_write_min_active= "2"
    zfs_vdev_cache_bshift= "16"
    zfs_vdev_cache_max  = "16384"
    zfs_vdev_cache_size = "0"
    zfs_vdev_default_ms_count= "200"
    zfs_vdev_initializing_max_active= "1"
    zfs_vdev_initializing_min_active= "1"
    zfs_vdev_max_active = "1000"
    zfs_vdev_min_ms_count= "16"
    zfs_vdev_mirror_non_rotating_inc= "0"
    zfs_vdev_mirror_non_rotating_seek_inc= "1"
    zfs_vdev_mirror_rotating_inc= "0"
    zfs_vdev_mirror_rotating_seek_inc= "5"
    zfs_vdev_mirror_rotating_seek_offset= "1048576"
    zfs_vdev_ms_count_limit= "131072"
    zfs_vdev_queue_depth_pct= "1000"
    zfs_vdev_raidz_impl = "cycle [fastest] original scalar sse2 ssse3 "
    zfs_vdev_read_gap_limit= "32768"
    zfs_vdev_removal_max_active= "2"
    zfs_vdev_removal_min_active= "1"
    zfs_vdev_scheduler  = "unused"
    zfs_vdev_scrub_max_active= "2"
    zfs_vdev_scrub_min_active= "1"
    zfs_vdev_sync_read_max_active= "10"
    zfs_vdev_sync_read_min_active= "10"
    zfs_vdev_sync_write_max_active= "10"
    zfs_vdev_sync_write_min_active= "10"
    zfs_vdev_trim_max_active= "2"
    zfs_vdev_trim_min_active= "1"
    zfs_vdev_write_gap_limit= "4096"
    zfs_zevent_cols     = "80"
    zfs_zevent_console  = "0"
    zfs_zevent_len_max  = "512"
    zfs_zil_clean_taskq_maxalloc= "1048576"
    zfs_zil_clean_taskq_minalloc= "1024"
    zfs_zil_clean_taskq_nthr_pct= "100"
    zil_maxblocksize    = "131072"
    zil_nocacheflush    = "0"
    zil_replay_disable  = "0"
    zil_slog_bulk       = "786432"
    zio_deadman_log_all = "0"
    zio_dva_throttle_enabled= "1"
    zio_requeue_io_start_cut_in_line= "1"
    zio_slow_io_ms      = "30000"
    zio_taskq_batch_pct = "75"
    zvol_inhibit_dev    = "0"
    zvol_major          = "230"
    zvol_max_discard_blocks= "16384"
    zvol_prefetch_bytes = "131072"
    zvol_request_sync   = "0"
    zvol_threads        = "32"
    zvol_volmode        = "1"

  Sections:
    .bss                = "0xffffffffc07d2100"
    .data               = "0xffffffffc07c0060"
    .exit.text          = "0xffffffffc07738c9"
    .gnu.linkonce.this_module= "0xffffffffc07d1dc0"
    .init.text          = "0xffffffffc09a7000"
    .note.gnu.build-id  = "0xffffffffc0774000"
    .parainstructions   = "0xffffffffc07aec00"
    .rodata             = "0xffffffffc0776800"
    .rodata.str1.1      = "0xffffffffc078de1c"
    .rodata.str1.8      = "0xffffffffc0792cb0"
    .smp_locks          = "0xffffffffc078d684"
    .strtab             = "0xffffffffc09e1178"
    .symtab             = "0xffffffffc09a9000"
    .text               = "0xffffffffc05f9000"
    .text.unlikely      = "0xffffffffc07731d8"
    __bug_table         = "0xffffffffc07d1da4"
    __jump_table        = "0xffffffffc07c0000"
    __ksymtab           = "0xffffffffc0774030"
    __ksymtab_strings   = "0xffffffffc07bd148"
    __mcount_loc        = "0xffffffffc07b4a28"
    __param             = "0xffffffffc07b2610"


root@zfs:~# cat /proc/spl/kstat/zfs/arcstats
12 1 0x01 98 26656 14093184634 1224910481851349
name                            type data
hits                            4    704041195
misses                          4    2744949
demand_data_hits                4    260501
demand_data_misses              4    476953
demand_metadata_hits            4    703209429
demand_metadata_misses          4    969970
prefetch_data_hits              4    11386
prefetch_data_misses            4    239621
prefetch_metadata_hits          4    559879
prefetch_metadata_misses        4    1058405
mru_hits                        4    13617271
mru_ghost_hits                  4    17737
mfu_hits                        4    690001738
mfu_ghost_hits                  4    1765
deleted                         4    3101597
mutex_miss                      4    1677
access_skip                     4    5
evict_skip                      4    39643211
evict_not_enough                4    7411892
evict_l2_cached                 4    0
evict_l2_eligible               4    21536072704
evict_l2_ineligible             4    40198401024
evict_l2_skip                   4    0
hash_elements                   4    13708
hash_elements_max               4    432738
hash_collisions                 4    364283
hash_chains                     4    94
hash_chain_max                  4    5
p                               4    689080832
c                               4    1073741824
c_min                           4    252525184
c_max                           4    1073741824
size                            4    1222968480
compressed_size                 4    295622656
uncompressed_size               4    332941312
overhead_size                   4    273108480
hdr_size                        4    6912992
data_size                       4    33457664
metadata_size                   4    535273472
dbuf_size                       4    133838016
dnode_size                      4    380898624
bonus_size                      4    132587712
anon_size                       4    50364416
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    310185984
mru_evictable_data              4    0
mru_evictable_metadata          4    0
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    208180736
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    8080805888
memory_free_bytes               4    3879395328
memory_available_bytes          3    3753136128
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    15495680
arc_prune                       4    11107355
arc_meta_used                   4    1189510816
arc_meta_limit                  4    805306368
arc_dnode_limit                 4    80530636
arc_meta_max                    4    1189630648
arc_meta_min                    4    16777216
async_upgrade_sync              4    1796
demand_hit_predictive_prefetch  4    77527
demand_hit_prescient_prefetch   4    1448691
arc_need_free                   4    0
arc_sys_free                    4    126262592
arc_raw_size                    4    195579904


root@zfs:~# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Feb 12 16:33:48 2020
Linux 4.15.0-76-generic                         0.8.3-1~rlaager1.18.04.1
Machine: zfs (x86_64)                       0.8.3-1~rlaager1.18.04.1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   114.5 %    1.1 GiB
        Target size (adaptive):                       100.0 %    1.0 GiB
        Min size (hard limit):                         23.5 %  240.8 MiB
        Max size (high water):                            4:1    1.0 GiB
        Most Frequently Used (MFU) cache size:         39.7 %  199.4 MiB
        Most Recently Used (MRU) cache size:           60.3 %  302.6 MiB
        Metadata cache size (hard limit):              75.0 %  768.0 MiB
        Metadata cache size (current):                148.5 %    1.1 GiB
        Dnode cache size (hard limit):                 10.0 %   76.8 MiB
        Dnode cache size (current):                   474.1 %  364.1 MiB

ARC hash breakdown:
        Elements max:                                             432.7k
        Elements current:                               3.2 %      13.9k
        Collisions:                                               364.3k
        Chain max:                                                     5
        Chains:                                                       98

ARC misc:
        Deleted:                                                    3.1M
        Mutex misses:                                               1.7k
        Eviction skips:                                            40.1M

ARC total accesses (hits + misses):                               706.8M
        Cache hit ratio:                               99.6 %     704.0M
        Cache miss ratio:                               0.4 %       2.7M
        Actual hit ratio (MFU + MRU hits):             99.6 %     703.6M
        Data demand efficiency:                        35.3 %     737.5k
        Data prefetch efficiency:                       4.5 %     251.0k

Cache hits by cache type:
        Most frequently used (MFU):                    98.0 %     690.0M
        Most recently used (MRU):                       1.9 %      13.6M
        Most frequently used (MFU) ghost:             < 0.1 %       1.8k
        Most recently used (MRU) ghost:               < 0.1 %      17.7k
        Anonymously used:                               0.1 %     402.7k

Cache hits by data type:
        Demand data:                                  < 0.1 %     260.5k
        Demand prefetch data:                         < 0.1 %      11.4k
        Demand metadata:                               99.9 %     703.2M
        Demand prefetch metadata:                       0.1 %     559.9k

Cache misses by data type:
        Demand data:                                   17.4 %     477.0k
        Demand prefetch data:                           8.7 %     239.6k
        Demand metadata:                               35.3 %     970.0k
        Demand prefetch metadata:                      38.6 %       1.1M

DMU prefetch efficiency:                                          439.6M
        Hit ratio:                                      0.1 %     235.6k
        Miss ratio:                                    99.9 %     439.3M

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       1048576
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_expire                                          2
        spl_kmem_cache_kmem_limit                                   2048
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_obj_per_slab_min                                1
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 0
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4

Tunables:
        dbuf_cache_hiwater_pct                                        10
        dbuf_cache_lowater_pct                                        10
        dbuf_cache_max_bytes                                    33554432
        dbuf_cache_shift                                               5
        dbuf_metadata_cache_max_bytes                           16777216
        dbuf_metadata_cache_shift                                      6
        dmu_object_alloc_chunk_shift                                   7
        dmu_prefetch_max                                       134217728
        ignore_hole_birth                                              1
        l2arc_feed_again                                               1
        l2arc_feed_min_ms                                            200
        l2arc_feed_secs                                                1
        l2arc_headroom                                                 2
        l2arc_headroom_boost                                         200
        l2arc_noprefetch                                               1
        l2arc_norw                                                     0
        l2arc_write_boost                                        8388608
        l2arc_write_max                                          8388608
        metaslab_aliquot                                          524288
        metaslab_bias_enabled                                          1
        metaslab_debug_load                                            0
        metaslab_debug_unload                                          0
        metaslab_df_max_search                                  16777216
        metaslab_df_use_largest_segment                                0
        metaslab_force_ganging                                  16777217
        metaslab_fragmentation_factor_enabled                          1
        metaslab_lba_weighting_enabled                                 1
        metaslab_preload_enabled                                       1
        send_holes_without_birth_time                                  1
        spa_asize_inflation                                           24
        spa_config_path                             /etc/zfs/zpool.cache
        spa_load_print_vdev_tree                                       0
        spa_load_verify_data                                           1
        spa_load_verify_metadata                                       1
        spa_load_verify_shift                                          4
        spa_slop_shift                                                 5
        vdev_removal_max_span                                      32768
        vdev_validate_skip                                             0
        zap_iterate_prefetch                                           1
        zfetch_array_rd_sz                                       1048576
        zfetch_max_distance                                      8388608
        zfetch_max_streams                                             8
        zfetch_min_sec_reap                                            2
        zfs_abd_scatter_enabled                                        1
        zfs_abd_scatter_max_order                                     10
        zfs_abd_scatter_min_size                                    1536
        zfs_admin_snapshot                                             0
        zfs_arc_average_blocksize                                   8192
        zfs_arc_dnode_limit                                            0
        zfs_arc_dnode_limit_percent                                   10
        zfs_arc_dnode_reduce_percent                                  10
        zfs_arc_grow_retry                                             0
        zfs_arc_lotsfree_percent                                      10
        zfs_arc_max                                           1073741824
        zfs_arc_meta_adjust_restarts                                4096
        zfs_arc_meta_limit                                             0
        zfs_arc_meta_limit_percent                                    75
        zfs_arc_meta_min                                               0
        zfs_arc_meta_prune                                         10000
        zfs_arc_meta_strategy                                          1
        zfs_arc_min                                                    0
        zfs_arc_min_prefetch_ms                                        0
        zfs_arc_min_prescient_prefetch_ms                              0
        zfs_arc_p_dampener_disable                                     1
        zfs_arc_p_min_shift                                            0
        zfs_arc_pc_percent                                             0
        zfs_arc_shrink_shift                                           0
        zfs_arc_sys_free                                               0
        zfs_async_block_max_blocks                                100000
        zfs_autoimport_disable                                         1
        zfs_checksum_events_per_second                                20
        zfs_commit_timeout_pct                                         5
        zfs_compressed_arc_enabled                                     1
        zfs_condense_indirect_commit_entry_delay_ms                    0
        zfs_condense_indirect_vdevs_enable                             1
        zfs_condense_max_obsolete_bytes                       1073741824
        zfs_condense_min_mapping_bytes                            131072
        zfs_dbgmsg_enable                                              1
        zfs_dbgmsg_maxsize                                       4194304
        zfs_dbuf_state_index                                           0
        zfs_ddt_data_is_special                                        1
        zfs_deadman_checktime_ms                                   60000
        zfs_deadman_enabled                                            1
        zfs_deadman_failmode                                        wait
        zfs_deadman_synctime_ms                                   600000
        zfs_deadman_ziotime_ms                                    300000
        zfs_dedup_prefetch                                             0
        zfs_delay_min_dirty_percent                                   60
        zfs_delay_scale                                           500000
        zfs_delete_blocks                                          20480
        zfs_dirty_data_max                                     808080588
        zfs_dirty_data_max_max                                2020201472
        zfs_dirty_data_max_max_percent                                25
        zfs_dirty_data_max_percent                                    10
        zfs_dirty_data_sync_percent                                   20
        zfs_disable_ivset_guid_check                                   0
        zfs_dmu_offset_next_sync                                       0
        zfs_expire_snapshot                                          300
        zfs_flags                                                      0
        zfs_free_bpobj_enabled                                         1
        zfs_free_leak_on_eio                                           0
        zfs_free_min_time_ms                                        1000
        zfs_immediate_write_sz                                     32768
        zfs_initialize_value                        16045690984833335022
        zfs_key_max_salt_uses                                  400000000
        zfs_lua_max_instrlimit                                 100000000
        zfs_lua_max_memlimit                                   104857600
        zfs_max_missing_tvds                                           0
        zfs_max_recordsize                                       1048576
        zfs_metaslab_fragmentation_threshold                          70
        zfs_metaslab_segment_weight_enabled                            1
        zfs_metaslab_switch_threshold                                  2
        zfs_mg_fragmentation_threshold                                95
        zfs_mg_noalloc_threshold                                       0
        zfs_multihost_fail_intervals                                  10
        zfs_multihost_history                                          0
        zfs_multihost_import_intervals                                20
        zfs_multihost_interval                                      1000
        zfs_multilist_num_sublists                                     0
        zfs_no_scrub_io                                                0
        zfs_no_scrub_prefetch                                          0
        zfs_nocacheflush                                               0
        zfs_nopwrite_enabled                                           1
        zfs_object_mutex_size                                         64
        zfs_obsolete_min_time_ms                                     500
        zfs_override_estimate_recordsize                               0
        zfs_pd_bytes_max                                        52428800
        zfs_per_txg_dirty_frees_percent                                5
        zfs_prefetch_disable                                           0
        zfs_read_chunk_size                                      1048576
        zfs_read_history                                               0
        zfs_read_history_hits                                          0
        zfs_reconstruct_indirect_combinations_max                   4096
        zfs_recover                                                    0
        zfs_recv_queue_length                                   16777216
        zfs_removal_ignore_errors                                      0
        zfs_removal_suspend_progress                                   0
        zfs_remove_max_segment                                  16777216
        zfs_resilver_disable_defer                                     0
        zfs_resilver_min_time_ms                                    3000
        zfs_scan_checkpoint_intval                                  7200
        zfs_scan_fill_weight                                           3
        zfs_scan_ignore_errors                                         0
        zfs_scan_issue_strategy                                        0
        zfs_scan_legacy                                                0
        zfs_scan_max_ext_gap                                     2097152
        zfs_scan_mem_lim_fact                                         20
        zfs_scan_mem_lim_soft_fact                                    20
        zfs_scan_strict_mem_lim                                        0
        zfs_scan_suspend_progress                                      0
        zfs_scan_vdev_limit                                      4194304
        zfs_scrub_min_time_ms                                       1000
        zfs_send_corrupt_data                                          0
        zfs_send_queue_length                                   16777216
        zfs_send_unmodified_spill_blocks                               1
        zfs_slow_io_events_per_second                                 20
        zfs_spa_discard_memory_limit                            16777216
        zfs_special_class_metadata_reserve_pct                        25
        zfs_sync_pass_deferred_free                                    2
        zfs_sync_pass_dont_compress                                    8
        zfs_sync_pass_rewrite                                          2
        zfs_sync_taskq_batch_pct                                      75
        zfs_trim_extent_bytes_max                              134217728
        zfs_trim_extent_bytes_min                                  32768
        zfs_trim_metaslab_skip                                         0
        zfs_trim_queue_limit                                          10
        zfs_trim_txg_batch                                            32
        zfs_txg_history                                              100
        zfs_txg_timeout                                                5
        zfs_unlink_suspend_progress                                    0
        zfs_user_indirect_is_special                                   1
        zfs_vdev_aggregate_trim                                        0
        zfs_vdev_aggregation_limit                               1048576
        zfs_vdev_aggregation_limit_non_rotating                   131072
        zfs_vdev_async_read_max_active                                 3
        zfs_vdev_async_read_min_active                                 1
        zfs_vdev_async_write_active_max_dirty_percent                 60
        zfs_vdev_async_write_active_min_dirty_percent                 30
        zfs_vdev_async_write_max_active                               10
        zfs_vdev_async_write_min_active                                2
        zfs_vdev_cache_bshift                                         16
        zfs_vdev_cache_max                                         16384
        zfs_vdev_cache_size                                            0
        zfs_vdev_default_ms_count                                    200
        zfs_vdev_initializing_max_active                               1
        zfs_vdev_initializing_min_active                               1
        zfs_vdev_max_active                                         1000
        zfs_vdev_min_ms_count                                         16
        zfs_vdev_mirror_non_rotating_inc                               0
        zfs_vdev_mirror_non_rotating_seek_inc                          1
        zfs_vdev_mirror_rotating_inc                                   0
        zfs_vdev_mirror_rotating_seek_inc                              5
        zfs_vdev_mirror_rotating_seek_offset                     1048576
        zfs_vdev_ms_count_limit                                   131072
        zfs_vdev_queue_depth_pct                                    1000
        zfs_vdev_raidz_impl   cycle [fastest] original scalar sse2 ssse3
        zfs_vdev_read_gap_limit                                    32768
        zfs_vdev_removal_max_active                                    2
        zfs_vdev_removal_min_active                                    1
        zfs_vdev_scheduler                                        unused
        zfs_vdev_scrub_max_active                                      2
        zfs_vdev_scrub_min_active                                      1
        zfs_vdev_sync_read_max_active                                 10
        zfs_vdev_sync_read_min_active                                 10
        zfs_vdev_sync_write_max_active                                10
        zfs_vdev_sync_write_min_active                                10
        zfs_vdev_trim_max_active                                       2
        zfs_vdev_trim_min_active                                       1
        zfs_vdev_write_gap_limit                                    4096
        zfs_zevent_cols                                               80
        zfs_zevent_console                                             0
        zfs_zevent_len_max                                           512
        zfs_zil_clean_taskq_maxalloc                             1048576
        zfs_zil_clean_taskq_minalloc                                1024
        zfs_zil_clean_taskq_nthr_pct                                 100
        zil_maxblocksize                                          131072
        zil_nocacheflush                                               0
        zil_replay_disable                                             0
        zil_slog_bulk                                             786432
        zio_deadman_log_all                                            0
        zio_dva_throttle_enabled                                       1
        zio_requeue_io_start_cut_in_line                               1
        zio_slow_io_ms                                             30000
        zio_taskq_batch_pct                                           75
        zvol_inhibit_dev                                               0
        zvol_major                                                   230
        zvol_max_discard_blocks                                    16384
        zvol_prefetch_bytes                                       131072
        zvol_request_sync                                              0
        zvol_threads                                                  32
        zvol_volmode                                                   1

VDEV cache disabled, skipping section

ZIL committed transactions:                                           16
        Commit requests:                                              62
        Flushes to stable storage:                                    62
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:        3.7 KiB          4


grep . /sys/module/zfs/parameters/zfs_arc_meta*
/sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts:4096
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:75
/sys/module/zfs/parameters/zfs_arc_meta_min:0
/sys/module/zfs/parameters/zfs_arc_meta_prune:10000
/sys/module/zfs/parameters/zfs_arc_meta_strategy:1


mailinglists35 commented Feb 12, 2020

The workaround had only a temporary effect -- arc_prune is spinning again.

Update: I had to increase zfs_arc_max to end the spinning.
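
By "increase" I mean writing a larger byte value into the runtime tunable, e.g. bumping the 1 GiB limit shown above to 2 GiB (the exact value is just an example):

root@zfs:~# echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max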

mailinglists35 commented:

Eventually it started spinning again despite the increased zfs_arc_max.


dmaziuk commented Feb 12, 2020

Hasn't happened here -- knock on wood -- but I'm done copying lots of files. Mine's got 64G RAM and a 160G cache SSD; the ARC size is left at the default. It's a 2x8-core Opteron, and it didn't lock up until it had arc_prune spinning on every core. (I don't have a single one in top now.)


drescherjm commented Feb 12, 2020

I saw this arc_prune behavior last year on one server with 32 GB of ECC RAM. I had to disable the weekly scrubs because of it. I think there is another thread where I commented about that. Since I have just moved to a new building and the network topology is completely different, I can't test now.

mailinglists35 commented:

I'm seeing the issue again after a reboot, on a simple zfs umount of a newly created pool.

schallee commented:

I thought I'd add a note that I have also seen this on Debian buster with the Debian backports version on amd64. It did clear itself out after some time (hours, if I remember right). If I see it again, I'll try to get more information.

A couple of data points worth noting: there is no cache device, the box only receives snapshots from a different system, there is little if any user I/O, and no network shares are exported from it.


rwarren commented Mar 8, 2020

I'm seeing a similar problem (arc_reclaim and arc_prune chewing up CPU). Bumping zfs_arc_meta_limit_percent only provided a temporary reprieve while arc_meta_used filled up; bumping it all the way to 100% doesn't fix the issue.

Why is there so much metadata in the ARC in the first place? If relevant, the machine in question is a dedicated backup machine, currently doing nothing other than being the receiving end of a lot of zfs send -i "backups".
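
For what it's worth, the breakdown of what that metadata consists of is visible in the per-category *_size counters in arcstats, e.g. (the full dump is included below):

russ@pth-hs1:~$ grep -E '^(hdr|metadata|dbuf|dnode|bonus|l2_hdr)_size ' /proc/spl/kstat/zfs/arcstats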

System info:

russ@pth-hs1:~$ cat /sys/module/zfs/version
0.7.5-1ubuntu16.6
russ@pth-hs1:~$ uname -a
Linux pth-hs1 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
russ@pth-hs1:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

Here is the current state of the ARC (allowing 100% arc_meta, but still having prune issues):

russ@pth-hs1:~$ grep dnode /proc/spl/kstat/zfs/arcstats
dnode_size                      4    8716864
arc_dnode_limit                 4    1684769382
russ@pth-hs1:~$ egrep "(arc_meta_used|arc_meta_limit)" /proc/spl/kstat/zfs/arcstats                        
arc_meta_used                   4    16847797336
arc_meta_limit                  4    16847693824
russ@pth-hs1:~$ grep -H . /sys/module/zfs/parameters/zfs_arc_{meta_limit{,_percent},dnode_limit{,_percent}}
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_limit_percent:100
/sys/module/zfs/parameters/zfs_arc_dnode_limit:0
/sys/module/zfs/parameters/zfs_arc_dnode_limit_percent:10
russ@pth-hs1:~$ cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 4608 41705217416 3820892427321077
name                            type data
hits                            4    7413219043
misses                          4    5467855407
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    7362177240
demand_metadata_misses          4    5453086223
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    51041803
prefetch_metadata_misses        4    14769184
mru_hits                        4    165815182
mru_ghost_hits                  4    1446761
mfu_hits                        4    7216088077
mfu_ghost_hits                  4    4321466
deleted                         4    112910070
mutex_miss                      4    40132
access_skip                     4    3387
evict_skip                      4    15984980671
evict_not_enough                4    22139723
evict_l2_cached                 4    0
evict_l2_eligible               4    948597889024
evict_l2_ineligible             4    1871813890048
evict_l2_skip                   4    0
hash_elements                   4    428907
hash_elements_max               4    2104109
hash_collisions                 4    43345413
hash_chains                     4    20493
hash_chain_max                  4    7
p                               4    16200329216
c                               4    16847693824
c_min                           4    1052980864
c_max                           4    16847693824
size                            4    16847639536
compressed_size                 4    16577940480
uncompressed_size               4    53259091456
overhead_size                   4    114052608
hdr_size                        4    141165920
data_size                       4    0
metadata_size                   4    16691993088
dbuf_size                       4    3742544
dnode_size                      4    8716864
bonus_size                      4    2021120
anon_size                       4    296960
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    31092736
mru_evictable_data              4    0
mru_evictable_metadata          4    27729920
mru_ghost_size                  4    66322432
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    66322432
mfu_size                        4    16660603392
mfu_evictable_data              4    0
mfu_evictable_metadata          4    16499196928
mfu_ghost_size                  4    21757952
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    21757952
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    33695387648
memory_free_bytes               4    13351628800
memory_available_bytes          3    12825141248
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    67235209
arc_meta_used                   4    16847639536
arc_meta_limit                  4    16847693824
arc_dnode_limit                 4    1684769382
arc_meta_max                    4    16913555592
arc_meta_min                    4    16777216
sync_wait_for_async             4    57
demand_hit_predictive_prefetch  4    2989
arc_need_free                   4    0
arc_sys_free                    4    526490432

@rwarren

rwarren commented Mar 8, 2020

arc_prune settled down after a few hours and CPU usage is now back to normal levels. Given the timing, I'm unclear whether bumping zfs_arc_meta_limit_percent to 100% was the difference maker, or whether I just needed to wait a few more hours.

Below is a crude view of the incident from a CPU load perspective across all CPUs, where the middle marker is when I was bumping zfs_arc_meta_limit_percent up, ultimately leaving it at 100%:

[image: per-CPU load graph across the incident]

I don't know enough about ARC performance to know why metadata would be such a significant fraction, nor whether it is a long-term problem to move zfs_arc_meta_limit_percent from its default value of 75% to 100%.
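
For reference, a minimal sketch of how this tunable can be changed both at runtime and persistently (the value 100 is just the one under discussion here, not a recommendation):

# At runtime, takes effect immediately
echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent

# Persistently, so the setting survives a reboot / module reload
echo "options zfs zfs_arc_meta_limit_percent=100" >> /etc/modprobe.d/zfs.conf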

@devZer0

devZer0 commented Mar 11, 2020

maybe related/duplicate: #10119

@setharnold

For what it's worth, whenever I run a workload on my system that I know will peg the arc_prune threads, I run this in another terminal:

while true ; do zpool iostat 10 12 ; echo 2 > /proc/sys/vm/drop_caches ; done

It's blunt and brutal, but it stops those arc_prune threads from spending 50% of the CPUs trying to acquire spinlocks. It might save others from reboots, anyway.

Thanks

@mailinglists35

Not for me; I already tried that, and eventually it still ends in a system hang.

@alek-p
Contributor

alek-p commented May 14, 2020

might be related to #7559

ghost pushed a commit to truenas/zfs that referenced this issue Feb 2, 2022
On FreeBSD vnode reclamation is single-threaded, protected by single
global lock.  Linux seems to be able to use a thread per mount point,
but at this time it creates more harm than good.

Reduce number of threads to 1, adding tunable in case somebody wants
to try more.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12896
Issue openzfs#9966
tonyhutter pushed a commit that referenced this issue Feb 3, 2022
On FreeBSD vnode reclamation is single-threaded, protected by single
global lock.  Linux seems to be able to use a thread per mount point,
but at this time it creates more harm than good.

Reduce number of threads to 1, adding tunable in case somebody wants
to try more.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12896
Issue #9966
@stuartthebruce

I just had another EL8.5 NFS server running zfs-2.1.2 get stuck with this problem. Is there an estimated timeline for releasing 2.1.3 with the initial set of patches for this?

@behlendorf
Contributor

We're finalizing and testing the 2.1.3 patch stack at the moment in PR 13063. It will include at least one change which should help with this.

@stuartthebruce

Thanks, I have just subscribed to that PR to track its progress--much appreciated!

@stuartthebruce

FYI, I have observed a 2.1.4 system unable to keep up with arc pruning using the new default of a single thread (#13231).
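
(For anyone wanting to experiment with more prune threads: the commit referenced earlier in this thread added a tunable for the thread count. Assuming it is named zfs_arc_prune_task_threads, which you should verify against your installed version, a sketch would be:)

# Hypothetical parameter name -- confirm it exists on your build first
cat /sys/module/zfs/parameters/zfs_arc_prune_task_threads

# Try more than one prune thread; the value here is only an example
echo 4 > /sys/module/zfs/parameters/zfs_arc_prune_task_threads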

@setharnold

Hah, I love that on that huge system (by my standards :) the extra latency is going from 0.002s to 0.2s for a write operation. Even that worst case would have been a gloriously fast interaction compared to my Very Bad Days.

@nvtkaszpir

nvtkaszpir commented May 29, 2022

Hey, looks like I got bitten by that problem.
The system slowed to a crawl, with arc_prune just wasting every moment of CPU time :)

In my case I believe the problem was caused by a few things happening at once:

  • docker storage on zfs
  • zfs-auto-snapshot
  • rsync with backup coming from remote machine into the host with zfs

For some time the system struggled to stay stable but managed to keep working; a second arc_prune flood then rendered it totally unresponsive, and after 6h I decided to force a power cycle to get it back into operation.
For now system functionality has recovered, except for Docker (not going to cry about that; there were something like 5 containers that were slated for removal anyway).

Looks like keeping Docker storage on ZFS together with auto-snapshots was a really bad idea in the end: zfs-auto-snapshot was making a snapshot for each Docker layer, so it ended up in the thousands.
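
One possible mitigation for that last point, assuming the usual zfs-auto-snapshot behaviour of honouring the com.sun:auto-snapshot user property (the dataset name below is hypothetical):

# Exclude the Docker dataset tree from zfs-auto-snapshot; children inherit the property
zfs set com.sun:auto-snapshot=false tank/docker

# Verify the property is applied recursively
zfs get -r com.sun:auto-snapshot tank/docker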

@sjau

sjau commented Jun 11, 2022

The one server I had issues with now works fine with zfs 2.1.4.

Not sure if it's at all relevant: I have several servers/notebooks running NixOS with ZFS. The only one that had a problem with 2.1.1 was an Intel-based server; everything else is AMD. It probably shouldn't matter, it's just something I noticed now.

@nvtkaszpir

nvtkaszpir commented Jun 28, 2022

Hmm, experienced the issue again today.
I noticed that running backups with rsync (via backintime) triggers this, because backintime creates a lot of hardlinks.
Add mlocate from cron on top of that, and that may also be a factor.

I guess I will have to limit the number of backups and upgrade the box to 22.04 to get zfs 2.1.4.

@nvtkaszpir

nvtkaszpir commented Jun 28, 2022

Example of the issue yesterday; times are in CET (UTC+2). At 9:28 I executed reboot --force because I was tired of waiting.
The holes in the graphs are because the metrics are gathered on the same host, on ZFS ;)

[image: host metrics graphs around the forced reboot]

nicman23 pushed a commit to nicman23/zfs that referenced this issue Aug 22, 2022
On FreeBSD vnode reclamation is single-threaded, protected by single
global lock.  Linux seems to be able to use a thread per mount point,
but at this time it creates more harm than good.

Reduce number of threads to 1, adding tunable in case somebody wants
to try more.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12896
Issue openzfs#9966
nicman23 pushed a commit to nicman23/zfs that referenced this issue Aug 22, 2022
On FreeBSD vnode reclamation is single-threaded, protected by single
global lock.  Linux seems to be able to use a thread per mount point,
but at this time it creates more harm than good.

Reduce number of threads to 1, adding tunable in case somebody wants
to try more.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12896
Issue openzfs#9966
snajpa pushed a commit to vpsfreecz/zfs that referenced this issue Nov 15, 2022
On FreeBSD vnode reclamation is single-threaded, protected by single
global lock.  Linux seems to be able to use a thread per mount point,
but at this time it creates more harm than good.

Reduce number of threads to 1, adding tunable in case somebody wants
to try more.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12896
Issue openzfs#9966
@gertvdijk

I feel like I'm also hit by this 'arc_prune storm' issue on 2.1.6 (sorry, not on a newer 2.1.x because of the lack of newer packages in the Ubuntu PPA). What I noticed on the arcstats graphs is a staircase pattern on the arc_prune counter: it's incremented by roughly 5k every time this happens, until data_size is down to 0.

[screenshot: arcstats graphs showing the staircase arc_prune counter]

Also interesting to see there's significant increase of arc_meta_used right before it triggers.

Then I had a look at the code responsible for this and noticed a possible issue/cause.

Could it be that this part, with the goto restart at the end, restarts the eviction up to 4096 times but never recalculates meta_used (passed by value on line 4396) in subsequent iterations, and thus never takes the pages successfully evicted in the previous pass into account, looping with the same value of adjustmnt? (adjustmnt = meta_used - arc_meta_limit; on line 4412 looks like a constant across the restarts.)

zfs/module/zfs/arc.c

Lines 4395 to 4412 in 184508c

static uint64_t
arc_evict_meta_balanced(uint64_t meta_used)
{
        int64_t delta, prune = 0, adjustmnt;
        uint64_t total_evicted = 0;
        arc_buf_contents_t type = ARC_BUFC_DATA;
        int restarts = MAX(zfs_arc_meta_adjust_restarts, 0);

restart:
        /*
         * This slightly differs than the way we evict from the mru in
         * arc_evict because we don't have a "target" value (i.e. no
         * "meta" arc_p). As a result, I think we can completely
         * cannibalize the metadata in the MRU before we evict the
         * metadata from the MFU. I think we probably need to implement a
         * "metadata arc_p" value to do this properly.
         */
        adjustmnt = meta_used - arc_meta_limit;

The value of meta_used isn't updated in between; it then hits:

        goto restart;

which is taken up to 4096 times by default:

int zfs_arc_meta_adjust_restarts = 4096;

zfs/module/zfs/arc.c

Lines 4475 to 4478 in 184508c

        if (restarts > 0) {
                restarts--;
                goto restart;
        }

This is all still present in the 2.1.10-staging branch. I'm not an expert on C and the ZFS code base but this seems... odd. 🤔

I'm aware of the full rewrite of the ARC code on master/2.2, but I would really appreciate some clue on how to remove the trigger of this storm without having to install master/2.2. Is it safe to reduce the zfs_arc_meta_adjust_restarts kernel parameter to something like 100? Or should I set zfs_arc_meta_limit_percent=100, which sounds very unsafe on 2.1.6, so that I should plan an update to 2.1.9/2.1.10 soon?
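
A minimal sketch of what trying the first option would look like; 100 is just the example value from the question above, and whether it actually helps is exactly what I'm unsure about:

# Current value (default is 4096)
cat /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts

# Reduce it at runtime; this is a dynamic module parameter
echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts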

Happy to provide more information as well, got all statistics at hand here.

@shodanshok
Contributor

Does avail (from arcstat) show negative values when the issue happens?

@amotin any ideas on what is causing the issue in this case? Thanks.

@gertvdijk

gertvdijk commented Apr 11, 2023

Hi @shodanshok, you're not going to believe it, but that's the single metric I don't have; I was pulling my hair out over why... turns out it's a bug in node_exporter. 😣 prometheus/node_exporter#2656
Will try to get this information when it happens next time.

Saw your commits and issue comments before already, proven very helpful along the way, thanks! 🙏🏼

Update: a new incident happened - I don't see avail running below 0 here:

arcstat output around incident ~ 13:30-13:31
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
13:29:46  447K   10K      2  7.8K    3  2.7K    1   350    0  208G  208G    16G
13:29:47  381K  9.1K      2  6.9K    3  2.2K    1   145    0  208G  208G    16G
13:29:48  406K  9.0K      2  7.0K    2  2.0K    1   134    0  208G  208G    16G
13:29:49  832K  9.7K      1  7.7K    1  2.1K    0   162    0  208G  209G    16G
13:29:50  381K   12K      3  7.7K    3  4.6K    2   125    0  208G  208G    16G
13:29:51  263K   15K      5  9.9K    5  5.9K    6   140    0  208G  208G    16G
13:29:52  242K  9.9K      4  8.3K    5  1.6K    1    74    0  208G  208G    16G
13:29:53  542K   10K      1  7.5K    2  2.7K    1    78    0  209G  209G    16G
13:29:54  989K   15K      1   10K    1  4.2K    1    94    0  209G  209G    16G
13:29:55  454K   23K      5   18K    6  4.7K    2   232    0  209G  208G    16G
13:29:56  1.1M   16K      1   12K    2  3.9K    0   229    0  209G  208G    16G
13:29:57  1.2M  8.3K      0  6.7K    1  1.5K    0   113    0  209G  209G    16G
13:29:58  1.0M   13K      1   10K    1  2.5K    0   184    0  209G  209G    16G
13:29:59  1.2M  8.8K      0  6.1K    0  2.7K    0   102    0  209G  209G    16G
13:30:00  1.0M   11K      1  8.2K    1  2.9K    0   139    0  209G  209G    16G
13:30:01  669K   13K      1  8.3K    2  5.0K    1   386    0  209G  209G    16G
13:30:02  460K   16K      3   13K    3  2.6K    2   225    0  209G  209G    16G
13:30:03  147K   19K     13   13K   11  6.7K   20   264    0  209G  209G    16G
13:30:04  689K   19K      2   10K    2  8.9K    3   379    0  209G  209G    16G
13:30:05  575K   17K      2   10K    2  6.6K    4   560    0  209G  209G    16G
13:30:06  362K  9.9K      2  5.4K    2  4.5K    3   257    0  209G  209G    16G
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
13:30:07  413K   13K      3  9.0K    3  4.3K    2   381    0  209G  209G    16G
13:30:08  2.0M  7.5K      0  6.1K    0  1.3K    0   456    0  205G  209G    19G
13:30:09  2.7M  6.7K      0  5.3K    0  1.4K    0   285    0  201G  209G    22G
13:30:10  2.1M  9.2K      0  7.1K    0  2.1K    0   345    0  198G  209G    25G
13:30:11  2.4M  5.4K      0  2.6K    0  2.8K    0   465    0  194G  209G    29G
13:30:12  2.3M   852      0   386    0   466    0   311    0  191G  209G    32G
13:30:13  2.4M  1.2K      0   291    0   918    0   384    0  187G  209G    36G
13:30:14  2.3M   588      0   218    0   370    0   309    0  184G  209G    39G
13:30:15  2.2M  1.8K      0   395    0  1.4K    0   603    0  181G  209G    42G
13:30:16  2.4M   685      0   241    0   444    0   230    0  177G  209G    45G
13:30:17  2.5M  1.9K      0   350    0  1.5K    0   818    0  174G  209G    48G
13:30:18  2.4M   805      0   276    0   529    0    18    0  171G  209G    52G
13:30:19  2.5M   254      0   170    0    84    0    31    0  166G  209G    56G
13:30:20  2.5M   581      0   443    0   138    0   416    0  163G  209G    59G
13:30:21  2.4M   767      0   256    0   511    0   120    0  159G  209G    63G
13:30:22  2.3M   322      0    85    0   237    0   197    0  155G  209G    67G
13:30:23  2.3M   350      0   133    0   217    0   132    0  151G  209G    71G
13:30:24  2.5M   174      0    98    0    76    0    31    0  148G  209G    74G
13:30:25  2.3M   478      0   206    0   272    0    32    0  144G  209G    78G
13:30:26  2.5M   571      0   226    0   345    0   129    0  140G  209G    82G
13:30:27  2.3M   205      0    85    0   120    0   104    0  136G  209G    86G
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
13:30:28  2.5M   470      0   201    0   269    0   116    0  132G  209G    90G
13:30:29  2.6M   385      0   211    0   174    0     6    0  128G  209G    94G
13:30:30  2.2M   426      0   230    0   196    0    75    0  124G  209G    98G
13:30:31  2.3M   258      0    98    0   160    0    48    0  120G  209G   102G
13:30:32  2.4M   836      0   302    0   534    0    27    0  116G  209G   106G
13:30:33  2.6M   855      0   209    0   646    0    95    0   90G  209G   133G
13:30:34  2.6M  1.3K      0   320    0   995    0   152    0   60G  209G   163G
13:30:35  2.6M   526      0   266    0   260    0    80    0   36G  209G   188G
13:30:36  2.2M   483      0   107    0   376    0   290    0   22G  209G   202G
13:30:37  2.4M  3.3K      0   283    0  3.0K    0  2.4K    0   22G  209G   203G
13:30:38  2.3M  2.4K      0   184    0  2.2K    0  2.2K    0   21G  209G   205G
13:30:39  2.3M  1.3K      0   214    0  1.1K    0  1000    0   21G  209G   205G
13:30:40  2.3M   212      0    55    0   157    0    48    0   21G  209G   205G
13:30:41  2.5M   617      0   184    0   433    0   180    0   21G  209G   205G
13:30:42  2.5M   910      0   169    0   741    0   149    0   22G  209G   205G
13:30:43  2.4M   974      0   156    0   818    0   270    0   22G  209G   205G
13:30:44  2.5M   960      0   240    0   720    0    69    0   22G  209G   205G
13:30:45  2.4M  2.1K      0   344    0  1.8K    0   236    0   22G  209G   205G
13:30:46  2.4M  1.8K      0   447    0  1.3K    0   191    0   22G  209G   204G
13:30:47  2.5M  2.9K      0   897    0  2.0K    0   553    0   22G  209G   204G
13:30:48  2.7M  4.2K      0  1.6K    0  2.6K    0   588    0   22G  209G   204G 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
13:30:49  2.5M   20K      0  5.0K    0   15K    1  1.2K    0   22G  209G   204G
13:30:50  2.5M   35K      1  7.6K    0   27K    2  9.8K    0   23G  209G   204G
13:30:51  2.6M   31K      1  7.8K    0   23K    1  6.8K    0   23G  209G   203G
13:30:52  2.4M   35K      1  9.7K    0   26K    2  9.8K    0   24G  209G   203G
13:30:53  2.3M   23K      1  4.7K    0   19K    1   10K    0   25G  209G   202G
13:30:54  2.3M   13K      0  5.6K    0  8.0K    0  1.7K    0   25G  209G   202G
13:30:55  2.3M  8.3K      0  1.6K    0  6.7K    0  2.2K    0   25G  209G   202G
13:30:56  2.1M  8.9K      0  2.7K    0  6.2K    0  2.1K    0   25G  209G   202G
13:30:57  2.2M  8.8K      0  4.1K    0  4.7K    0   424    0   25G  209G   202G
13:30:58  2.1M   10K      0  4.0K    0  6.8K    0   931    0   26G  209G   202G
13:30:59  2.2M   15K      0  4.4K    0   10K    1   977    0   26G  209G   201G
13:31:00  2.2M   19K      0  6.2K    0   13K    1  1.2K    0   26G  209G   201G
13:31:01  2.1M   32K      1   12K    1   20K    2  3.5K    0   27G  209G   201G
13:31:02  2.2M   34K      1   15K    1   19K    1  3.5K    0   27G  209G   200G
13:31:03  2.2M   33K      1   11K    0   22K    2  7.5K    0   28G  209G   200G
13:31:04  2.2M   28K      1  8.1K    0   20K    2  8.9K    0   28G  209G   199G
13:31:05  2.1M   13K      0  4.8K    0  8.7K    0  3.0K    0   28G  209G   199G
13:31:06  2.0M   19K      0  2.8K    0   16K    1  9.5K    0   29G  209G   198G
13:31:07  1.8M   20K      1  4.7K    0   16K    1  6.3K    0   29G  209G   198G
13:31:08  2.0M   44K      2   11K    1   33K    3   10K    0   30G  209G   197G
13:31:09  1.8M   57K      3   25K    2   32K    4  4.9K    0   31G  209G   197G
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
13:31:10  1.6M   38K      2   17K    2   20K    2  3.6K    0   31G  209G   196G
13:31:11  1.9M   71K      3   33K    3   37K    4  7.4K    0   32G  209G   195G
13:31:12  1.7M   47K      2   24K    2   22K    2  2.2K    0   32G  209G   195G
13:31:13  1.8M   53K      3   30K    3   22K    2  1.8K    0   32G  209G   194G
13:31:14  2.0M   34K      1   17K    1   17K    2  1.4K    0   32G  209G   194G
13:31:15  1.8M   55K      3   28K    2   26K    3  3.8K    0   33G  209G   194G
13:31:16  1.8M   54K      3   32K    3   22K    2  2.2K    0   33G  209G   193G
13:31:17  1.8M   48K      2   35K    3   13K    1  1.4K    0   34G  209G   193G
13:31:18  1.9M   28K      1   15K    1   13K    1  1.0K    0   34G  209G   193G
13:31:19  1.9M   26K      1   12K    1   13K    1  1.1K    0   34G  209G   193G
13:31:20  1.9M   39K      2   24K    2   14K    1  1.2K    0   34G  209G   192G
13:31:21  2.0M   52K      2   33K    2   19K    2  1.8K    0   35G  209G   192G
13:31:22  1.8M   40K      2   26K    2   13K    1  1.2K    0   35G  209G   192G
13:31:23  1.8M   32K      1   15K    1   16K    1   678    0   35G  209G   192G
13:31:24  1.8M   40K      2   24K    2   15K    1  1.3K    0   35G  209G   191G
13:31:25  1.9M   48K      2   30K    2   18K    2  1.2K    0   35G  209G   191G
13:31:26  1.9M   43K      2   28K    2   14K    1  1.1K    0   35G  209G   191G
13:31:27  1.9M   40K      2   22K    2   17K    2   931    0   35G  209G   191G
13:31:28  1.9M   31K      1   18K    1   12K    1   869    0   35G  209G   191G
13:31:29  1.9M   31K      1   17K    1   14K    1  1.1K    0   36G  209G   190G
13:31:30  1.9M   27K      1   16K    1   10K    1  1.2K    0   36G  209G   190G

@amotin
Member

amotin commented Apr 11, 2023

@gertvdijk Your analysis about meta_used not being updated between iterations looks valid. As far as I can see, the change was made in 37fb3e4 by @pcd1193182 as a performance optimization that appears to also change the code logic. The proper fix would probably be to update meta_used between the iterations, but if you need to do something without recompiling, a dramatic reduction of zfs_arc_meta_adjust_restarts may reduce the issue too. It is not a real fix though, since depending on the evicted amount you may end up either not evicting enough or not pruning enough.

@shodanshok
Contributor

@amotin Can this be fixed in the upcoming 2.1.10 release, while having your more comprehensive ARC rewrite in 2.2?

@chrisrd
Contributor

chrisrd commented Apr 12, 2023

@gertvdijk That lack of recalculating meta_used indeed looks highly suspicious.

Are you willing/able to test the attached patch? Note: the patch has been compile-tested only.
zfs-9966.txt

@gertvdijk

gertvdijk commented Apr 12, 2023

Thanks everyone for the replies so far.
I have adjusted to a somewhat conservative zfs_arc_meta_adjust_restarts=1024 for now. I'm not expecting a miracle, but I will observe the next incident with this change, possibly lowering it further depending on the outcome.

Extrapolating the ARC stats along the failure pattern, I should face the next incident within the next 2-4 hours. 🥲

To elaborate on the I/O patterns on this machine, which may be relevant in this case:

'Thousands' of disk images being accessed over NFS, opened as file-backed loop devices on other machines. The images are sparse files on a plain single ZFS dataset. ZFS+NFS is basically the only thing this machine does. Continuous frequent snapshots plus removal of them, combined with send/receive (replication). Mostly small random reads and sync (over)writes across the files, so I'm using a fairly small recordsize of 16K as a balance between low/good write amplification and not being penalized too much capacity-wise with the RAIDZ vdev setup. The pool layout is a 10-disk all-NVMe-SSD RAIDZ2 with an extra-low-latency NVMe LOG device (again, mostly to coalesce the random writes and reduce the WAF). No L2ARC cache vdev. No other pools active on this server. Tweaks: ARC max set to 83% of the 256GB RAM, zfs_compressed_arc_enabled=0 (all files are encrypted and the data is not compressible anyway), zfs_per_txg_dirty_frees_percent=0 (to fix a performance issue on replication with the large number of sparse files).
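
For context only, a rough sketch of how the settings mentioned above are typically applied (the dataset name is hypothetical, and the values are simply the ones described, not recommendations):

# 16K recordsize on the dataset holding the disk images
zfs set recordsize=16K tank/images

# Module tunables, set at runtime
echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
echo 0 > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent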

@chrisrd Thanks also for your confirmation and patch on this!
I don't think I can test this on the system where it happens, unfortunately, because - of course - it only happens in the production environment. 🫠 Recompiling and reloading the modules would mean downtime; a window could be planned, but preferably not just for testing.
I could do a general build with this patch on another system, but I'm not sure that would be very helpful compared to a test by someone experienced with ZFS testing.


Update: the incident happened again, around the time I predicted. With just N=1 after setting zfs_arc_meta_adjust_restarts=1024 (1/4 of the default value) this is definitely a preliminary finding, but the impact appears to have been reduced by roughly the same factor. The duration of 100%-CPU time was reduced to about a minute, the arc_prune counter incremented by about 1.5k instead of 5k, and the impact on the I/O wait of users of this ZFS filesystem was reduced significantly. I'm considering lowering this tunable further as a work-around while awaiting a proper fix.


Update 2: Reducing zfs_arc_meta_adjust_restarts proved to be inadequate for preventing the 'arc prune storm' on the system.

@siebenmann
Contributor

It turns out to be possible to derive the 'avail' / 'memory_available_bytes' information from other metrics that the Prometheus node_exporter provides, because in the code the 'avail' computation is just 'memory_free_bytes - arc_sys_free', and better yet the latter is generally a constant for a given system.

(I discovered this today while digging into the statistics. See module/os/linux/zfs/arc_os.c for arc_available_memory(), which is directly used to generate the kstat when you look, and internally in the ARC code in relevant places.)
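
As a rough illustration of that derivation, a sketch that computes the same value straight from the kstats (the third column of arcstats is the value field):

awk '$1 == "memory_free_bytes" { free = $3 }
     $1 == "arc_sys_free"      { sysfree = $3 }
     END { print free - sysfree }' /proc/spl/kstat/zfs/arcstats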

@gertvdijk

gertvdijk commented Apr 18, 2023

Reducing zfs_arc_meta_adjust_restarts proved to be inadequate for preventing the 'arc prune storm' on the system, unfortunately. I have found another interesting tweak that does seem to work, though!

Setting zfs_arc_meta_strategy=0 (default=1, docs link) skips over the suspicious/broken code in arc_evict_meta_balanced().

int zfs_arc_meta_strategy = ARC_STRATEGY_META_BALANCED;

zfs/module/zfs/arc.c

Lines 4523 to 4526 in 184508c

        if (zfs_arc_meta_strategy == ARC_STRATEGY_META_ONLY)
                return (arc_evict_meta_only(meta_used));
        else
                return (arc_evict_meta_balanced(meta_used));

Surprisingly, this is a dynamic tunable, so without any downtime I was able to change it last Thursday:

$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_meta_strategy

Again, this is a preliminary conclusion, but two full-load days without an arc_prune storm so far looks significant.

HTH


Total ARC size, metadata and data size plotted; the vertical blue line indicates the change of zfs_arc_meta_strategy to 0. Still seeing suboptimal bumps in the metadata/data ratio at the times it used to collapse before. arc_meta_limit is sometimes exceeded, and I'm observing > 75% metadata with zfs_arc_meta_limit=0 and zfs_arc_meta_limit_percent=75, but at least it doesn't fully collapse. 😃
[screenshot: total ARC, metadata and data size over time]
The arc_prune counter has not increased anymore since.
[screenshot: arc_prune counter over time]


⚠️ Update: setting zfs_arc_meta_strategy to 0 appears very dangerous on 0.8.x. I'm observing OOMs after rolling this out on other nodes that are not yet running 2.1.x.

@ednadolski-ix
Contributor

As this has been open for a while, I'd like to inquire if folks are still seeing this issue, and (importantly) if you have any specific way(s) to repro the condition, so that it can be investigated.

@nvtkaszpir

I remember I had issues back then on 0.8.x; then I switched to 2.x and have had no issues since.
Right now I'm on:

$ zfs --version
zfs-2.1.6-0york1~20.04
zfs-kmod-2.1.6-0york1~20.04
