
arc_adapt left spinning after rsync with lots of small files #3303

Closed
angstymeat opened this issue Apr 15, 2015 · 76 comments

@angstymeat

This is split off from #3235. The symptom is that after a large set of rsyncs containing lots of small files, I'm left with arc_adapt spinning at around 3.3% CPU time.

I've compiled the info that @dweeezil wanted here: https://cloud.passcal.nmt.edu/index.php/s/l52T2UhZ0K7taY9.

The system is a Dell R515 with 16GB of RAM. The pool is a single raidz2 pool made up of seven 2TB SATA drives. Compression is enabled and set to lz4. atime is off.

The OS is Fedora 20 with the 3.17.8-200.fc20.x86_64 kernel.

The machine this happens on is an offsite server that hosts our offsite backups. There are 20+ rsync processes running that send large and small files from our internal systems to this system. The majority of the files are either large data files or the files you would typically find as part of a Linux installation (/usr, /bin, /var, etc.).

Also, about 50 home directories are backed up containing a mix of large and small files.

This takes about 45 minutes. One hour after these jobs are started, the email servers begin their backup (so there is usually a 15-minute delay between the start of one set of backups and another). Also, our large data collector sends its backups at this time. These are large files, and there are a lot of them. It is sometime during this 2nd-stage backup that this issue occurs.

During the backup, I have another process that periodically runs an echo 2 > /proc/sys/vm/drop_caches. I do this because once the ARC gets filled up, performance drops drastically. Without doing this, it can take up to 10 hours to perform a backup. With it, it takes less than 2 hours.
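
For illustration only (not a recommendation, and the actual script isn't shown here), a periodic job of this kind might look roughly like the following; the interval and the rsync check are assumptions:

# drop dentries and inodes every 5 minutes while any rsync is still running
while pgrep -x rsync >/dev/null; do
    sync
    echo 2 > /proc/sys/vm/drop_caches
    sleep 300
done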

This happens even if I do not run the periodic drop_caches, but it seems to occur less often.

The drop_caches is a relatively new addition to my scripting on this system, as this didn't appear to happen under 0.6.3. I don't know exactly when it started, but I'm pretty sure it was sometime around the kmem rework in 0.6.4.

I am unable to roll back and test under 0.6.3 because the new feature flags were enabled while I was testing 0.6.4. This unit is not a critical system, so I have quite a bit of leeway with it, as long as I don't have to destroy and recreate the pool. I usually use it to test new versions of ZFS since it gets a lot of disk usage.

@dweeezil
Contributor

@angstymeat I've started looking over your debug output. It's not really spinning in arc_adapt_thread() but, instead, what's likely happening is that arc_adjust_meta() is running through all 4096 iterations and not doing much of anything. The drop_caches is likely only making things worse which is why I never recommend using it as a workaround anymore.

You might try increasing zfs_arc_meta_prune from its default of 10000 to maybe 100000 or even higher. Unfortunately, I didn't ask you to put the arcstats in your debugging output but I've got a very good idea of what the numbers would look like. Your dnode cache (dnode_t) has got over 3 million entries and in order to beat down the ARC metadata, they've got to be cleaned out a lot faster.
I'll be interested to hear whether increasing zfs_arc_meta_prune helps. If it does, it seems we could add a heuristic to progressively multiply it internally if things aren't being freed rapidly enough.

If all else fails and you want to try something which should immediately free some metadata, try setting zfs_arc_meta_limit to a low-ish value.

Regardless, I think you should avoid using drop_caches if possible.
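
For reference, a minimal sketch of how these tunables can be adjusted, assuming the stock module parameter names and the /sys/module path used elsewhere in this thread (the values are examples only):

# runtime change, takes effect immediately
echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_meta_limit    # ~1GiB, illustrative
# persistent across reboots
echo "options zfs zfs_arc_meta_prune=100000" >> /etc/modprobe.d/zfs.conf

Writes under /sys/module/zfs/parameters apply to the running module without a reboot.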

@angstymeat
Author

I had been running with zfs_arc_meta_prune set to 100000, but with the same results. When I rebooted this system last, I set it back to the default.

I really want to avoid using drop_caches since it is such a hack. I've been using it because the backup times had become so high.

If 100,000 isn't enough, do you have a recommendation for a higher number? Should I try increasing the order of magnitude each time I test it? Is there an upper limit value that will cause more problems than it solves?

@angstymeat
Author

Also, does zfs_arc_meta_prune take effect immediately, or do I need to reboot?

@angstymeat
Author

I just tried setting `zfs_arc_meta_limit` to 100MB and upped `zfs_arc_meta_prune` to 1,000,000 without rebooting, but nothing has changed. arc_adapt is still running, and there's no change in the amount of memory in use.

@snajpa
Contributor

snajpa commented Apr 16, 2015

@dweeezil what's your reasoning behind going as far as to say it would be best to avoid drop_caches completely? Since the VFS dentry cache can't be bypassed, I see no option other than to use this as a workaround. I've set vfs_cache_pressure to a really high number (IIRC 1M), but over time the dentry cache still causes the ARC as a whole to shrink. My use case is OpenVZ containers, where others use KVM.
Even if the shrinkers registered with the VM could be prioritized, which is the only sensible solution I can think of, that still wouldn't do much, as there are other dynamic caches in the system.

I'm just curious: what negative side effects does drop_caches have, besides obviously dropping dentries even for non-ZFS data (though that's not a problem in my scenario)?
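
For concreteness, these are the knobs being discussed; a sketch with illustrative values, not recommendations:

sync
echo 1 > /proc/sys/vm/drop_caches                 # page cache only
echo 2 > /proc/sys/vm/drop_caches                 # reclaimable slab (dentries and inodes)
echo 3 > /proc/sys/vm/drop_caches                 # both
echo 1000000 > /proc/sys/vm/vfs_cache_pressure    # bias reclaim toward dentries/inodes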


@angstymeat
Author

I've set zfs_arc_meta_prune up to 50,000,000 and it's not helping much with performance, but I think I've seen larger amounts of cache cleared in arcstat.py.

My rsyncs are still running so I'm going to wait until it's finished and see if arc_adapt continues to run afterwards.

So far, I've tried this with zfs_arc_meta_prune set to 100,000 and 1,000,000 with the same results of arc_adapt continuing to run long after the rsync processes have finished.

Also, I'm seeing zfs_arc_meta_used exceeding zfs_arc_meta_limit by almost 600MB right now.

@dweeezil
Contributor

@snajpa Regarding potential issues when using drop-caches, see openzfs/spl#420 (comment).

@angstymeat Your issue certainly appears to be the same as the one that bc88866 and 2cbb06b were intended to fix. The question is why it's not working in your case. I suppose a good first step would be to watch the value of arc_prune in arcstats and make sure it's increasing.

As to increasing zfs_arc_meta_prune, now that I think about it, the default tuning of zfs_arc_meta_prune=10000 ought to be OK given that you've only got about 2 million znodes. With 2048 metadata iterations in arc_adjust_meta(), the final prune count would be 20.48M, which is much higher than the number of cached objects in your case.
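
A quick way to watch that counter, assuming the standard arcstats location shown later in this thread (a sketch):

# print arc_prune once per second; it should keep climbing if the prune callbacks run
while sleep 1; do awk '/^arc_prune /{print $3}' /proc/spl/kstat/zfs/arcstats; done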

@dweeezil
Contributor

I think I found the problem. torvalds/linux@9b17c6238 (appeared first in kernel 3.12) added a node id argument to nr_cached_objects and is causing our autoconf test to fail.

As a hack, you can try:

diff --git a/config/kernel-shrink.m4 b/config/kernel-shrink.m4
index 1c211ed..6e88a7e 100644
--- a/config/kernel-shrink.m4
+++ b/config/kernel-shrink.m4
@@ -72,7 +72,7 @@ AC_DEFUN([ZFS_AC_KERNEL_NR_CACHED_OBJECTS], [
        ZFS_LINUX_TRY_COMPILE([
                #include <linux/fs.h>

-               int nr_cached_objects(struct super_block *sb) { return 0; }
+               int nr_cached_objects(struct super_block *sb, int nid) { return 0; }

                static const struct super_operations
                    sops __attribute__ ((unused)) = {

I'll see if I can work up a proper autoconf test later today (which will handle the intermediate kernels). We'll need yet another flag to handle < 3.12 kernels which still do have the callback.
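
If you want to try the hack above, the rebuild steps would be roughly the following (a sketch; the patch file name is hypothetical, and this assumes an autotools source checkout):

cd zfs
patch -p1 < nr_cached_objects.diff    # the one-line change shown above
./autogen.sh && ./configure && make
# then reinstall and reload the modules the way you normally do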

@dweeezil
Contributor

And to make matters even more fun, torvalds/linux@4101b62 changed it yet again (post 4.0). This issue also applies to the free_cached_objects callback. I'm working on an enhanced autoconf test.

@snajpa
Contributor

snajpa commented Apr 16, 2015

@dweeezil If I understand this correctly, echoing 1 to drop_caches is OK when we're talking primarily about cached metadata - and that's what I've been doing from the start. You're right that echoing 2 might get the system stuck in reclaim ~forever; I've experienced that. It's a bad idea to do that on a larger system (RHEL6 kernel, 2.6.32 patched heavily).

@dweeezil
Contributor

I just posted pull request #3308 to properly enable the per-superblock shrinker callbacks.

I'm wondering, however, whether we ought to actually do something in the free_cached_objects function. It seems like we ought to call arc_do_user_prune() utilizing the count (which, of course, can't be done right now - it's static plus likely a major layering violation). This might obviate the need for the newly-added restart logic in arc_adjust_meta().

@dweeezil
Contributor

@snajpa Yes, my comments apply only to the echo 2 or echo 3 cases which I've seen suggested from time to time.

@angstymeat
Author

My backup has now been running for 15 hours. I'm going to stop it, apply #3308, leave zfs_arc_meta_prune at its default value, and try it again.

@dweeezil
Contributor

@angstymeat Please hold off just a bit on applying that patch. I'm trying to get a bit of initial testing done right now and am going to try to do something to zpl_free_cached_objects() as well.

@angstymeat
Author

I just finished compiling and I'm rebooting, but I'll hold off on doing anything.

@dweeezil
Contributor

@angstymeat I'm not going to have a chance to work with this more until later. The latest commit in the branch (6318203) should be safe; however, I'll be surprised if it helps. That said, it certainly shouldn't hurt matters any.

@angstymeat
Author

I'll try it out just to make sure nothing breaks, then.

@dweeezil
Contributor

@angstymeat As I pointed out in a recent comment in #3308, it's very unlikely to make any difference at all and may even make things worse. Have you been able to grab arcstats (arc_prune in particular) during the problem yet?

@angstymeat
Author

I wasn't expecting anything, and I tried it with the April 17th ZFS commits. I ran the backups and arc_adapt continued running afterwards. It's hard to tell, but it could be running with a little more CPU usage than before. Its range before was around 3% to 3.8%; now it looks like it is between 3.8% and 4.3%.

It's been going for about 6 hours now, and arc_prune is incrementing. It looks like it's incrementing around 3680 per second.

Here's arcstats:

name                            type data
hits                            4    44003792
misses                          4    35268521
demand_data_hits                4    2409447
demand_data_misses              4    296772
demand_metadata_hits            4    25134591
demand_metadata_misses          4    30445808
prefetch_data_hits              4    3111351
prefetch_data_misses            4    1837834
prefetch_metadata_hits          4    13348403
prefetch_metadata_misses        4    2688107
mru_hits                        4    11009034
mru_ghost_hits                  4    119660
mfu_hits                        4    16535007
mfu_ghost_hits                  4    145394
deleted                         4    34539715
recycle_miss                    4    32133498
mutex_miss                      4    131726
evict_skip                      4    113792369794
evict_l2_cached                 4    0
evict_l2_eligible               4    727929288704
evict_l2_ineligible             4    74060988416
hash_elements                   4    126993
hash_elements_max               4    619388
hash_collisions                 4    2202365
hash_chains                     4    3824
hash_chain_max                  4    6
p                               4    8338591744
c                               4    8338591744
c_min                           4    4194304
c_max                           4    8338591744
size                            4    7512477800
hdr_size                        4    51871896
data_size                       4    0
meta_size                       4    2098882560
other_size                      4    5361723344
anon_size                       4    18759680
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1571022848
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    509100032
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    0
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    963352932
arc_meta_used                   4    7512477800
arc_meta_limit                  4    6253943808
arc_meta_max                    4    7628319744

and again from 30 seconds later:

name                            type data
hits                            4    44003792
misses                          4    35268521
demand_data_hits                4    2409447
demand_data_misses              4    296772
demand_metadata_hits            4    25134591
demand_metadata_misses          4    30445808
prefetch_data_hits              4    3111351
prefetch_data_misses            4    1837834
prefetch_metadata_hits          4    13348403
prefetch_metadata_misses        4    2688107
mru_hits                        4    11009034
mru_ghost_hits                  4    119660
mfu_hits                        4    16535007
mfu_ghost_hits                  4    145394
deleted                         4    34539715
recycle_miss                    4    32133498
mutex_miss                      4    131726
evict_skip                      4    113792369794
evict_l2_cached                 4    0
evict_l2_eligible               4    727929288704
evict_l2_ineligible             4    74060988416
hash_elements                   4    126993
hash_elements_max               4    619388
hash_collisions                 4    2202365
hash_chains                     4    3824
hash_chain_max                  4    6
p                               4    8338591744
c                               4    8338591744
c_min                           4    4194304
c_max                           4    8338591744
size                            4    7512477800
hdr_size                        4    51871896
data_size                       4    0
meta_size                       4    2098882560
other_size                      4    5361723344
anon_size                       4    18759680
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1571022848
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    509100032
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    0
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    963412324
arc_meta_used                   4    7512477800
arc_meta_limit                  4    6253943808
arc_meta_max                    4    7628319744
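
As a side note, the per-second arc_prune rate implied by two samples like the ones above can be computed along these lines (a sketch):

a=$(awk '/^arc_prune /{print $3}' /proc/spl/kstat/zfs/arcstats)
sleep 30
b=$(awk '/^arc_prune /{print $3}' /proc/spl/kstat/zfs/arcstats)
echo "arc_prune rate: $(( (b - a) / 30 )) per second"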

@dweeezil
Contributor

@angstymeat Does this system have selinux enabled (and using a policy which supports zfs) and/or are there a lot of xattrs set on the files on the zfs filesystem? Is xattr=sa set?

The problem you're facing is that the new logic in arc_adjust_meta() is not making any progress because the kernel isn't able to free dentries, which effectively pins a whole lot of metadata in memory. My hunch is this may have something to do with dir-style xattrs.

I'm going to modify my test suite to use a dir-style xattr on each file to see whether I can duplicate this behavior.

@angstymeat
Author

No selinux, but xattr=sa is set on all of the filesystems since we're backing up a number of systems that have them.

@dweeezil
Contributor

@angstymeat Do you have any non-default settings for primarycache or any module parameters, especially any related to prefetch? I'd also like to confirm that there is only one pool on the system. I'm asking these questions because of how large evict_skip is and also because data_size is zero.

I've not been able to duplicate this problem yet but it sounds like the key is that you've also likely got a pretty heavy write load going on at some time as well as all the filesystem traversal.
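
One way to gather the information asked about above (a sketch; replace <pool> with the pool name):

zfs get -r primarycache <pool>            # per-dataset cache policy
grep -H . /sys/module/zfs/parameters/*    # every zfs module parameter and its current value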

@angstymeat
Author

There is one pool called "storage".

I have atime=off, xattr=sa, relatime=off, and acltype=posixacl. I also have com.sun:auto-snapshot set to true or false on various filesystems.

EDIT:
All of my module parameters should be standard. I've removed any modprobe.d settings so they don't interfere with the debugging.

I do see that after the first 25 to 30 minutes of my backups running, I lose all of the data portion of the ARC and I'm only caching metadata.

I use Cacti to graph some of my arc stats like cache hits & misses, size of the cache, arc size, meta_used, etc. Is there anything I can add to my graphs that would help?
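
As a sketch, the metadata-related counters discussed in this issue can be scraped straight out of arcstats for graphing; the field names are the ones that appear in the dumps above:

awk '/^(arc_prune|arc_meta_used|arc_meta_limit|arc_meta_max) /{print $1, $3}' \
    /proc/spl/kstat/zfs/arcstats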

@spacelama

I had arc_adapt spinning at 80% CPU after a week of uptime on 0.6.4.

xattr=on, but not being used AFAIK.

I'll try grabbing stats next time instead of emergency-rebooting.

@spacelama


arc_adapt is at 3% to 66% usage here (occasionally 100%). NFS serving of 600kB files is incredibly slow (seconds). The 3 devices in the pool have to do 160 tps in iostat -k 1 for the several seconds it takes to load up a single 600kB file. NFS serving is of the tank3 filesystems below:

zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 2.72T 2.21T 522G - 63% 81% 1.00x ONLINE -
tank3 5.45T 3.24T 2.22T - 15% 59% 1.00x ONLINE -
tank4 17.5G 960K 17.5G - 0% 0% 1.00x ONLINE -

zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 2.21T 479G 120K /tank
tank/backuppc 2.21T 479G 2.20T legacy
tank/backuppc@20150430 3.94G - 2.19T -
tank/backuppc@20150501 900M - 2.19T -
tank/backuppc@20150502 222M - 2.19T -
tank/backuppc@20150503 94.8M - 2.20T -
tank3 2.16T 1.42T 256K /tank3
tank3/.mp3.low_qual 7.09G 1.42T 7.09G /home/tconnors/.mp3.low_qual
tank3/apt_archives 28.9G 1.42T 28.9G /var/cache/apt/archives
tank3/background 268G 1.42T 268G /home/tconnors/background
tank3/mp3 159G 1.42T 159G /home/tconnors/mp3
tank3/images 820G 1.42T 820G /home/tconnors/images
tank3/photos 194G 1.42T 194G /home/tconnors/photos
tank3/pvr 365G 1.42T 365G /home/tconnors/movies/kaffeine
tank3/qBT_dir 165G 1.42T 165G /home/tconnors/qBT_dir
tank3/raid 26.0G 1.42T 26.0G -
tank3/scratch 128K 1.42T 128K /home/tconnors/scratch
tank3/sysadmin 7.72G 1.42T 7.72G /home/tconnors/sysadmin
tank3/thesis 166G 1.42T 166G /home/tconnors/thesis
tank4 852K 17.2G 120K /tank4

zfs get xattr
NAME PROPERTY VALUE SOURCE
tank xattr on default
tank/backuppc xattr on default
tank/backuppc@20150430 xattr on default
tank/backuppc@20150501 xattr on default
tank/backuppc@20150502 xattr on default
tank/backuppc@20150503 xattr on default
tank3 xattr on default
tank3/.mp3.low_qual xattr on default
tank3/apt_archives xattr on default
tank3/background xattr on default
tank3/mp3 xattr on default
tank3/images xattr on default
tank3/photos xattr on default
tank3/pvr xattr on default
tank3/qBT_dir xattr on default
tank3/raid xattr - -
tank3/scratch xattr on default
tank3/sysadmin xattr on default
tank3/thesis xattr on default
tank4 xattr on default

zfs get primarycache
NAME PROPERTY VALUE SOURCE
tank primarycache all default
tank/backuppc primarycache all default
tank/backuppc@20150430 primarycache all default
tank/backuppc@20150501 primarycache all default
tank/backuppc@20150502 primarycache all default
tank/backuppc@20150503 primarycache all default
tank3 primarycache all default
tank3/.mp3.low_qual primarycache all default
tank3/apt_archives primarycache all default
tank3/background primarycache all default
tank3/mp3 primarycache all default
tank3/images primarycache all default
tank3/photos primarycache all default
tank3/pvr primarycache all default
tank3/qBT_dir primarycache all default
tank3/raid primarycache all default
tank3/scratch primarycache all default
tank3/sysadmin primarycache all default
tank3/thesis primarycache all default
tank4 primarycache none local

(tank4 has nothing on it, tank3/raid is a zvol, tank3 is a 3 disk pool
with log and cache, tank is a 1 disk pool with log and cache)

modinfo zfs | grep version
version: 0.6.4-1-2-wheezy
srcversion: 6189308C5B9F92C2EE5B9F0
vermagic: 3.16.0-0.bpo.4-amd64 SMP mod_unload modversions
modinfo spl | grep version
version: 0.6.4-1-wheezy
srcversion: 88A374A9A6ABDC4BD14DF0E
vermagic: 3.16.0-0.bpo.4-amd64 SMP mod_unload modversions

As documented in #3235,

  • /proc/slabinfo
  • /proc/spl/kmem/slab
  • /proc/spl/kstat/zfs/arcstats
  • /proc/meminfo
  • SysRq-t (stacks of all tasks)
  • SysRq-w (stacks of blocked tasks)

here:

http://rather.puzzling.org/~tconnors/tmp/zfs-stats.tar.gz

2 runs of perf record -ag, perf report --stdio here:

http://rather.puzzling.org/~tconnors/tmp/perf.data.old.gz
http://rather.puzzling.org/~tconnors/tmp/perf.data.gz
http://rather.puzzling.org/~tconnors/tmp/perf.txt.old.gz
http://rather.puzzling.org/~tconnors/tmp/perf.txt.gz

Workload was absolutely nil at the time, except for the munin runs every 5 minutes, which most likely didn't correspond to these times.

Uptime is 13 days. The previous reboot was because the same symptoms developed after 7 days. The box is an NFS server and does full/incremental backups of a network of 3 other machines with up to ~800GB filesystems over rsync, the last backup probably having run 12 hours earlier (certainly not running at this time).

Tim Connors

@woffs

woffs commented May 13, 2015

Same here: v0.6.4-12-0c60cc-wheezy, Linux 3.16, two pools, dedup=off, xattr=sa.
When arc_meta is full (beyond its own limits: arc_meta_limit=38092588032, arc_meta_max=43158548680), performance is gone.
drop_caches=3 helps.
The problem is new with 0.6.4; it wasn't there with 0.6.0 through 0.6.3 (or at least 0.6.2). It reminds me of the old 0.5.x days, when I had to do drop_caches as well; 0.6.0 and later didn't need that.

@woffs

woffs commented Jun 10, 2015

Still the same problem here. Setting zfs_arc_meta_prune to large values doesn't help. Setting zfs_arc_meta_limit to a rather low value doesn't help. The problem shows up especially when traversing and renaming many entries in large directories. (I still have to use drop_caches=2, but I have already experienced a deadlock there, so this is no real solution.) Is there a chance to circumvent this condition by setting parameters, to get it fixed inside ZoL, or to avoid it by patching Linux or choosing another kernel version? I'm running v0.6.4-16-544f71-wheezy at the moment; the kernel is 3.16.7-ckt9-3deb8u1bpo70+1 (Debian Wheezy). Thank you!

@angstymeat
Author

Well, that stopped arc_prune and arc_adapt immediately. Do you want me to try running the backup scripts again?

@kernelOfTruth
Contributor

relevant commit:
f604673

Make arc_prune() asynchronous
...
This patch additionally adds the zfs_arc_meta_strategy module option
which allows the meta reclaim strategy to be configured. It defaults
to a balanced strategy which has been proved to work well under Linux
but the illumos meta-only strategy can be enabled.

I had to look up first what that command does ;)

@dweeezil
thanks for mentioning it

@angstymeat:

please do

@angstymeat
Author

Now I'm super-slow. There's almost no CPU usage and rsync is reading 1 file every few seconds.

@dweeezil
Contributor

@angstymeat Sure, I'd be interested to hear how it behaves. It would be nice to reboot, too, so that the arcstats are zeroed.

@kernelOfTruth My initial port of the arc mutex patch reverted reclaim to the illumos behavior but it was decided later on to add support for the "balanced" mode which was instituted in bc88866 to deal with the many reports of unevictable metadata. Furthermore, the balanced mode was made the default.

In @angstymeat's arcstats, the condition arc_meta_used > arc_meta_limit is true and seems to be "stuck" that way. This will cause (by default) 4096 loops in arc_adjust_meta_balanced(), half of which will call arc_prune_async(), which itself will dispatch as many threads as there are mounted ZFS filesystems to the prune taskq. The problem seems to be that it can't free enough metadata for the process to ever stop, so the system gets into a thrashing state where it's burning most of its CPU time dispatching threads. With the illumos strategy, the per-sb callbacks aren't invoked as frequently but will still be called by the arc shrinker.

@angstymeat Could we please see arcstats during the "super-slow" time?

@angstymeat
Author

Here's arcstats:

name                            type data
hits                            4    70384784
misses                          4    15806620
demand_data_hits                4    1608805
demand_data_misses              4    55508
demand_metadata_hits            4    51918024
demand_metadata_misses          4    10586442
prefetch_data_hits              4    111159
prefetch_data_misses            4    856433
prefetch_metadata_hits          4    16746796
prefetch_metadata_misses        4    4308237
mru_hits                        4    24851725
mru_ghost_hits                  4    334215
mfu_hits                        4    28675104
mfu_ghost_hits                  4    113754
deleted                         4    14523110
mutex_miss                      4    21340
evict_skip                      4    842402
evict_not_enough                4    177204
evict_l2_cached                 4    0
evict_l2_eligible               4    234518568960
evict_l2_ineligible             4    99826913280
evict_l2_skip                   4    0
hash_elements                   4    255160
hash_elements_max               4    346702
hash_collisions                 4    905539
hash_chains                     4    14536
hash_chain_max                  4    4
p                               4    7756597248
c                               4    8341868544
c_min                           4    33554432
c_max                           4    8341868544
size                            4    8410224768
hdr_size                        4    94918240
data_size                       4    0
meta_size                       4    2270906368
other_size                      4    6044400160
anon_size                       4    16924672
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1771446272
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    619285504
mru_ghost_evict_data            4    245149696
mru_ghost_evict_metadata        4    374135808
mfu_size                        4    482535424
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    716824576
mfu_ghost_evict_data            4    681854464
mfu_ghost_evict_metadata        4    34970112
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    5554960868
arc_meta_used                   4    8410224768
arc_meta_limit                  4    6256401408
arc_meta_max                    4    8410750480
arc_meta_min                    4    0

@kernelOfTruth
Contributor

@dweeezil so we retained that improvement - great!

Thanks

Is there also an issue with the L2ARC, or am I misinterpreting this?

evict_l2_cached                 4    0
evict_l2_eligible               4    727929288704
evict_l2_ineligible             4    74060988416
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0

@angstymeat
Author

Looking at it again, I have a load average of 7, but my overall CPU usage is 0.5%

@dweeezil
Contributor

@angstymeat Unfortunately, your meta_used is blowing by the meta_limit without the balanced mode enabled. I've always been a bit nervous about the default parameters for balanced mode. I think you should try switching back to balanced mode (echo 1 > /sys/module/zfs/parameters/zfs_arc_meta_strategy) and then lower zfs_arc_meta_adjust_restarts to 2 and play with different values of arc_meta_prune, maybe starting at 100000 or so. Larger values shouldn't make a huge difference. The value is ultimately passed into the kernel's generic sb shrinker (super_cache_scan()) where it's used to proportion the amount of the inode and dentry cache which it attempts to prune.

FWIW, I've not been able to reproduce this problem on my test system. I've created 100 filesystems and populated a source filesystem with 3.5M files. RAM has been limited to 16GiB and CPUs to 8. Then I run 10 concurrent rsyncs from the source filesystem into 10 of the 100 I created. It is rather slow, but I've never been able to get the CPU spinning being reported here. It seems there must be some other variable I'm missing. How many rsyncs are running? Do all the files have xattrs (posixacl)? You are still running with xattr=sa, right?

@angstymeat
Author

Well, that's funny. It's a good way to pause ZFS on my system. :)

I set zfs_arc_meta_strategy back to 1 and everything started back up again.

@angstymeat
Author

The script initially starts 18 rsyncs. Most of them complete in just a few minutes, leaving about 5 that go for 30 minutes. An hour after those are started (the other jobs are complete by now), another cron runs 4 more -- the big ones that pull our email and data-storage systems.

The first group usually runs OK, and I see stalling when the 2nd group runs. Since I applied #3481 and have been running 0.6.4-98_g06358ea, I'm seeing the problems happen sooner, while running the initial set of rsyncs. It seems like I hit the memory limit faster with the newer commits than I do with the released 0.6.4.

I'm already running with arc_meta_prune at 100000, and I just set zfs_arc_meta_adjust_restarts down to 2. So far, arc_adapt isn't running very much and the system looks like it's been able to free some memory.

I turned off xattr=sa a week or two ago on all of the filesystems, except for one. It turns out we only had one machine that was actually using them.

I don't suppose that my pool could somehow have some kind of metadata damage that is causing problems? I run a scrub every month or so and it hasn't reported any problems.

@angstymeat
Author

Performance seems to be a bit slower than it was before I set the meta_strategy to 0, but without `arc_prune` or `arc_adapt` taking up a lot of CPU.

@dweeezil
Contributor

@angstymeat I'd also like to point out the new zfs_arc_overflow_shift parameter, which is likely coming into play here. It is documented, BTW, and controls some brand-new logic which came along with the arcs_mtx restructuring.

Here's a brief outline: It controls whether the ARC is considered to be "overflowing". In your last arcstats, for example, arc_c=8341868544 and arc_size=8410224768 so the ARC is overflowing by 68356224 bytes. The default overflow shift of 8 will set the threshold in this case to 32852440 bytes (8410224768/256) so the overflow condition is in force.

The overflow condition is used in arc_get_data_buf(), where buffers are allocated. If the ARC is in the overflowing condition, the reclaim (adapt) thread is woken up and the allocation pauses until the reclaim is finished. This is likely what's causing your slowdown (little CPU utilization and low I/O rates); at least when you aren't using balanced mode.

The bottom line, however, is that the problem is being caused by not being able to evict enough metadata quickly. The metadata is pinned by the higher level structures in the kernel. Balanced mode definitely seems to help under Linux, but it seems it can cause real problems under some conditions. The key is likely going to be to find better settings for zfs_arc_meta_prune and zfs_arc_meta_adjust_restarts. The current scheme effectively calls the generic Linux superblock shrinker with multiples of zfs_arc_meta_prune on odd-numbered iterations where zfs_arc_meta_adjust_restarts controls the number of iterations. I had suggested using 100000,2 for the parameters and it seems to mostly help, maybe something like 100000,4 or 100000,6 would be better. Since you can reproduce this condition easily it would be interesting if you could find the best settings. I never cared much for the default iteration count of 4096 but it didn't cause much problem in the past. The reason it's a big deal now is that the pruning is done in separate threads and there is a thread for each mounted filesystem. I'm thinking of opening a separate issue specifically to address the taskq design because it's something that could be changed pretty easily and it seems to be having some unintended consequences.
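
To experiment along these lines, the relevant knobs and a quick snapshot of the values feeding the overflow calculation might look like this (a sketch; the numbers are the ones discussed above, not recommendations):

echo 1      > /sys/module/zfs/parameters/zfs_arc_meta_strategy           # balanced mode
echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
echo 4      > /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts
# snapshot size and c from arcstats and show how far size exceeds c
awk '/^size /{s=$3} /^c /{c=$3} END{print "size="s, "c="c, "size-c="s-c}' \
    /proc/spl/kstat/zfs/arcstats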

@angstymeat
Author

Thanks for the detailed explanation of when eviction is triggered. I'll try different values for zfs_arc_meta_adjust_restarts and see if I can find anything that helps.

Is there a reason it cannot catch up after the rsyncs finish? I regularly see that my rsync processes terminate and arc_adapt keeps running continuously afterwards while the rest of the system is idle. I let it go one time and it finally stopped around 20 hours later.

@woffs

woffs commented Jun 14, 2015

Meanwhile, my machine has been running smoothly for 4 days now and has finished the backup process 4 times, in 5½ hours each day, which is quite good. What I did:

  • booted into Linux 3.2
  • set zfs_arc_meta_adjust_restarts=64 zfs_arc_meta_prune=100000 vm.vfs_cache_pressure=100
  • limited the number of parallel running rsync/find's to 4 (was at 6 before)

I have still zfs-v0.6.4-16-544f71-wheezy running.
arc_meta_used is oscillating around arc_meta_limit (=39GB) and has never been larger than arc_meta_limit+130MB.

@dweeezil
Contributor

@angstymeat I may have found the problem. If you've got RAM in each of your Opteron NUMA nodes, the zfs_sb_prune() function will only prune from NUMA node 0 when running on >= 3.12 kernels. You can check with dmesg | grep "On node" and see whether the kernel is using memory on each node. If you've got a 2-node NUMA system with RAM on each node, please try #3495 and see if it doesn't allow for more metadata to be freed.
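
A couple of quick ways to check the NUMA layout (a sketch; numactl is only present if installed):

dmesg | grep "On node"            # one "On node N" line per node with memory
numactl --hardware                # nodes, CPUs and memory per node
ls /sys/devices/system/node/      # node0, node1, ... directories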

@angstymeat
Author

Wow, that's a good catch! I'm trying to now!

@angstymeat
Author

So far, so good! It's a big difference. I posted a small comment over at #3495 which pretty much says this seems to be working.

htop is showing the system had been hovering around 11GB in use and it suddenly jumped up to 15.5GB when my mail backups started. arc_adapt ran for a second or two along with some arc_prune threads, and I regained 1.5GB without impacting the running processes.

This would really explain why not all of our systems are experiencing this. Also, I think our virtual machines don't suffer from this since it appears that no matter how many CPUs they have, the machines only see 1 NUMA node.

The jobs are starting to finish up now, and my memory in use has dropped to 11.5GB with a couple still running. For the last 6 months I wouldn't see it go under 14GB at this point.

I'm really thinking this is it.

@angstymeat
Author

Also, yes, I'm showing that there's memory in 2 NUMA nodes on this system.

@spacelama


Do you run with any uptime in your tests? I always get initial good behaviour from ZFS, but then after a week it all blows up. I assume you create those millions of files in a complex directory structure too, just to make sure you're testing caching of directory entries vs. files?

@angstymeat
Author

I'm closing this because it looks like it's been solved with #3495.

@spacelama

Hmmm, I've recently had this bug pop up, and I'm not running a NUMA machine, so #3495 won't be my solution. I'll have to open a new bug when the next release is out (for now, master was so bad that I had to downgrade back to 0.6.4).

@angstymeat
Author

@spacelama You might want to look at #3501. They mention in one of the comments that it solves, for older kernels, an issue similar to the one #3495 addresses.
