arc_adapt left spinning after rsync with lots of small files #3303
Comments
@angstymeat I've started looking over your debug output. It's not really spinning in [...]. You might try increasing `zfs_arc_meta_prune`. If all else fails and you want to try something which should immediately free some metadata, try setting [...]. Regardless, I think you should avoid using `drop_caches`.
I had been running with [...]. I really want to avoid using [...]. If 100,000 isn't enough, do you have a recommendation for a higher number? Should I try increasing it by an order of magnitude each time I test? Is there an upper limit beyond which it will cause more problems than it solves?
Also, does [...]?
I just tried setting `zfs_arc_meta_limit` [...].
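For anyone following along, these tunables can be changed at runtime through the module parameter files. A minimal sketch, assuming a stock ZoL install that exposes them under /sys/module/zfs/parameters (the values are only examples, not recommendations):

```sh
# Inspect the current values.
cat /sys/module/zfs/parameters/zfs_arc_meta_prune
cat /sys/module/zfs/parameters/zfs_arc_meta_limit

# Example: raise zfs_arc_meta_prune at runtime. The change takes effect
# immediately but does not persist across reboots unless it is also added
# to /etc/modprobe.d/zfs.conf.
echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
```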
@dweeezil What's your reasoning behind going as far as to say it would be best to avoid drop_caches completely? Since the VFS dentry cache can't be bypassed, I see no other option than to use this as a workaround. I've set vfs_cache_pressure to a really high number (IIRC 1M), but in the end the dentry cache over time causes the ARC as a whole to shrink. My use case is OpenVZ containers, where others use KVM. I'm just curious: what negative side effects does drop_caches have, besides obviously dropping even dentries for non-ZFS data (though that's not a problem in my scenario)?
I've set [...]. My rsyncs are still running, so I'm going to wait until they've finished and see if arc_adapt continues to run afterwards. So far, I've tried this with [...]. Also, I'm seeing [...].
@snajpa Regarding potential issues when using [...].

@angstymeat Your issue certainly appears to be the same as the one bc88866 and 2cbb06b were intended to fix. The question is why it's not working in your case. I suppose a good first step would be to watch the value of [...]. As to increasing [...].
I think I found the problem. torvalds/linux@9b17c6238 (which first appeared in kernel 3.12) added a node id argument to the `nr_cached_objects` callback [...]. As a hack, you can try: [...]
I'll see if I can work up a proper autoconf test later today (which will handle the intermediate kernels). We'll need yet another flag to handle < 3.12 kernels which still do have the callback.

And to make matters even more fun, torvalds/linux@4101b62 changed it yet again (post 4.0). This issue also applies to the free_cached_objects callback. I'm working on an enhanced autoconf test.
@dweeezil If I understand this correctly, echoing 1 to drop_caches is OK when we're talking primarily about cached metadata, and that's what I've been doing from the start. You're right that echoing 2 might get the system stuck in reclaim more or less forever; I've experienced that. It's a bad idea to do that on a larger system (RHEL6 kernel, 2.6.32, heavily patched).
I just posted pull request #3308 to properly enable the per-superblock shrinker callbacks. I'm wondering, however, whether we ought to actually do something in the [...].
@snajpa Yes, my comments apply only to the [...].
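For reference, the semantics of the different drop_caches values being discussed are documented in the kernel's Documentation/sysctl/vm.txt:

```sh
# 1 - free the page cache
# 2 - free reclaimable slab objects (dentries and inodes)
# 3 - free both
sync                                  # drop_caches only releases clean objects
echo 2 > /proc/sys/vm/drop_caches     # the workaround used in this thread
```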
My backup has now been running for 15 hours. I'm going to stop it, apply #3308, leave zfs_arc_meta_prune at its default value, and try it again.
@angstymeat Please hold off just a bit on applying that patch. I'm trying to get a bit of initial testing done right now and am going to try to do something to [...].
I just finished compiling and I'm rebooting, but I'll hold off on doing anything.

@angstymeat I'm not going to have a chance to work with this more until later. The latest commit in the branch (6318203) should be safe; however, I'll be surprised if it helps. That said, it certainly shouldn't hurt matters any.

I'll try it out just to make sure nothing breaks, then.
@angstymeat As I pointed out in a recent comment in #3308, it's very unlikely to make any difference at all and may even make things worse. Have you been able to grab arcstats (arc_prune in particular) during the problem yet?
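One simple way to capture what is being asked for while arc_adapt is busy; a sketch assuming the usual kstat location under /proc/spl/kstat/zfs:

```sh
# Sample arc_prune and the metadata accounting every 30 seconds (Ctrl-C to stop).
while true; do
    date
    grep -E '^(arc_prune|arc_meta_used|arc_meta_limit|size|c) ' \
        /proc/spl/kstat/zfs/arcstats
    sleep 30
done
```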
I wasn't expecting anything, and I tried it with the April 17th ZFS commits. I ran the backups and arc_adapt continued running afterwards. It's hard to tell, but it could be running with a little more CPU usage than before. Its range before was around 3% to 3.8%; now it looks like it's between 3.8% and 4.3%. It's been going for about 6 hours now, and [...]. Here's arcstats:

[...]

and again from 30 seconds later:

[...]
@angstymeat Does this system have selinux enabled (and using a policy which supports zfs), and/or are there a lot of xattrs set on the files on the zfs filesystem? Is [...]?

The problem you're facing is that the new logic in [...].

I'm going to modify my test suite to use a dir-style xattr on each file to see whether I can duplicate this behavior.
No selinux, but xattr=sa is set on all of the filesystems since we're backing up a number of systems that have them.
@angstymeat Do you have any non-default settings for [...]? I've not been able to duplicate this problem yet, but it sounds like the key is that you've likely also got a pretty heavy write load going on at some point in addition to all the filesystem traversal.
There is one pool called "storage". I have [...].

EDIT: I do see that after the first 25 to 30 minutes of my backups running, I lose all of the data portion of the ARC and I'm only caching metadata. I use Cacti to graph some of my ARC stats, like cache hits and misses, cache size, ARC size, meta_used, etc. Is there anything I can add to my graphs that would help?
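If it helps, the raw numbers for a Cacti data source can be pulled straight out of arcstats. A small sketch (the counters chosen below are just a guess at what is useful here):

```sh
# Print name/value pairs for a few ARC metadata counters; the third column
# of each arcstats line holds the current value.
awk '$1 ~ /^(arc_prune|arc_meta_used|arc_meta_limit|arc_meta_max)$/ { print $1, $3 }' \
    /proc/spl/kstat/zfs/arcstats
```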
I had arc_adapt spinning at 80% CPU after a week of uptime on 0.6.4, with xattr=on but, AFAIK, no xattrs actually in use. I'll try grabbing stats next time instead of emergency-rebooting.
On Sun, 19 Apr 2015, Tim Chase wrote: [...]

arc_adapt is at 3% to 66% usage here (occasionally 100%). NFS serving of 600kB [...] (tank4 has nothing on it, tank3/raid is a zvol, tank3 is a 3-disk pool [...]). As documented in #3235, [...]

[...] here: http://rather.puzzling.org/~tconnors/tmp/zfs-stats.tar.gz

2 runs of [...]: http://rather.puzzling.org/~tconnors/tmp/perf.data.old.gz

Workload was absolutely nil at the time, except for the 5-minutely munin [...]. Uptime is 13 days. The previous reboot was because the same symptoms developed [...].

Tim Connors
Same here: v0.6.4-12-0c60cc-wheezy, linux-3.16, two pools, dedup=off, xattr=sa.
Still the same problem here. Setting zfs_arc_meta_prune to large values doesn't help. Setting zfs_arc_meta_limit to a rather low value doesn't help. The problem shows up especially when traversing and renaming many entries in large directories. (I still have to use drop_caches=2, but I have already experienced a deadlock there, so this is no real solution.) Is there a chance to circumvent this condition by setting parameters, or to get it fixed inside ZoL? Or by patching Linux or choosing another kernel version? I'm running v0.6.4-16-544f71-wheezy at the moment; the kernel is 3.16.7-ckt9-3.
Well, that stopped [...].

Relevant commit: [...]

I first had to look up what that command does ;) @dweeezil, please do [...]
Now I'm super-slow. There's almost no CPU usage and rsync is reading 1 file every few seconds.
@angstymeat Sure, I'd be interested to hear how it behaves. It would be nice to reboot, too, so that the arcstats are zeroed.

@kernelOfTruth My initial port of the ARC mutex patch reverted reclaim to the illumos behavior, but it was decided later on to add support for the "balanced" mode which was instituted in bc88866 to deal with the many reports of unevictable metadata. Furthermore, balanced mode was made the default. In @angstymeat's arcstats, the condition [...].

@angstymeat Could we please see arcstats during the "super-slow" time?
Here's arcstats:

[...]
@dweeezil So we retained that improvement, great! Thanks. Is there also an issue with the L2ARC, or am I misinterpreting this?

[...]
Looking at it again, I have a load average of 7, but my overall CPU usage is 0.5%.
@angstymeat Unfortunately, your meta_used is blowing past the meta_limit without balanced mode enabled. I've always been a bit nervous about the default parameters for balanced mode. I think you should try switching back to balanced mode ([...]).

FWIW, I've not been able to reproduce this problem on my test system. I've created 100 filesystems and populated a source filesystem with 3.5M files. RAM has been limited to 16GiB and CPUs to 8. Then I run 10 concurrent rsyncs from the source filesystem into 10 of the 100 I created. It is rather slow, but I've never been able to get the CPU spinning being reported here. It seems there must be some other variable I'm missing. How many rsyncs are running? Do all the files have xattrs (posixacl)? Are you still running with [...]?
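Assuming the branch under test exposes the reclaim strategy as the zfs_arc_meta_strategy module parameter (0 = metadata-only, 1 = balanced), switching back would look roughly like this:

```sh
# 1 selects the "balanced" metadata reclaim strategy, 0 the metadata-only
# strategy; like the other ARC tunables, this takes effect immediately.
echo 1 > /sys/module/zfs/parameters/zfs_arc_meta_strategy
cat /sys/module/zfs/parameters/zfs_arc_meta_strategy
```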
Well, that's funny. It's a good way to pause ZFS on my system. :) I set [...]
The script initially starts 18 rsyncs. Most of them complete in just a few minutes, leaving about 5 that go for 30 minutes. An hour after those are started (the other jobs are complete by now), another cron job runs 4 more: the big ones that pull our email and data-storage systems. The first group usually runs OK, and I see stalling when the 2nd group runs.

Since I applied #3481 and have been running 0.6.4-98_g06358ea, I'm seeing the problems happen sooner, while the initial set of rsyncs is still running. It seems like I hit the memory limit faster with the newer commits than I do with the released 0.6.4.

I'm already running with [...]. I turned off [...].

I don't suppose my pool could somehow have some kind of metadata damage that is causing problems? I run a scrub every month or so and it hasn't reported any problems.
Performance seems to be a bit slower than it was before I set the meta_strategy to 0, but without `arc_prune` [...]
@angstymeat I'd also like to point out the new `zfs_arc_overflow_shift` parameter. Here's a brief outline: it controls whether the ARC is considered to be "overflowing". In your last arcstats, for example, arc_c=8341868544 and arc_size=8410224768, so the ARC is overflowing by 68356224 bytes. The default overflow shift of 8 sets the threshold in this case to 32852440 bytes (8410224768/256), so the overflow condition is in force. The overflow condition is used in [...].

The bottom line, however, is that the problem is being caused by not being able to evict enough metadata quickly. The metadata is pinned by higher-level structures in the kernel. Balanced mode definitely seems to help under Linux, but it seems it can cause real problems under some conditions. The key is likely going to be to find better settings for [...]
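To make the arithmetic above concrete, here is a rough, read-only check of the overflow condition against live arcstats. It assumes the build exposes zfs_arc_overflow_shift as a module parameter and is only an approximation of the in-kernel arc_is_overflowing() test:

```sh
awk -v s="$(cat /sys/module/zfs/parameters/zfs_arc_overflow_shift)" '
    $1 == "c"    { c = $3 }
    $1 == "size" { size = $3 }
    END {
        over   = size - c        # how far size is past the target
        thresh = c / (2 ^ s)     # roughly c >> zfs_arc_overflow_shift
        printf "size=%d c=%d over=%d threshold=%d overflowing=%s\n",
            size, c, over, thresh, (over > thresh) ? "yes" : "no"
    }' /proc/spl/kstat/zfs/arcstats
```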
Thanks for the detailed explanation of when eviction is triggered. I'll try different values for [...]. Is there a reason it cannot catch up after the rsyncs finish? I regularly see that my rsync processes will terminate and it will keep running [...]
Meanwhile, my machine has been running smoothly for 4 days now and has finished the backup process 4 times, in about 5½ hours each day, which is quite good. What I did: [...]

I still have zfs-v0.6.4-16-544f71-wheezy running.
@angstymeat I may have found the problem. If you've got RAM in each of your Opteron NUMA nodes, the [...]
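A quick sanity check of the NUMA layout, for anyone wondering whether their box is in the same situation (numactl is optional; the /sys view is always available):

```sh
# Show the NUMA topology and per-node memory, if numactl is installed.
numactl --hardware

# Without numactl: one meminfo per node; more than one node with a nonzero
# MemTotal means RAM really is spread across NUMA nodes.
grep MemTotal /sys/devices/system/node/node*/meminfo
```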
Wow, that's a good catch! I'm trying it now!
So far, so good! It's a big difference. I posted a small comment over at #3495 which pretty much says this seems to be working.
This would really explain why not all of our systems are experiencing this. Also, I think our virtual machines don't suffer from this since it appears that, no matter how many CPUs they have, they only see 1 NUMA node.

The jobs are starting to finish up now, and my memory in use has dropped to 11.5GB with a couple still running. For the last 6 months I wouldn't have seen it go under 14GB at this point. I'm really thinking this is it.
Also, yes, I'm showing that there's memory in 2 NUMA nodes on this system.
On Sat, 13 Jun 2015, Tim Chase wrote: [...]

Do you run with any uptime in your tests? I always get initial good [...]
I'm closing this because it looks like it's been solved with #3495.
Hmmm, I've recently had this bug pop up, and I'm not running a NUMA machine, so #3495 won't be my solution. I'll have to open a new bug when the next release is out (for now, master was so bad that I had to downgrade back to 0.6.4).
@spacelama You might want to look at #3501. They mention in one of the comments that it solves, for older kernels, a similar issue to the one #3495 addresses.
This is a split off of #3235. The symptom is that after a large set of rsyncs involving lots of small files, I'm left with arc_adapt spinning at around 3.3% CPU time.
I've compiled the info that @dweeezil wanted here: https://cloud.passcal.nmt.edu/index.php/s/l52T2UhZ0K7taY9.
The system is a Dell R515 with 16GB of RAM. The pool is a single raidz2 pool made up of 7 2TB SATA drives. Compression is enabled and set to lz4. atime is off.
The OS is Fedora 20 with the 3.17.8-200.fc20.x86_64 kernel.
The machine this happens on is an offsite server that hosts our offsite backups. There are 20+ rsync processes running that send large and small files from our internal systems to this system. The majority of the files are either large data files or the files you would typically find as part of a linux installation (/usr, /bin, /var, etc.)
Also, about 50 home directories are backed up containing a mix of large and small files.
This takes about 45 minutes. One hour after these jobs are started, the email servers begin their backup (so there's usually about a 15-minute gap between the end of one set of backups and the start of the next). Also, our large data collector sends its backups at this time. These are mostly large files, but there are a lot of them. It is sometime during this 2nd-stage backup that this issue occurs.
During the backup, I have another process that periodically runs:

    echo 2 > /proc/sys/vm/drop_caches

I do this because once the ARC gets filled up, performance drops drastically. Without doing this, it will take up to 10 hours to perform a backup; with it, less than 2 hours. This happens even if I do not run the periodic drop_caches, but it seems to occur less often.
The drop_caches is a relatively new addition to my scripting on this system, as this didn't appear to happen under 0.6.3. I don't have a good idea of when it started, but I'm pretty sure it was sometime around the kmem rework in 0.6.4.
I am unable to roll back and test under 0.6.3 because the new feature flags were enabled while I was testing 0.6.4. This unit is not a critical system, so I have quite a bit of leeway with it, as long as I don't have to destroy and recreate the pool. I usually use it to test new versions of ZFS since it gets a lot of disk usage.