100% CPU load from arc_prune #9966
More info:
ARC_MAX is not set, it's left at the default. Cache SSD is 160GB. All disks are dual-ported SAS drives in an external shelf.
Even more info:
There's nothing out of the ordinary in syslog until the "hung task" warnings.
Does it look like I'm getting hit by the "don't make the cache disk bigger than 5 x 1/2 RAM" problem? I thought that was no longer a thing.
@dmaziuk that's a possibility. Can you post the contents of the /proc/spl/kstat/zfs/arcstats file?
System load is down to 95-ish% now BTW (but the arc_prune threads are still spinning).
According to the arcstats the ARC is entirely filled with metadata and is over the target (arc_meta_used > arc_meta_limit). The arc_prune threads are trying to evict some of this metadata to get down to the 75% target value.
They appear to be unable to make any progress on this, which is why they're spinning. If I recall correctly, this was caused by the file handles cached by the nfsd holding a reference which prevents the ARC from freeing the buffers. I believe the nfsd behavior here is a little different in newer kernels and better behaved. You should be able to stop the spinning by setting the limit to 100%. You can make this the default with the zfs_arc_meta_limit module option:
echo 16893738168 > /sys/module/zfs/parameters/zfs_arc_meta_limit
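A minimal sketch of making that change persistent across reboots, assuming the 0.8/2.x parameter names; the byte value mirrors the one-off echo above and is specific to that system (size it to your own RAM), and the file name is only an example:
# Persist the metadata limit via a modprobe options file.
$ echo 'options zfs zfs_arc_meta_limit=16893738168' | sudo tee /etc/modprobe.d/zfs-arc.conf
# Alternatively, express it as a share of arc_max (the default is 75):
#   options zfs zfs_arc_meta_limit_percent=100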
Done. @behlendorf, if you need anything else from this system, say so, because my users are crying and I have to bounce it. Would it be possible to make it not use all available cores? -- it's a 16-core system that has 16 instances of arc_prune running.
Feel free to bounce the system, the arcstats are enough to confirm the issue. It definitely can, and should, be improved. We have some ideas about how to improve things.
Is this from the RAM ARC or the cache SSD ARC? I.e. if it's the latter, can I get around it by disconnecting the cache device? -- Temporarily, while moving a zillion files around.
It's the RAM ARC, disconnecting the cache device won't help. Setting zfs_arc_meta_limit as described above is the workaround for now.
OK, thank you. I did set it in the module options. Thanks again.
@behlendorf thanks for providing the hint, it made the spinning go away for me as well! How do you plan to solve this by default in the future, without needing user intervention?
The workaround had only temporary effects. Update: I had to increase zfs_arc_max.
Eventually it's spinning again, despite increasing zfs_arc_max.
Hasn't happened here -- knock on wood -- but I'm done copying lots of files. Mine's got 64G RAM and a 160G cache SSD, ARC size is left at the default. It's a 2x8 Opteron and it didn't lock up until it had been moving files for a couple of days.
I have seen this arc_prune behavior last year on one server with 32GB of ECC RAM. I had to disable the weekly scrubs because of it. I think there is another thread where I commented about that. Since I have just moved to a new building and the network topology is completely different, I can't test now.
I'm seeing the issue again after reboot, on a simple workload.
I thought I'd add a note that I have also seen this on a Debian buster with the Debian backports version on amd64. It did clear itself out after some time (hours, if I remember right). If I see it again I'll try to get more information. A couple of data points worth noting: no cache device, it only receives snapshots from a different system, little if any user I/O, and no network shares from it.
I'm seeing a similar problem (arc_prune threads spinning at high CPU). Why is there so much metadata in the ARC in the first place? If relevant, the machine in question is a dedicated backup machine, currently doing nothing other than being the receiving end of a lot of zfs receive streams.
System info:
Here is the current state of the ARC (allowing 100% arc_meta, but still having prune issues):
Below is a crude view of the incident from a CPU load perspective across all CPUs, where the middle marker is when I was bumping zfs_arc_meta_limit. I don't know enough about ARC performance to know why metadata would be such a significant fraction, nor whether it is a long-term problem to move it from the default 75% limit to 100%.
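As an aside, the metadata share of the ARC can be read straight from the kstats; a rough sketch assuming the 0.8/2.1 arcstats field names:
$ awk '$1 ~ /^(size|arc_meta_used|arc_meta_limit|arc_meta_max)$/ {printf "%-15s %8.2f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats
# arc_prune is a plain counter of prune callbacks issued; rapid growth of it
# during an incident matches the behavior described in this issue.
$ grep -w arc_prune /proc/spl/kstat/zfs/arcstats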
maybe related/duplicate: #10119
For what it's worth, whenever I run a workload on my system that I know will peg the arc_prune threads, I take a preventive step first. It's blunt and brutal, but it stops those arc_prune threads from spinning.
Thanks
Not for me, I already tried that; eventually it still comes to a system hang.
might be related to #7559
On FreeBSD vnode reclamation is single-threaded, protected by single global lock. Linux seems to be able to use a thread per mount point, but at this time it creates more harm than good. Reduce number of threads to 1, adding tunable in case somebody wants to try more.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12896
Issue #9966
I just had another EL8.5 NFS server running zfs-2.1.2 get stuck with this problem. Is there an estimated timeline for releasing 2.1.3 with the initial set of patches for this?
We're finalizing and testing the 2.1.3 patch stack at the moment in PR 13063. It will include at least one change which should help with this.
Thanks, I have just subscribed to that PR to track its progress -- much appreciated!
FYI, I have observed a 2.1.4 system unable to keep up with arc pruning using the new default of a single thread (#13231).
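For reference, the commit above added a tunable for the prune thread count; on builds that carry that patch, something like the sketch below can raise it again. The value is illustrative only.
# Check that the tunable exists on this build before relying on it.
$ ls /sys/module/zfs/parameters/ | grep arc_prune
# Raise the number of prune threads (default is 1 after that patch); if the
# parameter turns out to be read-only on your build, set it via /etc/modprobe.d instead.
$ echo 4 | sudo tee /sys/module/zfs/parameters/zfs_arc_prune_task_threads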
Hah, I love that on that huge system (by my standards :) the extra latency is going from 0.002s to 0.2s for a write operation. Even that worst case would have been a gloriously fast interaction vs my Very Bad Days.
Hey, looks like I got bitten by that problem. In my case I believe the culprit was a few things happening at once:
For some time the system struggled to stay stable but managed to keep working; a second such flood rendered it totally unresponsive, and after 6h I decided to force power it to get it back into operation. Looks like keeping Docker on ZFS plus auto-snapshots was a really bad idea in the end: zfs-auto-snapshot was making a snap for each Docker layer, so it ended up in the thousands.
The one server I had issues with now works fine with zfs 2.1.4. Not sure if that's at all relevant: I have several servers/notebooks running NixOS with ZFS. The only one that had a problem with 2.1.1 was an Intel-based server; everything else is AMD. It probably shouldn't matter, it's just something I noticed now.
Hm, experienced the issue again today. I guess I will have to limit the number of backups and upgrade the box to 22.04 to get zfs 2.1.4.
I feel like I'm also hit by this 'arc_prune storm' issue on 2.1.6 (sorry, not on a newer 2.1.x because of the lack of newer packages in the Ubuntu PPA). What I noticed on the arcstats graphs is a staircase pattern in the metadata counters around the incidents, and a significant increase in prune activity at the same time.
Then I had a look at the code responsible for this and noticed a possible issue/cause. Could it be that this part of arc.c (lines 4395 to 4412 in 184508c) keeps restarting while the value of meta_used isn't updated in between (line 4477 in 184508c)? It is restarted up to 4096 times (zfs_arc_meta_adjust_restarts, line 465 in 184508c); see lines 4475 to 4478 in 184508c.
This is all still present in the 2.1.10-staging branch. I'm not an expert on C or the ZFS code base, but this seems... odd. 🤔 I'm aware of a full rewrite of the ARC code on master/2.2, but I would really appreciate some clue on how to remove the trigger of this storm without having to install master/2.2. Is it safe to reduce zfs_arc_meta_adjust_restarts? Happy to provide more information as well; I've got all the statistics at hand here.
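For anyone following along, the tunables referenced in this analysis can be inspected at runtime; the names below are the 2.1.x ones and go away with the master/2.2 ARC rewrite:
$ grep . /sys/module/zfs/parameters/zfs_arc_meta_{limit,limit_percent,adjust_restarts,strategy}
# zfs_arc_meta_strategy=1 selects the "balanced" strategy that contains the
# restart loop discussed above; zfs_arc_meta_adjust_restarts defaults to 4096.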
Does avail (memory_available_bytes) go below zero when the storm happens? @amotin any ideas on what is causing the issue in this case? Thanks.
Hi @shodanshok, you're not gonna believe it, but that's the single metric I don't have; I was pulling my hair out wondering why... turns out it's a bug in node_exporter. 😣 prometheus/node_exporter#2656
Saw your commits and issue comments before already; they've proven very helpful along the way, thanks! 🙏🏼
Update: a new incident happened. I don't see avail running below 0 here:
arcstat output around the incident, ~13:30-13:31
@gertvdijk Your analysis about meta_used not being updated between iterations looks valid. As far as I can see, the change was made in 37fb3e4 by @pcd1193182 as a performance optimization that appears to also change the code logic. The proper fix would probably be to update meta_used between the iterations, but if you need to do something without recompilation then a dramatic reduction of zfs_arc_meta_adjust_restarts will probably reduce the issue too. It is not a real fix though, since depending on the evicted amount you may end up either not evicting enough or not pruning enough.
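A sketch of trying that at runtime; the value is purely illustrative and, per the comment above, this is a mitigation rather than a fix:
# Lower the number of restart iterations of the balanced eviction loop
# (default 4096). The tunable appears to be writable at runtime on 2.1.x.
$ echo 64 | sudo tee /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts
# To persist across reboots:
#   options zfs zfs_arc_meta_adjust_restarts=64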
@amotin Can this be fixed in the upcoming 2.1.10 release, while having your more comprehensive ARC rewrite in 2.2?
@gertvdijk That lack of recalculating meta_used indeed looks highly suspicious. Are you willing/able to test a patch, per the attached? Note: the patch has been compile-tested only.
Thanks everyone for the replies so far. Extrapolating the arc stats along the failure pattern, I should face the next incident within the next 2-4 hours. 🥲
To elaborate on the I/O patterns on this machine, which may be relevant in this case: 'thousands' of disk images being accessed over NFS, opened as file-backed loop devices on other machines. The images are files on a plain single ZFS dataset and are sparse. ZFS+NFS is the only thing this machine does, basically. Continuous frequent snapshots plus removal of them, combined with send/receive (replication). Mostly small random reads and sync (over)writes across the files, so I'm using a fairly small recordsize of 16K as a balance between a low/good write amplification and not being penalized too much capacity-wise with the RAIDZ vdev setup. The pool layout is an all-NVMe-SSD 10-disk RAIDZ2, with an extra-low-latency NVMe LOG disk (again, mostly to coalesce the random writes and reduce the WAF). No L2ARC cache vdev. No other pools active on this server. Tweaks: ARC max set to 83% of the 256GB RAM.
@chrisrd Thanks also for your confirmation and patch on this!
Update 2: I have reduced zfs_arc_meta_adjust_restarts as suggested; see the follow-up below for the results.
It turns out to be possible to derive the 'avail' / 'memory_available_bytes' information from other metrics that the Prometheus node_exporter provides, because in the code the 'avail' computation is just 'memory_free_bytes - arc_sys_free', and better yet the latter is generally a constant for a given system. (I discovered this today while digging into the statistics. See module/os/linux/zfs/arc_os.c for arc_available_memory(), which is directly used to generate the kstat when you look, and internally in the ARC code in relevant places.)
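A rough shell equivalent of that derivation, reading the kstats directly (assuming a 2.1-era arcstats layout where both fields are present):
# memory_available_bytes ("avail") = memory_free_bytes - arc_sys_free
$ free_b=$(awk '$1 == "memory_free_bytes" {print $3}' /proc/spl/kstat/zfs/arcstats)
$ sys_free=$(awk '$1 == "arc_sys_free" {print $3}' /proc/spl/kstat/zfs/arcstats)
$ echo $(( free_b - sys_free ))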
Reducing zfs_arc_meta_adjust_restarts alone did not make the prune storms go away entirely here. By setting zfs_arc_meta_strategy to 0 (line 464 in 184508c), arc_evict_meta() takes the metadata-only eviction path instead of the balanced one (lines 4523 to 4526 in 184508c), bypassing the restart loop discussed above.
Surprisingly, this is a dynamic tunable, so without any downtime I was able to change this last Thursday:
$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_meta_strategy
Again, a preliminary conclusion I draw from this, but it looks significant: two full-load days without an arc prune storm so far. HTH
Total ARC size, metadata and data size plotted; the vertical blue line indicates the change of zfs_arc_meta_strategy:
As this has been open for a while, I'd like to inquire if folks are still seeing this issue, and (importantly) if you have any specific way(s) to repro the condition, so that it can be investigated.
I remember I had issues back then on 0.8.x; then I switched to 2.x and have had no issues since.
System information
Describe the problem you're observing
Moving 14TB of external data into a zfs dataset, after a couple of days the CPU is at 99% system load. Clients time out trying to mount the dataset (exported via sharenfs).
Describe how to reproduce the problem
It's the first one for me, no idea if it's reproducible.
Include any warning/errors/backtraces from the system logs