arc_shrinker_scan() on Linux should be reentrancy-safe #10986

Closed
adamdmoss opened this issue Sep 27, 2020 · 5 comments
Labels
Status: Stale (no recent activity for issue); Status: Triage Needed (new issue which needs to be triaged); Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@adamdmoss
Contributor

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | ubuntu |
| Distribution Version | 19.04 |
| Linux Kernel | Linux version 5.4.0-42-generic (buildd@lgw01-amd64-023) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 |
| Architecture | x86_64 |
| ZFS Version | master cf26677 |
| SPL Version | master cf26677 |

Describe the problem you're observing

arc_shrinker_scan() in os/linux/zfs/arc_os.c may be invoked reentrantly, i.e. re-invoked before a previous invocation has completed.

This isn't inherently a problem but I'm not convinced that the downstream code (arc_reduce_target_size() and arc_wait_for_eviction()) is expecting this / robust to this.

Describe how to reproduce the problem

Found while looking into ARC collapse. Instrumenting arc_shrinker_scan() with an atomic enter/leave counter should reveal the problem.
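For illustration, an enter/leave counter of the kind described above could look roughly like the following. This is a userspace sketch using C11 atomics, not the instrumentation actually used; every name in it (scan_enter, scan_leave, scan_max_seen and the two counters) is hypothetical, and inside the kernel the equivalent would presumably use the Linux atomic_t helpers instead.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical instrumentation state; these names do not exist in OpenZFS. */
static atomic_int scan_active;     /* invocations currently inside the scan */
static atomic_int scan_max_active; /* high-water mark of scan_active */

/*
 * Call at the top of the instrumented function. Returns how many
 * invocations are in flight, including this one. Anything above 1
 * means the function was re-entered before an earlier call finished.
 */
int scan_enter(void)
{
	int now = atomic_fetch_add(&scan_active, 1) + 1;
	int max = atomic_load(&scan_max_active);

	/* Raise the high-water mark; a failed CAS reloads max for us. */
	while (now > max &&
	    !atomic_compare_exchange_weak(&scan_max_active, &max, now))
		;
	return (now);
}

/* Call on every exit path of the instrumented function. */
void scan_leave(void)
{
	int prev = atomic_fetch_sub(&scan_active, 1);

	assert(prev > 0); /* catches an unbalanced leave */
}

/* Highest number of simultaneous invocations observed so far. */
int scan_max_seen(void)
{
	return (atomic_load(&scan_max_active));
}
```

If scan_max_seen() ever reports a value above 1, at least two invocations overlapped. Logging the return value of scan_enter() together with the calling thread would help tell apart calls from separate threads and nested calls on the same thread.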

I'm not sure if this is unique to either direct or indirect reclaim, though either can be a victim.

I'm also not sure if this is due to arc_shrinker_scan() being invoked from multiple (kswapd?) threads or whether zfs itself is causing stop-the-world emergency direct calls to its own shrinker.

I suspect it's a mix, mostly the latter: arc_shrinker_scan() waits on eviction to shrink memory usage, but the eviction path itself requires further allocations, which is a troubling combination when trying to respond to critical memory pressure. That is probably also a bug or design mishap, which I'm happy to file separately if you think it's useful.

adamdmoss added the Status: Triage Needed and Type: Defect labels on Sep 27, 2020
@adamdmoss
Contributor Author

(FAO @ahrens I think)

@adamdmoss
Contributor Author

I have a band-aid patch, but I believe it raises the likelihood of the OOM killer kicking in, because it involves lying to the kernel about the progress we're making. I didn't finesse the patch, so possibly that can be cured by ensuring that arc_shrinker_count() always returns 0 while a shrink is already in progress.
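To make the idea concrete, here is a rough userspace sketch of a "one shrink at a time" guard in which the count callback reports 0 while a scan is running. It is not the patch mentioned above and not OpenZFS code: every demo_* name is hypothetical, DEMO_SHRINK_STOP only stands in for the kernel's SHRINK_STOP, and the eviction callback deliberately re-enters the shrinker to mimic an allocation in the eviction path triggering direct reclaim.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel's SHRINK_STOP; an assumption of this sketch. */
#define	DEMO_SHRINK_STOP	(~0UL)

/* True while one scan is in progress (hypothetical guard). */
static atomic_bool demo_shrink_busy;

/* Results recorded by the re-entrant eviction callback below. */
unsigned long demo_nested_count;
unsigned long demo_nested_scan;

/* Report nothing reclaimable while a scan is already running. */
unsigned long demo_shrinker_count(unsigned long evictable)
{
	return (atomic_load(&demo_shrink_busy) ? 0 : evictable);
}

/* Let a single caller through; concurrent or nested callers back off. */
unsigned long demo_shrinker_scan(unsigned long nr_to_scan,
    unsigned long (*evict)(unsigned long))
{
	bool expected = false;

	if (!atomic_compare_exchange_strong(&demo_shrink_busy,
	    &expected, true))
		return (DEMO_SHRINK_STOP);

	unsigned long freed = evict(nr_to_scan);

	atomic_store(&demo_shrink_busy, false);
	return (freed);
}

/*
 * Eviction stand-in that calls back into the shrinker, as an
 * allocation made during eviction might under memory pressure.
 */
unsigned long demo_evict_reentrant(unsigned long n)
{
	demo_nested_count = demo_shrinker_count(100);
	demo_nested_scan = demo_shrinker_scan(n, demo_evict_reentrant);
	return (n); /* pretend everything requested was freed */
}
```

Calling demo_shrinker_scan(8, demo_evict_reentrant) lets the outer call proceed while the nested call sees a count of 0 and gets DEMO_SHRINK_STOP back. The trade-off is the one raised above: the callers that are turned away are told no progress is possible, which may push the kernel toward the OOM killer sooner.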

@ahrens
Member

ahrens commented Sep 27, 2020

The ARC shrinking down to nothing (when there isn't memory pressure to justify it) is a problem, which should have been addressed by my earlier work. It sounds like that may not have covered all cases. Can you provide some more details about what's causing this?

> I'm not convinced that the downstream code (arc_reduce_target_size() and arc_wait_for_eviction()) is expecting this / robust to this.

What problem are you seeing with concurrent calls to these routines?

@stale

stale bot commented Sep 28, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Sep 28, 2021
@behlendorf
Contributor

As commented above, it's my understanding that Matt's recent work addressed this issue. If that's not the case, please let us know and we can reopen this, hopefully with some fresh details to investigate.


3 participants