OOM on zfs scrub #13546
Comments
As would I, and we may in fact have just merged a fix for this. The patch in #13537 resolves a sequential scan memory accounting bug which would cause ZFS to underestimate the amount of memory in use. This could potentially lead to an OOM if memory was tight on the system when scrubbing. Once the legacy scrub completes, would you mind testing out the small patch in the PR (commit 87b46d6)?
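Roughly, pulling and building the PR looks like the following, assuming your clone's origin points at openzfs/zfs and you build from source as before (the local branch name is arbitrary, and "make deb" should produce packages instead of a direct install if that matches your setup better):
$ git fetch origin pull/13537/head:pr-13537   # fetch the PR into a local branch
$ git checkout pr-13537
$ sh autogen.sh && ./configure
$ make -s -j$(nproc)
$ sudo make install && sudo ldconfig && sudo depmod -a
# then reload the zfs modules (or simply reboot) before starting the next scrub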
Brian,

$ zfs version

edit: I have stopped the urbackupsrv process during the scrub, just to give it the most potential resources to complete.
Good news: it got 38% through a scrub without OOMing. Bad news: it took several days to get to 38%, as it's only scanning ~18MB/s. I'm not sure why it is going so slowly; I have not tweaked any of the ZFS settings for legacy scan or anything else since rebuilding the ZFS utils/kmods. Also, after several days and 38%, I had a power outage that forced everything to be rebooted, and the scrub resumed. I'm not sure if either of these affects the overall testing.

Just to double-check:

/sys/module/zfs/parameters$ cat zfs_scan_legacy
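(For anyone following along, both the tunable and the current scan rate can be checked without interrupting the scrub; the pool name here is just illustrative, per the layout in the issue description:)
$ cat /sys/module/zfs/parameters/zfs_scan_legacy   # 1 = legacy scan enabled
$ zpool status tank | grep -A 2 'scan:'            # the scan: line reports the current rate and progress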
After rebooting and resuming the scrub, it's going at ~80MB/s, considerably faster than before. It's 73% done at the moment, which is well beyond the point it would have OOMed in the past. I think the patch can tentatively be considered working. I will reply once more when the scrub completes, assuming it doesn't fail some time in the next 16 hours.

scan: scrub in progress since Sun Jun 12 18:39:43 2022

The scrub had reached ~38% from Sunday until yesterday; the rest has been scrubbed in the last 22 hours since the machine was rebooted after the power failure.

$ free
$ w

Only up 22h since the reboot; load is about what I'd expect while doing a scrub. Latency is fine, although disk I/O is somewhat degraded, as expected.

$ zfs version

Making sure I'm still running the right patch.
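In case it's useful to anyone else watching memory during a scrub, a loop like this could snapshot overall memory and ARC size to disk so the numbers survive a crash (log path and interval are arbitrary; arcstats path is the standard ZFS-on-Linux location):
# log memory and ARC size once a minute while the scrub runs
while true; do
    { date; free -m; grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats; } >> /var/tmp/scrub-mem.log
    sync
    sleep 60
done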
Success, it completed with no OOM!
That's great news. This fix is already in the master branch and will be included in the planned 2.1.5 release. Then I think we can close this out.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
Confirming close. No OOMs since the original problem was fixed. |
System information
Describe the problem you're observing
zpool scrub causes a system-wide OOM after several hours, killing the machine
Describe how to reproduce the problem
zpool scrub [pool name]
wait anywhere from an hour to many hours
check the console to find that the system has gone OOM and died
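Condensed, the reproduction is just (pool name is illustrative):
$ zpool scrub tank       # start the scrub
$ zpool status tank      # check on it over the following hours
# ...anywhere from an hour to a day later, the machine goes OOM and becomes unresponsive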
Include any warning/errors/backtraces from the system logs
Unfortunately, I don't have any core dumps or logs because the whole system bombs when it goes OOM and can't do anything. If there is a way I can grab a kernel dump or any kind of logging beforehand, I am more than willing to give it a go.
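If it helps, a couple of low-effort things that should survive the crash on Ubuntu (a sketch, not something I have verified on this box):
$ sudo mkdir -p /var/log/journal          # journald keeps logs across reboots once this directory exists
$ sudo systemctl restart systemd-journald
$ sudo apt install linux-crashdump        # Ubuntu's kdump setup; a panic should leave a dump under /var/crash
$ journalctl -b -1                        # after the crash and reboot, read the previous boot's log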
System Information
The pool was originally created with the Ubuntu 20.04 default ZFS version (0.8.4 or thereabouts). I have not yet done a "zpool upgrade" on it, in case I want to go back to the Ubuntu default version. I'm currently running 2.0.7, built from git source and installed as debs.
Encryption is enabled on the "main" vol, with sub-vols created under that, e.g. (a rough creation sketch follows the list):
tank/
tank/encrypted/
tank/encrypted/backup_1/
tank/encrypted/backup_2/
tank/encrypted/urbackup/
etc.
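For illustration, a layout like that is created roughly as follows (the encryption options shown are typical choices, not necessarily the exact ones used here):
$ zfs create -o encryption=on -o keyformat=passphrase tank/encrypted
$ zfs create tank/encrypted/backup_1
$ zfs create tank/encrypted/backup_2
$ zfs create tank/encrypted/urbackup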
I know USB isn't a "recommended" bus type for ZFS, but this is my nearline/backup server, the storage for which exists in an easy-to-access external cabinet that is part of my "bug out" kit. I'm in FL, we get storms and hurricanes and probably plagues of frogs eventually, if I have to leave for any emergency I want to be able to easily grab my backup and bring everything with me without carrying a 40lb server. (Yes I also have offsite and other backups, but I'm super paranoid.)
If I try to scrub the external array, it will eventually OOM. It may happen in an hour, it may happen in a day. Unfortunately, I have never "caught" it going OOM. Everything seems nominal until it's not, in terms of memory, CPU, and disk I/O. It doesn't matter how long I leave the scrub running; if I stop the scrub before the system OOMs, it's fine.
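(For reference, stopping or pausing the scrub before the OOM is just the usual; pool name is illustrative:)
$ zpool scrub -p tank    # pause the scrub (a plain 'zpool scrub tank' resumes it)
$ zpool scrub -s tank    # or stop it outright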
I'm currently doing a scrub on it with zfs_scan_legacy=1, as suggested in issue #11574.
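Enabling it looks roughly like this (the modprobe.d filename is just the conventional one, not necessarily what I used):
$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_scan_legacy                   # takes effect immediately
$ echo 'options zfs zfs_scan_legacy=1' | sudo tee -a /etc/modprobe.d/zfs.conf    # persists across reboots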
It's been going for a few days now without OOM, with several days left to go. (Legacy scan is considerably slower than the current scan mode: it would typically get over 10% within a couple of hours before eventually dying, whereas after 3+ days of legacy scan it's only at 20%.)
I have yet to try connecting the enclosure to another machine, or connecting the drives directly to the SATA bus on the motherboard. Once the current scrub either completes or dies, that is my next step, to see whether I can complete a scrub without having to resort to legacy mode.
The machine periodically runs rsync against the "main" file server to back it up, as well as urbackup to back up the Windows machine I have. I can stop all rsync/urbackupsrv tasks and let it scrub, and it will still OOM after a while. It has been suggested that, because urbackup creates so many files in its directories, ZFS is dying trying to read all the metadata and I need to add more RAM. I'd prefer to find a solution that fixes rather than band-aids the problem.
Running multiple rsyncs and urbackup jobs simultaneously taxes the system as expected, but it handles it with no issues. A scrub seems to be the only thing that will take it down.