OOM on zfs scrub #13546

Closed
dglidden opened this issue Jun 10, 2022 · 8 comments

Labels: Status: Stale (No recent activity for issue) · Type: Defect (Incorrect behavior, e.g. crash, hang)

dglidden commented Jun 10, 2022

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  20.04.4
Kernel Version        5.13.0-44-generic
Architecture          x86_64
OpenZFS Version       zfs-2.0.7-1

Describe the problem you're observing

Running zpool scrub causes a system-wide OOM after several hours, killing the machine.

Describe how to reproduce the problem

1. zpool scrub [pool name]
2. Wait anywhere from an hour to many hours.
3. Check the console: the system has gone OOM and died.

Include any warning/errors/backtraces from the system logs

Unfortunately I don't have any kind of core dump or logs, because the whole system bombs when it goes OOM and I can't do anything. If there is a way I can grab a kernel dump or any kind of logging beforehand, I am more than willing to give it a go.
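
(A rough sketch of one way to capture evidence before the box dies, assuming Ubuntu's stock linux-crashdump/kdump-tools packages; the exact sysctl values are only illustrative:)

$ sudo apt install linux-crashdump                          # kdump-tools: write a kernel dump on panic
$ echo 'vm.panic_on_oom = 1' | sudo tee /etc/sysctl.d/99-oom-debug.conf
$ echo 'kernel.panic = 30' | sudo tee -a /etc/sysctl.d/99-oom-debug.conf
$ sudo sysctl --system                                      # panic (and therefore dump) instead of limping on after the OOM
$ sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald   # keep the journal across reboots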

System Information

Component   Type
CPU         i3-4160 @ 3.6GHz
Memory      16GB RAM
Swap        16GB swap
Bus Type    External USB3 "PROBOX" 4-drive enclosure (I know, I know, bear with me here, I'll explain in a minute)
Drives      4x 6TB Seagate NAS drives (NOT SHINGLED)
Format      RAIDZ1

SIZE   ALLOC  FREE
21.8T  15.3T  6.46T

The pool was originally created with the Ubuntu 20.04 default ZFS version (0.8.4 or thereabouts). I have not yet run "zpool upgrade" on it, in case I want to go back to the Ubuntu default version. 2.0.7 was built from git source and installed as debs.
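
(Side note: you can see what an upgrade would enable without committing to it; a quick sketch, assuming the pool is named "tank":)

$ zpool upgrade                          # lists pools that have supported features not yet enabled
$ zpool get all tank | grep feature@     # shows each feature flag as disabled/enabled/active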

Encryption is enabled on the "main" volume, with sub-volumes created under that, e.g.:

tank/
tank/encrypted/
tank/encrypted/backup_1/
tank/encrypted/backup_2/
tank/encrypted/urbackup/

etc.

I know USB isn't a "recommended" bus type for ZFS, but this is my nearline/backup server, the storage for which exists in an easy-to-access external cabinet that is part of my "bug out" kit. I'm in FL, we get storms and hurricanes and probably plagues of frogs eventually, if I have to leave for any emergency I want to be able to easily grab my backup and bring everything with me without carrying a 40lb server. (Yes I also have offsite and other backups, but I'm super paranoid.)

If I try to scrub the external array, it will eventually OOM. It may happen in an hour, it may happen in a day. Unfortunately, I have never "caught" it going OOM. Everything seems nominal until it's not, in terms of memory, CPU, and disk I/O. It doesn't matter how long I leave the scrub running, if I stop the scrub before the system OOMs, it's fine.
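
(One way to at least catch the memory state on its way down is to log it continuously to disk; a minimal sketch, with the log path, interval, and choice of arcstats fields picked arbitrarily:)

$ while true; do date; free -m; grep -E '^(size|arc_meta_used|arc_meta_limit) ' /proc/spl/kstat/zfs/arcstats; sleep 60; done >> /root/scrub-mem.log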

I'm currently running a scrub with zfs_scan_legacy=1, as suggested in issue #11574.
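
(For reference, the tunable can be flipped at runtime; the modprobe.d line is only needed if it should persist across reboots:)

$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_scan_legacy
$ echo 'options zfs zfs_scan_legacy=1' | sudo tee -a /etc/modprobe.d/zfs.conf   # optional: persist across reboots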

It's been going for a few days now without an OOM, with several days left to go. (Legacy scan is considerably slower than the current scan mode: the default scan would typically get past at least 10% within a couple of hours before eventually dying, whereas after 3+ days of legacy scanning it's only at 20%.)

I have yet to try connecting the enclosure to another machine, or connecting the drives directly to the SATA bus on the motherboard. That's my next step, once the current scrub either completes or dies, to see if I can complete a scrub without resorting to legacy mode.

The machine periodically runs rsync against the "main" file server to back it up, as well as urbackup to back up the Windows machine I have. I can stop all rsync/urbackupsrv tasks and let it scrub and it will still OOM after a while. It has been suggested that, because urbackup creates so many files in its directories, ZFS is dying trying to read all the metadata and I need to add more RAM. I'd prefer to find a solution that fixes rather than bandaids the problem.

Running multiple rsyncs and urbackup jobs simultaneously taxes the system as expected, but it handles it with no issues. A scrub seems to be the only thing that will take it down.

dglidden added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Jun 10, 2022
behlendorf (Contributor) commented

I'd prefer to find a solution that fixes rather than bandaids the problem.

As would I, and we may in fact have just merged a fix for this. The patch in #13537 resolves a sequential scan memory accounting bug which would cause ZFS to underestimate the amount of memory in use. This could potentially lead to an OOM if memory was tight on the system when scrubbing.

Once the legacy scrub completes, would you mind testing out the small patch in the PR (commit 87b46d6)?
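
If it helps, here's a rough sketch of one way to pull that commit onto a 2.0.7 base and rebuild the debs (the cherry-pick may need minor adjustment on the 2.0 branch, and your build flow may differ):

$ git clone https://github.com/openzfs/zfs.git && cd zfs
$ git checkout zfs-2.0.7            # tag matching the installed release
$ git cherry-pick 87b46d6           # the accounting fix from #13537
$ sh autogen.sh && ./configure
$ make -j$(nproc) deb               # builds the utils and kmod packages
$ sudo dpkg -i ./*.deb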


dglidden commented Jun 12, 2022

Brian,
Thanks for the reply. I gave up on the slow scrub since it had been at "6 days left" for the last two days. I've applied that PR and made sure everything is up to date on the system. I believe I pulled in the right code, as dsl_scan.c was the only file modified and I checked that the change was present. I rebuilt the ZFS utils/kmods and have started a scrub. I will follow up with results when it either completes or dies.

$ zfs version
zfs-2.0.7-1_gd84d6b905
zfs-kmod-2.0.7-1_gd84d6b905

Edit: I have stopped the urbackupsrv process during the scrub, just to give it as many resources as possible to complete.


dglidden commented Jun 16, 2022

Good news: it got 38% through a scrub without OOMing.

Bad news: it took several days to get to 38%, as it's only scanning at ~18MB/s. I'm not sure why it is going so slowly; I have not tweaked any of the ZFS settings for legacy scan or anything else since rebuilding the ZFS utils/kmods. Also, after several days and 38%, a power outage forced a reboot, and the scrub resumed afterwards. I'm not sure whether either of these affects the overall testing.

Just to double-check:

/sys/module/zfs/parameters$ cat zfs_scan_legacy
0

dglidden commented

After rebooting and resuming the scrub, it's going at ~80MB/s, considerably faster than before. It's 73% done at the moment, which is well beyond the point it would have OOMed in the past. I think the patch can tentatively be considered working. I will reply once more when the scrub completes, assuming it doesn't fail some time in the next 16 hours.

scan: scrub in progress since Sun Jun 12 18:39:43 2022
12.3T scanned at 82.2M/s, 11.3T issued at 70.3M/s, 15.4T total
0B repaired, 73.84% done, 16:38:21 to go

The scrub had reached ~38% between Sunday and yesterday; the rest has been scrubbed in the ~22 hours since the reboot after the power failure.

$ free
              total        used        free      shared  buff/cache   available
Mem:       16255844    10666684      516564        2960     5072596     5246096
Swap:      16777212        1024    16776188

$ w
15:39:48 up 22:44, 1 user, load average: 5.14, 5.04, 4.96

Only up 22h since the reboot; load is about what I'd expect while doing a scrub. Latency is fine, although disk I/O is somewhat degraded, as expected.

$ zfs version
zfs-2.0.7-1_gd84d6b905
zfs-kmod-2.0.7-1_gd84d6b905

Making sure I'm still running the right patch.

dglidden commented

Success, it completed with no OOM!

behlendorf (Contributor) commented

That's great news. The fix is already in the master branch and will be included in the planned 2.1.5 release, so I think we can close this out.


stale bot commented Jun 23, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the Status: Stale (No recent activity for issue) label on Jun 23, 2023
dglidden commented

Confirming close. No OOMs since the original problem was fixed.
