OOM after files remove with dedup on and fast dedup enabled #16697
Comments
@robn yeah, saw those issues but decided to post on a fresh rc2. I will post /proc/spl/kmem/slab shortly.
@robn please find the slab before starting file removal:
slab prior to OOM event:
slab after OOM:
@jtblck90 thanks for all the info. I've been able to reproduce in the lab, and I have a patch which should help. I'm still completing testing but I should be able to post a PR later today. If you're able, could you please rerun your test with this patch? Thanks! Note that this won't do anything about the […]
@robn Thanks! I will test the patch and get back to you with the results.
@robn I have tested your patch by removing 4x2TB files from a zpool with the same configuration as above. The […]. However, once the actual space reclamation started and the used size on the zpool started decreasing, I monitored the zpool state with […]. I performed a second test, but this time I decreased the […]. Basically, I just needed to tune the parameter above and everything worked! By the way, do you have any idea when zfs 2.3.0 might be released and whether your patch will be included in it?
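(The exact monitoring command is elided in the comment above. For readers following along, a hedged sketch of one common way to watch background space reclamation — not necessarily what was used here, and the pool name `tank` is an assumption:)

```sh
# Space still queued for asynchronous freeing (drops toward 0 as reclamation proceeds)
zpool get freeing tank
# Block until all queued frees have completed (OpenZFS 2.x)
zpool wait -t free tank
```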
@jtblck90 we'll pull it back into the 2.3.0 release branch once the PR is finalized and merged to master.
@behlendorf Thank you, that's great news! Perhaps you have some insight into when we could expect the full 2.3.0 release?
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline. Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.

This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722
Closes openzfs#16697
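To make the commit message's arithmetic concrete, a quick illustrative check (plain shell arithmetic, no ZFS interaction):

```sh
# One 2 TiB object written with 128 KiB records:
echo $(( 2 * 1024**4 / (128 * 1024) ))                    # 16777216 blocks (~16M)
# At ~1280 bytes per zio_t, issuing a FREE zio for every block would need:
echo $(( 2 * 1024**4 / (128 * 1024) * 1280 / 1024**3 ))   # 20 (GiB of zio_cache)
```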
System information
A testing VM has 64GB of RAM, with 32GB set aside for the ZFS ARC via the min and max parameters. A RAIDZ1 pool is configured with deduplication, and the fast dedup feature is enabled and active.
zpool status
zpool config
zfs config
zpool status with DDT (zpool status -D)
arcstat
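For reference, the 32GB ARC floor/ceiling described in the system information above is typically pinned on Linux via the zfs_arc_min and zfs_arc_max module parameters; a minimal sketch (the exact values here are assumptions matching the 32GB figure in this report):

```sh
# 32 GiB = 34359738368 bytes; runtime change via module parameters
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_min
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
# Persistent equivalent in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_min=34359738368 zfs_arc_max=34359738368
```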
Describe the problem you're observing
When writing large files, which in my test was 4 files of 2TB each (8TB total), on a zpool with dedup enabled and the fast dedup feature active, all of the ARC is used and total RAM consumption sits at around 47GB. When deleting the files, RAM usage grows and the system goes into OOM. This can be reproduced with other recordsizes as well (tested with 16K and 128K recordsize), and also with a lower amount of data and less RAM. The same can be observed with lots of small files occupying the same total space on the zpool. If removing small files one by one, they can be deleted, but attempting to remove lots of 1GB files simultaneously results in OOM. After a reset, the zpool cannot be imported, resulting in the same OOM condition.
Describe how to reproduce the problem
Write several large files on a zpool with deduplication and fast dedup enabled. In my experiment this was 4x2TB files with 64GB of total RAM, or 4x1TB files with a lower amount of RAM (32GB). Try to remove the files with `rm` (see the sketch below).
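A hedged reproduction sketch (pool name, device names and loop count are placeholders standing in for the setup described above; on 2.3.0-rc builds the fast_dedup pool feature is generally enabled when the pool is created):

```sh
# Create a RAIDZ1 pool and turn dedup on
zpool create tank raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
zfs set dedup=on recordsize=128K tank

# Write several large files (4 x 2 TiB in the original report; smaller sizes also reproduce)
for i in 1 2 3 4; do
    dd if=/dev/urandom of=/tank/file$i bs=1M count=2097152
done

# Remove them all at once and watch memory usage climb (e.g. with arcstat and free -m)
rm -f /tank/file1 /tank/file2 /tank/file3 /tank/file4
```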
Include any warning/errors/backtraces from the system logs
I cannot find the OOM messages in the journal after the reset, so I'm attaching a screenshot here.
From the journal log, I see the following events:
Oct 29 04:45:17 zfs-rc2-test kernel: Large kmem_alloc(74904, 0x1000), please file an issue at: https://github.com/openzfs/zfs/issues/new
Attaching full journal logs and dmesg logs just in case.
log.txt
dmesg.txt