Optimized Large File Deletion to Prevent OOM #16708
Comments
I assume you're talking about (at least): #6783 #16037 #16697. If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example). This specific method can't be done, as …
Maybe it's a little bit off-topic, but ZFS frees blocks, not files (the DDT is per-block too), so you can truncate part of your file (and iterate over the whole file), and only unused blocks would be freed. Maybe it's a workaround, yes. Hope I didn't miss something.
Yes, you're absolutely correct. My approach to finding a solution to this issue went as follows: …

That's why I decided to share this approach with the community: to discuss possible ways to implement such a mechanism within the ZFS codebase.
In searching for a solution to this issue, I reviewed all the issues you referenced. I understand that the problem isn't specifically limited to deduplication; it's broader in scope. However, in the case of deduplication, this problem is 100% reproducible and testable. That's why I chose a more general title for this issue.
Yep, and you can do tricks with truncation. It also wouldn't solve the problem properly anyway, because the real problem is the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files on the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files. It's not even theoretically limited to filesystems; any object could do it.

I'd be curious to know: if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), would it do the same thing? If it didn't, I expect it would be more to do with the locking differences in zvols compared to filesystems, not the underlying block structure.

So yeah, if controlling this way from userspace with …
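For concreteness, the zvol test being proposed could look roughly like the sketch below; the pool name (`tank`), zvol name, and volblocksize are placeholders, not anything specified in the thread.

```sh
# Create a 1T zvol (sparse, so it doesn't need to be preallocated).
zfs create -s -V 1T -o volblocksize=16K tank/bigvol

# Fill it end to end with random data; every written block gets allocated,
# and with dedup=on each unique block lands in the DDT.
dd if=/dev/urandom of=/dev/zvol/tank/bigvol bs=1M oflag=direct status=progress

# Discard the whole device in one operation, then watch memory while the
# resulting frees are processed.
blkdiscard /dev/zvol/tank/bigvol
```

The question raised above is whether the single whole-device discard produces the same flood of frees on one transaction as a single huge unlink does.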
I haven't looked there lately and may misremember, but IIRC we've had a mechanism to throttle deletes and split them between transaction groups. I am not sure it would help with a single huge file, but for many smaller ones it would be the proper solution.
@robn I currently have access to the same host described in my experiment, but with smaller NVMe drives. The maximum size for the ZFS pool I can create is approximately 1.09 TB, which would allow me to create a zvol of around 800-900 GB, assuming the pool is filled to 80-90%. I would be happy to assist in gathering information to tackle this broader problem. Please provide the parameters for the zvol experiment, including the zvol size and block size. I will fill it with random data and then perform a blkdiscard.

Also, please clarify what specific data you are looking to obtain from this experiment. If I understand correctly, you aim to test the hypothesis regarding the sequential discarding of blocks and its impact on memory behavior. The …
Possibly you mean the … For big objects though, it just ends up adding the entire object length to … Anyway, I think I have a plan now: repurpose …
@robn I am not sure what exactly I mean, but you may see that …
Ahh yeah, that might be it. And I understand why it's not working here. In … I'm currently looking at … In the longer term, the whole zio pipeline needs a lot of work. Reducing …
@robn

Filling: …

blkdiscard by default, without specifying a step, discards all data.

Without deduplication: …

With deduplication: …
@serjponomarev thanks for all the info. The zvol/blkdiscard test supported the theory. I've been able to reproduce in the lab, and I have a patch which should help. I'm still completing testing but I should be able to post a PR later today. If you're able, could you please rerun your test with this patch? Thanks!
@robn Environment:
For testing, all data written was purely random to maximize the deduplication table. I tested both native deduplication and fast-deduplication methods, performing deletion of a 900 GB file and discarding a 900 GB zvol. This resulted in four test cases in total. Additionally, the … To test it further, I increased this parameter by 10x and then reduced it by 10x: …

I monitored the behavior of this parameter through …
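The comment above does not preserve which tunable was being varied, so the following is only the general pattern for that kind of 10x up/down experiment on Linux; `some_tunable` is a stand-in name, not the actual parameter.

```sh
# OpenZFS module parameters are exposed under /sys/module/zfs/parameters on Linux.
default=$(cat /sys/module/zfs/parameters/some_tunable)

# Try 10x the default, rerun the delete/discard test, then try 1/10th.
echo $((default * 10)) > /sys/module/zfs/parameters/some_tunable
# ... run the test, record memory behaviour ...
echo $((default / 10)) > /sys/module/zfs/parameters/some_tunable

# Restore the original value afterwards.
echo "$default" > /sys/module/zfs/parameters/some_tunable
```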
@serjponomarev this is fantastic info. You've confirmed pretty much exactly what I was hoping it would do, in more ways than I thought of. Thanks so much! PR already posted in #16722.
@robn Maybe I am missing something, but is …
@shodanshok I don't believe so. As I understand it, that sets the threshold for how far frees can go on a single txg, but it doesn't do any splitting; it just doesn't allow any more once you've gone past it. So if you decide to put a single 2T "free range" on a txg, it goes in and no more will be allowed, but by then it's too late.
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.

This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722
Closes openzfs#16697
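One way to observe the failure mode the commit message describes (this is not part of the patch itself) is to watch the `zio_cache` slab grow from userspace while a large free is in flight, assuming a Linux system with the SPL proc interface:

```sh
# zio_cache backs zio_t allocations (1280 bytes each at the time of the commit).
# Watch its footprint climb while the huge free is being processed; depending on
# how the cache is backed it may show up in either of these files.
watch -n1 'grep zio_cache /proc/spl/kmem/slab /proc/slabinfo 2>/dev/null'
```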
Describe the feature you would like to see added to OpenZFS
I propose adding an iterative approach for deleting large files in ZFS pools with deduplication enabled. Instead of calling `unlink` to remove the entire file at once, we can implement a mechanism that reduces the file size from the end, freeing blocks incrementally.

How will this feature improve OpenZFS?
This feature addresses the issue of Out-Of-Memory (OOM) errors that occur when deleting large files. Currently, when `unlink` is called, ZFS loads all entries from the Deduplication Data Table (DDT) related to the file into memory, which can lead to memory overload, especially on systems with limited RAM. By implementing an iterative file reduction process, we can significantly reduce memory consumption and improve stability.

Additional context
The proposed algorithm includes the following steps:

1. Iterative truncation: shrink the file from the end in fixed-size steps, so that blocks (and their DDT entries) are freed incrementally rather than all at once.
2. Final `unlink` call: once the file is completely truncated, perform a final `unlink` to remove any remaining metadata.

Benefits:
Experimental Evidence
The following experiment demonstrates the basis for this proposed improvement:
Environment: dataset with `recordsize=16K`.

Procedure:
Populate the pool with a file containing random data to fully utilize the DDT:
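A stand-in for this population step (the original command is not shown above), writing incompressible random data into `/zpool/test.io` on the dedup-enabled, `recordsize=16K` dataset:

```sh
# The count here is a placeholder (~1 TiB); the size used in the original
# experiment is not recorded in this step.
dd if=/dev/urandom of=/zpool/test.io bs=1M count=1048576 status=progress
```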
Attempt to delete the file using `rm /zpool/test.io`, resulting in an OOM event.

Reboot and delete the file iteratively, reducing its size by 1 GB in each iteration before final deletion:
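A minimal sketch of that iterative deletion, assuming a 1 GB step as described (the original loop is not shown above):

```sh
size=$(stat -c %s /zpool/test.io)      # current file size in bytes
step=$((1024 * 1024 * 1024))           # shrink by 1 GiB per iteration

while [ "$size" -gt 0 ]; do
    size=$(( size > step ? size - step : 0 ))
    truncate -s "$size" /zpool/test.io # drop the tail blocks of the file
    sync                               # let each batch of frees land in its own txg
done

rm /zpool/test.io                      # final unlink removes the remaining metadata
```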
Observation:
Memory consumption can be monitored with `watch arc_summary` throughout the process.