[Feature]: Remove the Independent DeleteFiles for the Iceberg Format Table #1628

Closed
1 of 2 tasks
wangtaohz opened this issue Jun 29, 2023 · 2 comments · Fixed by #1664 or #1825
Labels
type:feature Feature Requests

Comments

@wangtaohz
Contributor

Description

For Iceberg Format tables, some scenarios produce DeleteFiles that do not apply to any DataFile. We call these Independent DeleteFiles. The scenarios include:

  • when upsert mode is enabled, a delete is written for each record before it is inserted, even if the record does not exist yet
  • delete files that are not automatically cleaned up after files are rewritten
  • ...

Although a table scan never picks up these Independent DeleteFiles, so they do not affect read performance, they accumulate over time and put pressure on the file system.

We should remove these Independent DeleteFiles periodically.
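
To make the definition concrete, here is a minimal sketch of how Independent DeleteFiles could be detected. It is not the project's implementation: the `DataFile` and `DeleteFile` classes and the function names are hypothetical, and the matching rules are a simplified reading of the Iceberg v2 spec (an equality delete applies only to data files in the same partition with a strictly smaller sequence number; a position delete applies only to the data files it references, with a sequence number that is not larger). Partition-spec handling and global equality deletes are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class DataFile:
    # Hypothetical, simplified stand-in for an Iceberg data file entry.
    path: str
    partition: str
    sequence_number: int


@dataclass(frozen=True)
class DeleteFile:
    # Hypothetical, simplified stand-in for an Iceberg delete file entry.
    path: str
    partition: str
    sequence_number: int
    kind: str  # "equality" or "position"
    # Only meaningful for position deletes: the data files they point at.
    referenced_data_files: FrozenSet[str] = frozenset()


def applies_to(delete: DeleteFile, data: DataFile) -> bool:
    """Return True if the delete file can affect rows of the data file."""
    if delete.partition != data.partition:
        return False
    if delete.kind == "equality":
        # Equality deletes only affect strictly older data files.
        return data.sequence_number < delete.sequence_number
    # Position deletes affect the referenced files that are not newer.
    return (
        data.sequence_number <= delete.sequence_number
        and data.path in delete.referenced_data_files
    )


def find_independent_delete_files(
    data_files: List[DataFile], delete_files: List[DeleteFile]
) -> List[DeleteFile]:
    """Delete files that apply to none of the live data files."""
    return [
        d for d in delete_files
        if not any(applies_to(d, f) for f in data_files)
    ]
```

Under this model, an equality delete written in the same commit as an upserted record shares the new data file's sequence number, so it applies to nothing unless an older data file exists in that partition. A position delete whose referenced data file has been rewritten away is independent in the same sense.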

Use case/motivation

In our scenario, after several days of testing, we found thousands of Independent DeleteFiles, including both equality-delete files and position-delete files, far more than the number of DataFiles. Removing them can greatly reduce the total number of files on disk.

Describe the solution

I suggest removing Independent DeleteFiles during Orphan files cleanup because Independent DeleteFiles can be considered a special type of Orphan Files.

Of course, a table property, like clean-independent-delete-files.enabled (default true), should be introduced to control whether to clean Independent DeleteFiles for the table.
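
As a rough illustration of the proposed switch, the sketch below reads the property from a plain dictionary of table properties and falls back to the suggested default of true. The property name is the one proposed above; the function name and the dictionary-based lookup are assumptions for illustration, not an existing API.

```python
from typing import Mapping

# Property name as proposed in this issue; the default follows the suggestion.
CLEAN_INDEPENDENT_DELETE_FILES_ENABLED = "clean-independent-delete-files.enabled"
CLEAN_INDEPENDENT_DELETE_FILES_ENABLED_DEFAULT = True


def is_independent_delete_cleanup_enabled(table_properties: Mapping[str, str]) -> bool:
    """Hypothetical helper: decide whether cleanup should run for a table."""
    raw = table_properties.get(CLEAN_INDEPENDENT_DELETE_FILES_ENABLED)
    if raw is None:
        # Property not set on the table: use the proposed default.
        return CLEAN_INDEPENDENT_DELETE_FILES_ENABLED_DEFAULT
    # Table properties are stored as strings, so parse leniently.
    return raw.strip().lower() == "true"
```

The orphan files cleanup would then skip Independent DeleteFiles for any table where this check returns false.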

We can further discuss whether this implementation is appropriate.

Subtasks

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


@wangtaohz wangtaohz added the type:feature Feature Requests label Jun 29, 2023
@wangtaohz
Contributor Author

This tool, TestScanTable, can help you scan for Independent DeleteFiles.

@wangtaohz
Contributor Author

Rename Independent delete files to Dangling delete files, to be consistent with the terminology of the Iceberg community: apache/iceberg#6581
