Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes RemoveOrphanFiles delete files unexpected #2890

Closed
wants to merge 4 commits into from

Conversation

ConeyLiu
Copy link
Contributor

RemoveOrphanFiles use actualFileDF leftanti join validFileDF to determine which files should be removed. We will list all the files under the provided or table location directory with FileSystem.listStatus and create the actualFileDF. validFileDF is created by index those metadata file and reference.

However, FileSystem.listStatus will qualify the given path. For example: a path: hdfs:/path will be qualified with hdfs://host:port/path. If the warehouse is set as: hdfs:/path:

validFileDF:
hdfs:/path/file1
hdfs:/path/file2
hdfs:/path/file3
....

actualFileDF:
hdfs://host:port/path/file1
hdfs://host:port/path/file2
hdfs://host:port/path/file3
....

Then, all the files in actualFileDF will be treated as invalid.

In this patch, we only compare the pure path (remove the schema and authority) when doing the leftanti join.

Updated existed UTs to test it.

ConeyLiu added 2 commits July 29, 2021 15:00
Signed-off-by: Xianyang Liu <xianyangliu@tencent.com>
@aokolnychyi
Copy link
Contributor

There have been multiple discussions around this. I'll try to fetch the old thread on slack later today.

@ConeyLiu
Copy link
Contributor Author

Hi @aokolnychyi, could you help to review this? Thanks a lot.

@ConeyLiu
Copy link
Contributor Author

gentle ping @rdblue @aokolnychyi, could you help to review this? Thanks a lot.

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 18, 2024
* File path: mock://localhost:9000/path will be resolved as mock://localhost:9000/path
*
*/
public class MockFileSystem extends RawLocalFileSystem {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this store data in memory?

@findepi
Copy link
Member

findepi commented Jul 18, 2024

In this patch, we only compare the pure path (remove the schema and authority) when doing the leftanti join.

This sounds reasonable to me, but there is one caveat.
The table files may span multiple buckets, so comparing without schema and authority may not be quite correct.
I think it's fine because

  • in worst case we don't clean some orphan files, which isn't unsafe
  • for tables spanning multiple locations, does orphan removal work at all?

but someone else should confirm that.

@github-actions github-actions bot removed the stale label Jul 19, 2024
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 21, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants