
WIP: Support Snapshot Copy-On-Write Hudi Table to Iceberg Table #6642

Closed

Conversation

@JonasJ-ap (Contributor) commented Jan 22, 2023

This PR is under construction, but I want to put it here for some initial feedback and discussion about the conversion from Apache Hudi to Apache Iceberg.

Overview

This PR aims to add a module called iceberg-hudi, which contains a public API and a base implementation to snapshot a Hudi table to an Iceberg table. The base implementation is expected to rely on the hudi-common module to extract the metadata, timeline, data file locations, and other information necessary for the conversion.
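
As a rough illustration, the public API could look something like the sketch below, modeled after the Delta Lake snapshot action in #6449. All class and method names here are hypothetical, not the final API:

```java
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical shape of the public snapshot action; names are illustrative only.
public interface SnapshotHudiTable {

  // sets the identifier of the Iceberg table to be created
  SnapshotHudiTable as(TableIdentifier identifier);

  // sets the Iceberg catalog in which the new table is created
  SnapshotHudiTable icebergCatalog(Catalog catalog);

  // adds a table property to the new Iceberg table
  SnapshotHudiTable tableProperty(String name, String value);

  // runs the conversion and returns summary information
  Result execute();

  interface Result {
    // number of data files registered in the new Iceberg table
    long snapshotDataFilesCount();
  }
}
```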

Copy-on-write (COW) and merge-on-read (MOR) are the two types of Hudi tables. As the initial implementation, this PR focuses on the conversion logic for COW tables.

The overall structure of the module is expected to be similar to #6449. However, things may change since Hudi is different from Delta Lake. Also, due to the complexity of the conversion, I may write a proposal later for further discussion in the community.

High-level Ideas

The base implementation of the snapshot action involves schema conversion and timeline replay. The idea is to map every completed COMMIT action on the timeline to an Iceberg snapshot. The COW concept maps to Iceberg's overwrite operation: for every update of a data file, we delete the previous version of the data file and add the newly created data file to the Iceberg table.
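
For example, a single completed COMMIT could be replayed as one Iceberg overwrite roughly like this. This is a minimal sketch: deriving the removed/added DataFile instances from the Hudi commit metadata is the actual work of the conversion.

```java
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.OverwriteFiles;
import org.apache.iceberg.Table;

// replay one completed Hudi COMMIT as one Iceberg snapshot
void replayCommit(Table icebergTable, List<DataFile> removedFiles, List<DataFile> addedFiles) {
  OverwriteFiles overwrite = icebergTable.newOverwrite();
  removedFiles.forEach(overwrite::deleteFile); // drop superseded file versions
  addedFiles.forEach(overwrite::addFile);      // register newly written files
  overwrite.commit();                          // produces one Iceberg snapshot
}
```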

Needs Further Investigation:

  1. How to handle the COMPACTION action on the timeline
  2. We may take advantage of the column_stats index stored in Hudi's metadata table rather than using FileIO to extract metrics from data files (see the sketch below).
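
For item 2, a sketch of what this could look like on the Iceberg side. Reading the stats out of Hudi's metadata table (the HudiColumnStats carrier and its population) is hypothetical; Metrics and DataFiles.Builder are the existing Iceberg API:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.PartitionSpec;

// hypothetical carrier for stats read from Hudi's column_stats metadata index,
// assumed to be already keyed by Iceberg field id
interface HudiColumnStats {
  Long rowCount();
  Map<Integer, Long> valueCounts();
  Map<Integer, Long> nullValueCounts();
  Map<Integer, ByteBuffer> lowerBounds();
  Map<Integer, ByteBuffer> upperBounds();
}

// builds an Iceberg DataFile without re-reading the file footer through FileIO
DataFile toDataFile(PartitionSpec spec, String path, long sizeInBytes, HudiColumnStats stats) {
  Metrics metrics =
      new Metrics(
          stats.rowCount(),
          null, // column sizes: not tracked in column_stats
          stats.valueCounts(),
          stats.nullValueCounts(),
          null, // NaN value counts: not tracked
          stats.lowerBounds(),
          stats.upperBounds());
  return DataFiles.builder(spec)
      .withPath(path)
      .withFileSizeInBytes(sizeInBytes)
      .withMetrics(metrics)
      .build();
}
```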

Dependency Issue

  1. hudi-common:0.12.2 has a dependency conflict with hudi-spark-bundle:0.12.2, which is intended to be used for integration tests. Version 0.12.0 is currently used as a workaround (see the sketch below).
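
A sketch of the workaround in the build file; the configuration names and bundle artifact are assumptions that depend on how the iceberg-hudi module is wired up:

```groovy
dependencies {
  // pinned to 0.12.0 until the conflict with the 0.12.2 Spark bundle is resolved
  implementation 'org.apache.hudi:hudi-common:0.12.0'

  // used only by integration tests
  testImplementation 'org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2'
}
```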

If you have any comments or suggestions on how to convert Hudi to Iceberg, please feel free to share them here. Thank you in advance.

@github-actions github-actions bot added the build label Jan 22, 2023
@jackye1995 jackye1995 self-requested a review January 22, 2023 08:05
@github-actions github-actions bot added the INFRA label Feb 2, 2023
@JonasJ-ap (Contributor, Author) commented:

[Curiosity] Which one is preferred for hudi-related variable names: hudixxxx or hoodiexxx

@jackye1995 (Contributor) commented:

> [Curiosity] Which one is preferred for hudi-related variable names: hudixxxx or hoodiexxx

Hudi should be used; Hoodie was the name used before the official project name was adopted.

@jackye1995 (Contributor) commented:

Took a brief look. Overall I agree with where the community discussion led: replaying the timeline is cool, but Hudi's concurrent transactions have awkward behavior, and we cannot guarantee the replay is always correct.

Instead, we can ask users to always compact before migration, so that we only need to offer the ability to migrate the latest compacted state of the Hudi table.

When new data comes in, the user can rerun compaction and then rerun this migration action, which will fully replace the previous version of the migrated Iceberg table.

This approach will work for both CoW and MoR tables, and we can add timeline replay as a follow-up feature if necessary. A possible user-facing flow is sketched below.
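
Hypothetically, the flow could look like this; the names mirror the Delta Lake actions provider from #6449 and are illustrative only:

```java
// 1. compact the Hudi table (done on the Hudi side, e.g. via a Spark job)
// 2. snapshot the latest compacted state into Iceberg:
SnapshotHudiTable.Result result =
    actionsProvider
        .snapshotHudiTable("s3://bucket/warehouse/hudi_table")
        .as(TableIdentifier.of("db", "migrated_table"))
        .icebergCatalog(catalog)
        .execute();
// 3. when new data arrives: compact again, rerun the action, and the migrated
//    Iceberg table is fully replaced with the newly compacted state
```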

What do you think?

@JonasJ-ap (Contributor, Author) commented:

Sounds good! Thank you for your suggestions. After the last community sync, I re-investigated the whole process of this demo and realized that there is no proper way to guarantee the correct order of timeline replay.

In the new proposal, the migration guarantee is that we always migrate the table at the state of the latest COMPACTION. In other words, we only include the latest base file in each file group of the Hudi table. This way, users can choose to run compaction before the migration if they want the most up-to-date table, or do nothing if they do not want to include newly arrived data.

I will start to investigate the proper way to implement this.
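
A rough sketch of how the latest base files could be collected with hudi-common, assuming the 0.12.x API:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hudi.common.model.HoodieBaseFile;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.view.HoodieTableFileSystemView;

// returns the latest base file of every file group in one partition; the union
// over all partitions is the state the snapshot action would migrate
List<HoodieBaseFile> latestBaseFiles(HoodieTableMetaClient metaClient, String partitionPath) {
  HoodieTableFileSystemView fsView =
      new HoodieTableFileSystemView(
          metaClient,
          metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
  return fsView.getLatestBaseFiles(partitionPath).collect(Collectors.toList());
}
```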

@github-actions (bot) commented:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 24, 2024
@github-actions (bot) commented:

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 12, 2024