WIP: Support Snapshot Copy-On-Write Hudi Table to Iceberg Table #6642
[Curiosity] Which one is preferred for hudi-related variable names:
It should be Hudi; Hoodie was the name used before the project got its official name.
Took a brief look. Overall I agree with where the community discussion landed: replaying the timeline is cool, but Hudi concurrent transactions have awkward behavior and we cannot guarantee the replay is always correct. Instead, we can ask users to always compact before migration, so that we only need to offer the ability to migrate the latest compacted table in Hudi. When new data comes in, the user can rerun compaction and then rerun this migration action, and the action will fully replace the previous version of the migrated Iceberg table. This approach works for both CoW and MoR tables, and we can add timeline replay as a follow-up feature if necessary. What do you think?
Sounds good! Thank you for your suggestions. After the last community sync, I re-investigated the whole process of this demo and realized that there is no proper way to guarantee the correct order of timeline replay. In the new proposal, the expected migration guarantee is that we always migrate the table at the state of the latest COMPACTION. In other words, we only include the most up-to-date base file in each file group of the Hudi table. This way, users can run compaction before the migration if they want the most up-to-date table, or do nothing if they do not want to include newly arrived data. I will start investigating the proper way to implement this.
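To make the proposed guarantee concrete, here is a minimal Python sketch of "keep only the newest base file in each file group". It is an illustration only: `BaseFile` and `latest_base_files` are hypothetical names, not part of Hudi or Iceberg, and the sketch assumes instant times are equal-width timestamp strings so that string comparison matches chronological order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseFile:
    # Hypothetical record for illustration; not a Hudi or Iceberg class.
    partition: str
    file_group_id: str
    commit_time: str  # assumed equal-width timestamp string, e.g. "20230101000000"
    path: str


def latest_base_files(files):
    """Return the newest base file of each (partition, file group), sorted by path."""
    latest = {}
    for f in files:
        key = (f.partition, f.file_group_id)
        current = latest.get(key)
        # Keep the file whose commit time is the largest seen so far for this group.
        if current is None or f.commit_time > current.commit_time:
            latest[key] = f
    return sorted(latest.values(), key=lambda f: f.path)


if __name__ == "__main__":
    files = [
        BaseFile("p1", "fg1", "20230101000000", "p1/fg1_a.parquet"),
        BaseFile("p1", "fg1", "20230102000000", "p1/fg1_b.parquet"),
        BaseFile("p1", "fg2", "20230101000000", "p1/fg2_a.parquet"),
    ]
    for f in latest_base_files(files):
        print(f.path)
```

Under this model, rerunning the migration after a new compaction simply recomputes the set of files, which matches the "fully replace the previous version" behavior suggested above.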
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This PR is under construction, but I want to put it here for some initial feedback and discussion about the conversion from Apache Hudi to Apache Iceberg.
Overview
This PR aims to add a module called `iceberg-hudi`, which contains the public API and a base implementation to snapshot a Hudi table to an Iceberg table. The base implementation is expected to rely on the `hudi-common` module to extract metadata, the timeline, locations of data files, and other information necessary for the conversion.
`copy-on-write` (COW) and `merge-on-read` (MOR) are the two types of Hudi table. As the initial implementation, this PR will focus on the conversion logic for COW tables.
The overall structure of the module is expected to be similar to #6449. However, things may change because Hudi differs from Delta Lake. Also, due to the complexity of the conversion, I may make a proposal later for further discussion in the community.
High-level Ideas
The base implementation of the snapshot action involves schema conversion and timeline replay. The idea here is to map every completed `COMMIT` action on the timeline to an Iceberg snapshot. The concept of `COW` can be mapped to the `overwrite` operation in Iceberg. In other words, for every update of a data file, we will `delete` the previous version of the data file and `add` the newly created data file to the Iceberg table.
Need Further Investigations:
`COMPACTION` action on the timeline
Dependency Issue
`hudi-common:0.12.2` has a dependency conflict with `hudi-spark-bundle:0.12.2`, which is intended to be used for integration tests. Version 0.12.0 is currently used to work around it.
If you have any comments or suggestions on how to convert Hudi to Iceberg, please feel free to share them here. Thank you in advance.
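For reference, the commit-to-snapshot mapping described under High-level Ideas could be sketched roughly as follows. This is a simplified Python model, not the module's implementation: `Commit`, `Snapshot`, and `replay` are hypothetical names, and the sketch assumes completed commits can be totally ordered by instant time, which is exactly the assumption the discussion above found hard to guarantee.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    # Hypothetical stand-in for one completed COMMIT on the Hudi timeline.
    instant_time: str
    written_files: dict  # file_group_id -> path of the base file this commit wrote


@dataclass
class Snapshot:
    # Hypothetical stand-in for one Iceberg snapshot produced by an overwrite.
    instant_time: str
    deleted: list
    added: list


def replay(commits):
    """Map each commit to one overwrite: delete the old file version, add the new one."""
    live = {}  # file_group_id -> path of the current base file
    snapshots = []
    for commit in sorted(commits, key=lambda c: c.instant_time):
        deleted, added = [], []
        for group, path in sorted(commit.written_files.items()):
            if group in live:
                # A rewritten file group replaces its previous base file.
                deleted.append(live[group])
            live[group] = path
            added.append(path)
        snapshots.append(Snapshot(commit.instant_time, deleted, added))
    return snapshots, sorted(live.values())


if __name__ == "__main__":
    commits = [
        Commit("001", {"fg1": "fg1_v1.parquet", "fg2": "fg2_v1.parquet"}),
        Commit("002", {"fg1": "fg1_v2.parquet"}),
    ]
    snapshots, live_files = replay(commits)
    for s in snapshots:
        print(s.instant_time, "delete", s.deleted, "add", s.added)
    print("live files:", live_files)
```

The final set of live files is the same set that a "latest base file per file group" migration would produce; the replay only adds the intermediate snapshot history.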