
WIP: Support Snapshot Copy-On-Write Hudi Table to Iceberg Table #6642

Closed

Conversation

@JonasJ-ap (Contributor) commented Jan 22, 2023

This PR is under construction, but I want to put it here for some initial feedback and discussion about the conversion from Apache Hudi to Apache Iceberg.

Overview

This PR aims to add a module called iceberg-hudi, which contains a public API and a base implementation to snapshot a Hudi table to an Iceberg table. The base implementation is expected to rely on the hudi-common module to extract the metadata, timeline, data file locations, and other information necessary for the conversion.
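
As a rough illustration, the public API could look something like the sketch below, modeled after the Delta Lake snapshot action in #6449. All class and method names here are hypothetical, not the final API:

```java
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Hypothetical shape of the public snapshot action; names are illustrative only.
public interface SnapshotHudiTable {

  // sets the identifier of the Iceberg table to be created
  SnapshotHudiTable as(TableIdentifier identifier);

  // sets the Iceberg catalog in which the new table is created
  SnapshotHudiTable icebergCatalog(Catalog catalog);

  // adds a table property to the new Iceberg table
  SnapshotHudiTable tableProperty(String name, String value);

  // runs the conversion and returns summary information
  Result execute();

  interface Result {
    // number of data files registered in the new Iceberg table
    long snapshotDataFilesCount();
  }
}
```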

Copy-on-write (COW) and merge-on-read (MOR) are the two types of Hudi tables. As the initial implementation, this PR focuses on the conversion logic for COW tables.

The overall structure of the module is expected to be similar to #6449. However, things may change since Hudi is different from Delta Lake. Also, due to the complexity of the conversion, I may write a proposal later for further discussion in the community.

High-level Ideas

The base implementation of the snapshot action involves schema conversion and timeline replay. The idea is to map every completed COMMIT action on the timeline to an Iceberg snapshot. The COW concept maps to Iceberg's overwrite operation: for every update of a data file, we delete the previous version of the data file and add the newly created data file to the Iceberg table.
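
For example, a single completed COMMIT could be replayed as one Iceberg overwrite roughly like this. This is a minimal sketch: deriving the removed/added DataFile instances from the Hudi commit metadata is the actual work of the conversion.

```java
import java.util.List;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.OverwriteFiles;
import org.apache.iceberg.Table;

// replay one completed Hudi COMMIT as one Iceberg snapshot
void replayCommit(Table icebergTable, List<DataFile> removedFiles, List<DataFile> addedFiles) {
  OverwriteFiles overwrite = icebergTable.newOverwrite();
  removedFiles.forEach(overwrite::deleteFile); // drop superseded file versions
  addedFiles.forEach(overwrite::addFile);      // register newly written files
  overwrite.commit();                          // produces one Iceberg snapshot
}
```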

Needs Further Investigation:

  1. How to handle the COMPACTION action on the timeline
  2. We may take advantage of the column_stats index stored in Hudi's metadata table rather than using FileIO to extract metrics from data files (see the sketch below).
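
For item 2, a sketch of what this could look like on the Iceberg side. Reading the stats out of Hudi's metadata table (the HudiColumnStats carrier and its population) is hypothetical; Metrics and DataFiles.Builder are the existing Iceberg API:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.PartitionSpec;

// hypothetical carrier for stats read from Hudi's column_stats metadata index,
// assumed to be already keyed by Iceberg field id
interface HudiColumnStats {
  Long rowCount();
  Map<Integer, Long> valueCounts();
  Map<Integer, Long> nullValueCounts();
  Map<Integer, ByteBuffer> lowerBounds();
  Map<Integer, ByteBuffer> upperBounds();
}

// builds an Iceberg DataFile without re-reading the file footer through FileIO
DataFile toDataFile(PartitionSpec spec, String path, long sizeInBytes, HudiColumnStats stats) {
  Metrics metrics =
      new Metrics(
          stats.rowCount(),
          null, // column sizes: not tracked in column_stats
          stats.valueCounts(),
          stats.nullValueCounts(),
          null, // NaN value counts: not tracked
          stats.lowerBounds(),
          stats.upperBounds());
  return DataFiles.builder(spec)
      .withPath(path)
      .withFileSizeInBytes(sizeInBytes)
      .withMetrics(metrics)
      .build();
}
```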

Dependency Issue

  1. hudi-common:0.12.2 has a dependency conflict with hudi-spark-bundle:0.12.2, which is intended to be used for integration tests. Version 0.12.0 is currently used as a workaround (see the sketch below).
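
A sketch of the workaround in the build file; the configuration names and bundle artifact are assumptions that depend on how the iceberg-hudi module is wired up:

```groovy
dependencies {
  // pinned to 0.12.0 until the conflict with the 0.12.2 Spark bundle is resolved
  implementation 'org.apache.hudi:hudi-common:0.12.0'

  // used only by integration tests
  testImplementation 'org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2'
}
```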

If you have any comments or suggestions on how to convert Hudi to Iceberg, please feel free to share them here. Thank you in advance.

@github-actions github-actions bot added the build label Jan 22, 2023
@jackye1995 jackye1995 self-requested a review January 22, 2023 08:05
@github-actions github-actions bot added the INFRA label Feb 2, 2023
@JonasJ-ap (Contributor, Author) commented:

[Curiosity] Which one is preferred for hudi-related variable names: hudixxxx or hoodiexxx

@jackye1995 (Contributor) commented:

> [Curiosity] Which one is preferred for hudi-related variable names: hudixxxx or hoodiexxx

Hudi should be used; Hoodie was the name used before the official project name was adopted.

@jackye1995 (Contributor) commented:

Took a brief look. Overall I agree with where the community discussion led: replaying the timeline is cool, but Hudi's concurrent transactions have awkward behavior, and we cannot guarantee the replay is always correct.

Instead, we can ask users to always compact before migration, so that we only need to offer the ability to migrate the latest compacted state of the Hudi table.

When new data comes in, the user can rerun compaction and then rerun this migration action, which will fully replace the previous version of the migrated Iceberg table.

This approach will work for both CoW and MoR tables, and we can add timeline replay as a follow-up feature if necessary. A possible user-facing flow is sketched below.
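
Hypothetically, the flow could look like this; the names mirror the Delta Lake actions provider from #6449 and are illustrative only:

```java
// 1. compact the Hudi table (done on the Hudi side, e.g. via a Spark job)
// 2. snapshot the latest compacted state into Iceberg:
SnapshotHudiTable.Result result =
    actionsProvider
        .snapshotHudiTable("s3://bucket/warehouse/hudi_table")
        .as(TableIdentifier.of("db", "migrated_table"))
        .icebergCatalog(catalog)
        .execute();
// 3. when new data arrives: compact again, rerun the action, and the migrated
//    Iceberg table is fully replaced with the newly compacted state
```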

What do you think?

@JonasJ-ap (Contributor, Author) commented:

Sounds good! Thank you for your suggestions. After the last community sync, I re-investigated the whole process of this demo and realized that there is no proper way to guarantee the correct order of timeline replay.

In the new proposal, the migration guarantee is that we always migrate the table at the state of the latest COMPACTION. In other words, we only include the latest base file in each file group of the Hudi table. This way, users can choose to run compaction before the migration if they want the most up-to-date table, or do nothing if they do not want to include newly arrived data.

I will start to investigate the proper way to implement this.
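
A rough sketch of how the latest base files could be collected with hudi-common, assuming the 0.12.x API:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hudi.common.model.HoodieBaseFile;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.view.HoodieTableFileSystemView;

// returns the latest base file of every file group in one partition; the union
// over all partitions is the state the snapshot action would migrate
List<HoodieBaseFile> latestBaseFiles(HoodieTableMetaClient metaClient, String partitionPath) {
  HoodieTableFileSystemView fsView =
      new HoodieTableFileSystemView(
          metaClient,
          metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants());
  return fsView.getLatestBaseFiles(partitionPath).collect(Collectors.toList());
}
```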

@github-actions (bot) commented:

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 24, 2024
@github-actions (bot) commented:

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 12, 2024