[feature](external) Support reading Hudi/Paimon/Iceberg tables after schema changes. #51341
Conversation
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
run buildall

Cloud UT Coverage Report: increment line coverage report

TPC-H: Total hot run time: 33718 ms
TPC-DS: Total hot run time: 192968 ms
ClickBench: Total hot run time: 29.76 s

FE UT Coverage Report: increment line coverage

TPC-H: Total hot run time: 33771 ms
TPC-DS: Total hot run time: 185508 ms
ClickBench: Total hot run time: 28.99 s

Force-pushed from 8b8c15d to ac0900b

TPC-H: Total hot run time: 33804 ms
TPC-DS: Total hot run time: 185788 ms
ClickBench: Total hot run time: 29.33 s

TPC-H: Total hot run time: 33787 ms
TPC-DS: Total hot run time: 185136 ms
ClickBench: Total hot run time: 28.38 s

BE UT Coverage Report: increment line coverage report

TPC-H: Total hot run time: 33931 ms
TPC-DS: Total hot run time: 184797 ms
ClickBench: Total hot run time: 29.5 s
morningman
left a comment
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
kaka11chen
left a comment
LGTM
…schema changes. (apache#51341)

### What problem does this PR solve?

Related PR: apache#49051

Problem Summary: Support reading Hudi, Paimon, and Iceberg tables after the internal schema of a struct has changed.

1. Introduce `hive_reader` to avoid confusion between the `hive` and `parquet/orc` readers.
2. Before this, support for reading tables after schema changes of ordinary columns relied on renaming columns in the block so that the parquet/orc reader could read the specific file columns in `get_next_block`; as a result, the `hudi/iceberg/paimon reader` would mix `file column names` with `table column names` when using the parquet/orc reader. This PR clarifies that all calls to the `parquet/orc reader` are based on `table column names`, and introduces `TableSchemaChangeHelper::Node` to help the `parquet/orc reader` find the specific file columns to read.
### What problem does this PR solve?

Related PR: #51341

Problem Summary: In PR #51341, `hudiOrcReader` was deleted; this PR reintroduces it to read Hudi ORC tables. Although this error was encountered when testing spark-hudi reading ORC, the ORC file was indeed generated by spark-hudi:

```
java.lang.UnsupportedOperationException: Base file format is not currently supported (ORC)
	at org.apache.hudi.HoodieBaseRelation.createBaseFileReader(HoodieBaseRelation.scala:574) ~[hudi-spark3.4-bundle_2.12-0.14.0-1.jar:0.14.0-1]
	at org.apache.hudi.BaseFileOnlyRelation.composeRDD(BaseFileOnlyRelation.scala:96) ~[hudi-spark3.4-bundle_2.12-0.14.0-1.jar:0.14.0-1]
	at org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:381) ~[hudi-spark3.4-bundle_2.12-0.14.0-1.jar:0.14.0-1]
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:329) ~[spark-sql_2.12-3.4.2.jar:0.14.0-1]
```
…on version. (#53055)

### What problem does this PR solve?

Related PR: #51341

Problem Summary: In PR #51341, the Docker Paimon was upgraded from version 0.8 to 1.0.1. Since the required JAR files are pulled from a Maven repository, some machines may not be able to access the repository. To fix this, the JAR file has been uploaded to object storage, ensuring that it can be reliably accessed across different environments.
What problem does this PR solve?
Related PR: #49051
Problem Summary:
Support reading Hudi, Paimon, and Iceberg tables after the internal schema of a struct has changed.

1. Introduce `hive_reader` to avoid confusion between the `hive` and `parquet/orc` readers.
2. Before this, support for reading tables after schema changes of ordinary columns relied on renaming columns in the block so that the parquet/orc reader could read the specific file columns in `get_next_block`; as a result, the `hudi/iceberg/paimon reader` would mix `file column names` with `table column names` when using the parquet/orc reader. This PR clarifies that all calls to the `parquet/orc reader` are based on `table column names`, and introduces `TableSchemaChangeHelper::Node` to help the `parquet/orc reader` find the specific file columns to read.

Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merges this PR)