[enhancement](hudi)support native read hudi top level schema change table. #49051
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
TeamCity cloud ut coverage result: |
TPC-H: Total hot run time: 32485 ms |
TPC-DS: Total hot run time: 192091 ms |
ClickBench: Total hot run time: 31.49 s |
|
run buildall |
|
TeamCity cloud ut coverage result: |
TPC-H: Total hot run time: 32527 ms |
TPC-DS: Total hot run time: 193035 ms |
ClickBench: Total hot run time: 31.45 s |
|
run buildall |
|
TeamCity cloud ut coverage result: |
TPC-H: Total hot run time: 32573 ms |
TPC-DS: Total hot run time: 191672 ms |
ClickBench: Total hot run time: 31.87 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
| std::vector<uint64_t>* col_attributes, | ||
| std::string attribute) { | ||
| std::vector<int32_t>* col_attributes, | ||
| std::string attribute, bool& exist_attribute) { |
| std::string attribute, bool& exist_attribute) { | |
| std::string& attribute, bool* exist_attribute) { |
Better to use a pointer for the output parameter.
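The suggestion follows a common C++ convention: inputs are passed by const reference and outputs by pointer, so the call site shows `&flag` wherever a value is written. Below is a minimal sketch of that convention only. `FakeColumn` and `collect_attribute` are hypothetical names, not the actual reader code, and taking `attribute` by const reference is an assumption rather than part of the suggestion.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a file column that carries string attributes.
struct FakeColumn {
    std::string name;
    std::map<std::string, std::string> attributes;
};

// Inputs are const references; outputs are pointers, so callers must write
// &names, &ids, and &exist_attribute, which makes the mutation visible.
// If any column lacks `attribute`, *exist_attribute is set to false and both
// output vectors are left empty.
void collect_attribute(const std::vector<FakeColumn>& columns, const std::string& attribute,
                       std::vector<std::string>* col_names,
                       std::vector<int32_t>* col_attributes, bool* exist_attribute) {
    assert(col_names != nullptr && col_attributes != nullptr && exist_attribute != nullptr);
    col_names->clear();
    col_attributes->clear();
    *exist_attribute = true;
    for (const FakeColumn& col : columns) {
        auto it = col.attributes.find(attribute);
        if (it == col.attributes.end()) {
            // Attribute missing: report it through the flag instead of throwing.
            *exist_attribute = false;
            col_names->clear();
            col_attributes->clear();
            return;
        }
        col_names->push_back(col.name);
        col_attributes->push_back(static_cast<int32_t>(std::stoi(it->second)));
    }
}
```

A caller would write `collect_attribute(cols, "iceberg.id", &names, &ids, &exists);` and check `exists` before using the vectors.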
| // Used in from_thrift, marking the next schema position that should be parsed | ||
| size_t _next_schema_pos; | ||
| std::unordered_map<uint64_t, std::string> _field_id_name_mapping; | ||
| std::map<int32_t, std::string> _field_id_name_mapping; |
Why change to std::map?
To be compatible with the TableSchemaChangeHelper interface.
| return Status::OK(); | ||
| } | ||
|
|
||
| Status TableSchemaChangeHelper::get_next_block_after(Block* block) const { |
Do we need to add a timer here to see how much this costs when a schema change is involved?
I don't think it's necessary. It doesn't add much performance overhead; it only replaces the column names.
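To illustrate why a rename step is cheap, here is a minimal sketch under stated assumptions. `NamedColumn` and `rename_columns` are hypothetical, simplified stand-ins and do not reflect the real `Block` API; the point is only that the work scales with the number of columns, not the number of rows.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified column: only the name is touched by the rename.
struct NamedColumn {
    std::string name;
    std::vector<int64_t> values;
};

// Renames columns in place according to `mapping`. Columns without an entry
// keep their current name. The cost is one hash lookup per column and does
// not depend on how many rows each column holds.
void rename_columns(std::vector<NamedColumn>* block,
                    const std::unordered_map<std::string, std::string>& mapping) {
    assert(block != nullptr);
    for (NamedColumn& col : *block) {
        auto it = mapping.find(col.name);
        if (it != mapping.end()) {
            col.name = it->second;
        }
    }
}
```

In this sketch a reader would apply one mapping before reading a file and the inverse mapping afterwards, so the row data is never copied.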
| } | ||
| }; | ||
|
|
||
| class TableSchemaChangeHelper { |
This class needs unit tests.
| super(schema, partitionColumns); | ||
| this.enableSchemaEvolution = enableSchemaEvolution; | ||
| if (enableSchemaEvolution) { | ||
| historySchemaCache = InternalSchemaCache.getHistoricalSchemas(hudiClient); |
I suggest fetching this historySchemaCache outside the HudiSchemaCacheValue constructor.
Keep this cache value simple.
| } else { | ||
| HudiSchemaCacheValue hudiSchemaCacheValue = HudiUtils.getSchemaCacheValue(hmsTable); | ||
| if (hudiSchemaCacheValue.isEnableSchemaEvolution()) { | ||
| long commitInstantTime = Long.parseLong(FSUtils.getCommitTime( |
This is too heavy to call for each split.
|
run buildall |
|
TeamCity cloud ut coverage result: |
TPC-H: Total hot run time: 32616 ms |
…able. (apache#49051) ### What problem does this PR solve? Similar to pr apache#48723 Problem Summary: 1. Supports native reader reading tables after the top-level schema of hudi is changed, but does not support tables after the internal schema of struct is changed. change internal schema of struct schema(not support, will support in the next PR). 2. Unify the logic of iceberg/paimon/hudi native reader to handle schema change's table.
…schema changes. (#51341) ### What problem does this PR solve? Related PR: #49051 Problem Summary: Support reading Hudi and Paimon Iceberg tables after the internal schema of struct is changed. 1. Introduce `hive_reader` to avoid confusion between `hive` and `parquet/orc` reader 2. Before this, support for reading tables after schema changes of ordinary columns relied on changing the column name in block, so that parquet/orc reader can read specific file columns when `get_next_block`, and `hudi/iceberg/paimon reader` will mix `file column names` with `table column names` when using parquet/orc reader. This pr clarifies that all calls to `parquet/orc reader` are based on the concept of `table column names`, and then introduces `TableSchemaChangeHelper::Node` to help `parquet/orc reader` find the specific file columns to be read.
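The commit message above describes a node structure that lets the reader find file columns from table column names, including inside structs. The sketch below shows one way such a tree could work. `SchemaNode` and `resolve_file_path` are hypothetical names and layouts chosen for illustration; they are not the actual `TableSchemaChangeHelper::Node` definition.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical tree: each node maps a table-side child name to the file-side
// name plus the child's own node (nullptr for scalar leaves).
struct SchemaNode {
    struct Child {
        std::string file_name;
        std::shared_ptr<SchemaNode> node;
    };
    std::map<std::string, Child> children;  // keyed by table column name
};

// Translates a path of table names (for example {"addr", "city"}) into the
// matching path of file names. Returns std::nullopt if any step is missing
// from the file, so the caller can decide how to fill that column.
std::optional<std::vector<std::string>> resolve_file_path(
        const SchemaNode& root, const std::vector<std::string>& table_path) {
    std::vector<std::string> file_path;
    const SchemaNode* current = &root;
    for (const std::string& table_name : table_path) {
        if (current == nullptr) {
            return std::nullopt;  // the path descends below a scalar leaf
        }
        auto it = current->children.find(table_name);
        if (it == current->children.end()) {
            return std::nullopt;  // the file has no column for this name
        }
        file_path.push_back(it->second.file_name);
        current = it->second.node.get();
    }
    return file_path;
}
```

With a tree like this, callers can keep using table column names everywhere and translate to file names only at the point where the file is read.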
…rg.id orc file.(#49051) (#54167) ### What problem does this PR solve? pick #49051 but only fix: ``` terminate called after throwing an instance of 'std::range_error' what(): Key not found: iceberg.id *** Query id: 6a93d7cdc9f44370-a40b07934a14c81b *** *** is nereids: 1 *** *** tablet id: 0 *** *** Aborted at 1753842428 (unix time) try "date -d @1753842428" if you are using GNU date *** *** Current BE git commitID: 910c424 *** *** SIGABRT unknown detail explain (@0x5a46f) received by PID 369775 (TID 371694 OR 0x7fad067ef640) from PID 369775; stack trace: *** terminate called recursively terminate called recursively terminate called recursively terminate called recursively 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421 1# 0x00007FB12263EBF0 in /lib64/libc.so.6 2# __pthread_kill_implementation in /lib64/libc.so.6 3# gsignal in /lib64/libc.so.6 4# abort in /lib64/libc.so.6 5# __gnu_cxx::__verbose_terminate_handler() [clone .cold] at ../../../../libstdc++-v3/libsupc++/vterminate.cc:75 6# __cxxabiv1::__terminate(void (*)()) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48 7# 0x000055C047B28EC1 in /opt/apache-doris-3.0.6.2-bin-x64/be/lib/doris_be 8# 0x000055C047B29014 in /opt/apache-doris-3.0.6.2-bin-x64/be/lib/doris_be 9# orc::TypeImpl::getAttributeValue(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const in /opt/apache-doris-3.0.6.2-bin-x64/be/lib/doris_be 10# doris::vectorized::OrcReader::get_schema_col_name_attribute(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, std::vector<unsigned long, std::allocator<unsigned long> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) at 
/home/zcp/repo_center/doris_release/doris/be/src/vec/exec/format/orc/vorc_reader.cpp:332 11# doris::vectorized::IcebergOrcReader::_gen_col_name_maps(doris::vectorized::OrcReader*) at ```
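The trace above ends in an uncaught `std::range_error` raised while looking up the `iceberg.id` attribute on a file that does not carry it. The sketch below shows the general guard pattern: check for the key first and report absence as a value instead of letting the exception escape. `FakeType` and `try_get_attribute` are hypothetical stand-ins that only mimic the throwing behavior seen in the trace; they are not the ORC library API and not the actual fix.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a schema type whose attribute lookup throws on a
// missing key, mirroring the "Key not found" error in the trace.
class FakeType {
public:
    void set_attribute(const std::string& key, const std::string& value) {
        _attrs[key] = value;
    }
    bool has_attribute(const std::string& key) const { return _attrs.count(key) > 0; }
    std::string get_attribute(const std::string& key) const {
        auto it = _attrs.find(key);
        if (it == _attrs.end()) {
            throw std::range_error("Key not found: " + key);
        }
        return it->second;
    }

private:
    std::map<std::string, std::string> _attrs;
};

// Returns the attribute value, or std::nullopt when the key is absent, so a
// file written without the attribute cannot terminate the process.
std::optional<std::string> try_get_attribute(const FakeType& type, const std::string& key) {
    if (!type.has_attribute(key)) {
        return std::nullopt;
    }
    return type.get_attribute(key);
}
```

A caller that receives `std::nullopt` can then fall back to another strategy, such as matching columns by name, which is an assumption about the recovery path rather than something the trace confirms.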
What problem does this PR solve?
Similar to pr #48723
Problem Summary:
1. Support native reader reads of Hudi tables after a top-level schema change. Tables whose struct columns have changed internally are not supported yet; that will be added in the next PR.
2. Unify the logic that the Iceberg, Paimon, and Hudi native readers use to handle tables with schema changes.
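As a rough illustration of top-level schema change handling, the sketch below matches table columns to file columns through field ids, using the same `std::map<int32_t, std::string>` shape that appears in the review diff. Matching by field id and the function name `build_table_to_file_names` are assumptions made for this example, not a description of the merged code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Builds a table-name -> file-name mapping for every column whose field id
// appears in both schemas. A table column whose id is absent from the file
// (for example, one added after the file was written) is omitted, leaving the
// caller to decide how to fill it.
std::map<std::string, std::string> build_table_to_file_names(
        const std::map<int32_t, std::string>& table_id_to_name,
        const std::map<int32_t, std::string>& file_id_to_name) {
    std::map<std::string, std::string> table_to_file;
    for (const auto& [field_id, table_name] : table_id_to_name) {
        auto it = file_id_to_name.find(field_id);
        if (it != file_id_to_name.end()) {
            table_to_file.emplace(table_name, it->second);
        }
    }
    return table_to_file;
}
```

For example, if field id 2 is called `full_name` in the current table schema and `name` in an older file, the result maps `full_name` to `name`, which covers a column rename without rewriting the file.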
Release note
Support native reader reads of Hudi tables after a top-level schema change.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)