[feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. #38432
clang-tidy review says "All clean, LGTM! 👍" |
TPC-H: Total hot run time: 39639 ms |
TPC-DS: Total hot run time: 173267 ms |
ClickBench: Total hot run time: 30.81 s |
TPC-H: Total hot run time: 39625 ms |
TPC-DS: Total hot run time: 173012 ms |
ClickBench: Total hot run time: 30.48 s |
TPC-H: Total hot run time: 39823 ms |
TPC-DS: Total hot run time: 172930 ms |
ClickBench: Total hot run time: 30.46 s |
PR approved by anyone and no changes requested. |
…es. (apache#38432)

Add `hive_parquet_use_column_names` and `hive_orc_use_column_names` session variables to read a table after a column has been renamed in `Hive`.

These two session variables are modeled on `parquet_use_column_names` and `orc_use_column_names` of the `Trino` hive connector.

By default, both session variables are true. When they are set to false, reading orc/parquet accesses columns by their ordinal position in the Hive table definition.

For example:
```mysql
in Hive :
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp change column a new_a int;
hive> insert into table tmp values(2,"4");

in Doris :
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
```

You can use `set parquet.column.index.access/orc.force.positional.evolution = true/false` in Hive 3 to control the result of reading the table in the same way as these two session variables. However, for a Parquet table in which a field inside a struct column has been renamed, Hive and Doris behave differently.
…es. (#38432) (#38809)

bp #38432
…rtition tb cause be core. (#49966)

### What problem does this PR solve?

Related PR: #38432

Problem Summary:
When you query a partitioned Hive table in Parquet format with `set hive_parquet_use_column_names = false`, the BE may abort with:
```
*** SIGABRT unknown detail explain (@0x2f59de) received by PID 3103198 (TID 3110278 OR 0x7f51c8e63640) from PID 3103198; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_master/doris/be/src/common/signal_handler.h:421
 1# 0x00007F55DFB45520 in /lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill at ./nptl/pthread_kill.c:89
 3# raise at ../sysdeps/posix/raise.c:27
 4# abort at ./stdlib/abort.c:81
 5# __gnu_cxx::__verbose_terminate_handler() [clone .cold] at ../../../../libstdc++-v3/libsupc++/vterminate.cc:75
 6# __cxxabiv1::__terminate(void (*)()) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
 7# 0x000055C8BD4E2041 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
 8# 0x000055C8BD4E2194 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
 9# 0x000055C8BD4E2586 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
10# std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.tcc:265
11# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:586
```
The cause is an out-of-bounds access that occurs when `get_next_block` replaces the column names.
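The commit message does not say which index went out of bounds, so the following is only a rough Python illustration of a bounds-checked positional rename, not the change made in #49966. It assumes (this is not stated in the source) that the block can carry more columns than the data file stores, for instance a partition column. `rename_by_position`, `block_columns`, and `file_columns` are hypothetical names.

```python
from typing import List


def rename_by_position(block_columns: List[str], file_columns: List[str]) -> List[str]:
    """Map each block column to the file column at the same position.

    Columns whose position lies beyond the end of the file schema keep
    their own name instead of indexing past the end of file_columns.
    """
    renamed = []
    for position, name in enumerate(block_columns):
        if position < len(file_columns):
            renamed.append(file_columns[position])
        else:
            # Assumption: this column (e.g. a partition column) is not
            # stored in the data file, so there is nothing to rename it to.
            renamed.append(name)
    return renamed


# Hypothetical schema: two data columns plus one column absent from the file.
block_columns = ["new_a", "b", "dt"]
file_columns = ["a", "b"]
renamed = rename_by_position(block_columns, file_columns)
```

Without the `position < len(file_columns)` check, the third column would index past the end of the file schema.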
…o read files, there will be multiple threads modify same object (#50161)

### What problem does this PR solve?

Related PR: #38432

Problem Summary:
In PR #38432, if the Parquet reader reads a file by column index and the file's column names differ from the table's column names, the reader modifies `_colname_to_value_range`. However, this object is shared by multiple vfile scanners, so concurrent modification from multiple threads can crash the BE.
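One common way to avoid this kind of race is for each reader to build its own renamed copy of the shared map instead of editing it in place. The Python sketch below illustrates that idea only; the source does not describe how #50161 actually fixed the problem, and `remap_ranges`, `shared_ranges`, and `table_to_file` are hypothetical names.

```python
import threading
from typing import Dict, List, Tuple


def remap_ranges(shared_ranges: Dict[str, Tuple], table_to_file: Dict[str, str]) -> Dict[str, Tuple]:
    """Return a new dict keyed by file column names.

    The shared input dict is only read, never modified, so many readers
    can call this at the same time.
    """
    return {table_to_file.get(name, name): rng for name, rng in shared_ranges.items()}


# Value ranges keyed by table column name, shared by all readers.
shared_ranges = {"new_a": (0, 10), "b": ("a", "z")}
# Hypothetical mapping from table column name to the name stored in the file.
table_to_file = {"new_a": "a"}

results: List[Dict[str, Tuple]] = []
results_lock = threading.Lock()


def reader() -> None:
    local_ranges = remap_ranges(shared_ranges, table_to_file)  # private copy
    with results_lock:
        results.append(local_ranges)


threads = [threading.Thread(target=reader) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread ends up with its own map keyed by file column names, while `shared_ranges` keeps its original keys.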
## Proposed changes

Add `hive_parquet_use_column_names` and `hive_orc_use_column_names` session variables to read a table after a column has been renamed in `Hive`.

These two session variables are modeled on `parquet_use_column_names` and `orc_use_column_names` of the `Trino` hive connector.

By default, both session variables are true. When they are set to false, reading orc/parquet accesses columns by their ordinal position in the Hive table definition.

For example, after `alter table tmp change column a new_a int` in Hive, a row written before the rename reads `new_a` as NULL when `hive_parquet_use_column_names=true`, and as its stored value when the variable is false.

You can use `set parquet.column.index.access/orc.force.positional.evolution = true/false` in Hive 3 to control the result of reading the table in the same way as these two session variables. However, for a Parquet table in which a field inside a struct column has been renamed, Hive and Doris behave differently.
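The difference between the two modes can be sketched in a few lines of Python. This is a simplified model of name-based versus position-based column resolution, not Doris code; `read_rows`, `scan`, and the in-memory "files" are hypothetical and mirror the `tmp` table from the PR's example (one file written before the rename, one after).

```python
from typing import Any, Dict, List, Sequence, Tuple


def read_rows(
    table_columns: Sequence[str],
    file_columns: Sequence[str],
    file_rows: Sequence[Tuple[Any, ...]],
    use_column_names: bool,
) -> List[Dict[str, Any]]:
    """Project the rows of one data file onto the current table schema."""
    result = []
    for row in file_rows:
        out: Dict[str, Any] = {}
        for position, table_col in enumerate(table_columns):
            if use_column_names:
                # Name-based: a table column missing from the file reads as NULL.
                if table_col in file_columns:
                    out[table_col] = row[file_columns.index(table_col)]
                else:
                    out[table_col] = None
            else:
                # Position-based: use the ordinal position in the table definition.
                out[table_col] = row[position] if position < len(row) else None
        result.append(out)
    return result


table_columns = ["new_a", "b"]            # schema after the rename
old_file = (["a", "b"], [(1, "2")])       # written before the rename
new_file = (["new_a", "b"], [(2, "4")])   # written after the rename


def scan(use_column_names: bool) -> List[Dict[str, Any]]:
    rows: List[Dict[str, Any]] = []
    for file_columns, file_rows in (old_file, new_file):
        rows += read_rows(table_columns, file_columns, file_rows, use_column_names)
    return rows


by_name = scan(True)       # first row has new_a = None
by_position = scan(False)  # first row has new_a = 1
```

In this model the name-based scan cannot find `new_a` in the older file and returns NULL for it, while the position-based scan returns the value stored in the first column.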