@hubgeter hubgeter commented Jul 26, 2024

Proposed changes

Add the hive_parquet_use_column_names and hive_orc_use_column_names session variables so that Doris can read Hive tables correctly after a column has been renamed in Hive.

These two session variables are modeled on parquet_use_column_names and orc_use_column_names in the Trino Hive connector.

Both session variables default to true. When one is set to false, the corresponding ORC/Parquet reader accesses columns by their ordinal position in the Hive table definition instead of by name.

For example:

In Hive:
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp  change column  a new_a int;
hive> insert into table tmp values(2,"4");

In Doris:
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
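
The two results above can be reproduced with a small simulation. This is only an illustrative sketch, not Doris's reader code: `read_rows` and the dictionaries standing in for Parquet files are hypothetical. The first file was written before the rename, so its physical column is still named `a`. Name-based matching looks for `new_a`, does not find it, and yields NULL, while position-based matching takes the first column of each file regardless of its name.

```python
# Illustrative sketch only: `read_rows` and the file layout below are hypothetical
# stand-ins, not the Doris or Parquet API.

# Current Hive table definition (after `a` was renamed to `new_a`).
TABLE_COLUMNS = ["new_a", "b"]

# Each data file keeps the column names that were in effect when it was written.
FILES = [
    {"columns": ["a", "b"], "rows": [(1, "2")]},      # written before the rename
    {"columns": ["new_a", "b"], "rows": [(2, "4")]},  # written after the rename
]


def read_rows(files, table_columns, use_column_names):
    """Return table rows, matching file columns by name or by ordinal position."""
    result = []
    for f in files:
        for row in f["rows"]:
            out = []
            for position, name in enumerate(table_columns):
                if use_column_names:
                    # Name-based: a column missing from the file reads as NULL (None).
                    idx = f["columns"].index(name) if name in f["columns"] else None
                else:
                    # Position-based: ignore the file's column names entirely.
                    idx = position if position < len(f["columns"]) else None
                out.append(row[idx] if idx is not None else None)
            result.append(tuple(out))
    return result


by_name = read_rows(FILES, TABLE_COLUMNS, use_column_names=True)
by_position = read_rows(FILES, TABLE_COLUMNS, use_column_names=False)
print(by_name)      # [(None, '2'), (2, '4')]
print(by_position)  # [(1, '2'), (2, '4')]
```

Position-based access only gives sensible results when columns are renamed in place; it is not a safe choice if columns have been reordered or dropped.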

In Hive 3, you can set parquet.column.index.access and orc.force.positional.evolution to true or false to control how the table is read, much like these two session variables. However, for a Parquet table in which a field inside a struct column has been renamed, Hive and Doris produce different results.

@hubgeter

run buildall

@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39639 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 29dadddc119b8f8a21b4bfdeb6a79104f8373165, data reload: false

------ Round 1 ----------------------------------
q1	18210	5137	4283	4283
q2	2016	202	195	195
q3	10502	1164	1144	1144
q4	10155	737	717	717
q5	7514	2709	2672	2672
q6	219	136	138	136
q7	961	601	595	595
q8	9212	1911	1941	1911
q9	8828	6601	6613	6601
q10	8709	3801	3804	3801
q11	521	248	256	248
q12	397	230	223	223
q13	18936	3005	3022	3005
q14	286	234	241	234
q15	517	485	484	484
q16	492	414	394	394
q17	991	706	676	676
q18	8033	7453	7448	7448
q19	4766	1075	1045	1045
q20	666	341	340	340
q21	5027	3193	3269	3193
q22	366	294	295	294
Total cold run time: 117324 ms
Total hot run time: 39639 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4503	4262	4234	4234
q2	368	285	281	281
q3	3010	2884	2897	2884
q4	2035	1722	1723	1722
q5	5646	5549	5505	5505
q6	227	132	140	132
q7	2167	1827	1864	1827
q8	3283	3427	3441	3427
q9	8728	8872	8941	8872
q10	4066	3867	3775	3775
q11	603	503	501	501
q12	820	655	693	655
q13	16345	3165	3184	3165
q14	332	303	288	288
q15	535	486	496	486
q16	502	455	469	455
q17	1838	1537	1509	1509
q18	8140	7941	7704	7704
q19	1717	1562	1639	1562
q20	2921	1890	1885	1885
q21	5042	4906	4784	4784
q22	588	508	524	508
Total cold run time: 73416 ms
Total hot run time: 56161 ms

@doris-robot

TPC-DS: Total hot run time: 173267 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 29dadddc119b8f8a21b4bfdeb6a79104f8373165, data reload: false

query1	917	371	375	371
query2	6467	1915	1893	1893
query3	6630	202	211	202
query4	27673	17716	17600	17600
query5	3621	499	484	484
query6	276	181	169	169
query7	4582	290	296	290
query8	260	197	192	192
query9	8524	2465	2444	2444
query10	432	295	274	274
query11	10859	9989	9979	9979
query12	122	83	84	83
query13	1620	370	356	356
query14	10271	7841	7640	7640
query15	215	170	160	160
query16	7624	448	443	443
query17	1586	552	527	527
query18	1806	277	278	277
query19	191	141	142	141
query20	92	83	81	81
query21	206	101	105	101
query22	4403	4056	4004	4004
query23	34123	33445	33773	33445
query24	11281	2978	2900	2900
query25	630	410	381	381
query26	1124	156	148	148
query27	2298	272	283	272
query28	6735	2094	2086	2086
query29	837	434	428	428
query30	259	158	153	153
query31	983	796	739	739
query32	92	54	52	52
query33	767	358	324	324
query34	934	476	486	476
query35	904	760	764	760
query36	1147	969	929	929
query37	142	81	88	81
query38	2860	2747	2738	2738
query39	872	789	815	789
query40	207	123	113	113
query41	48	45	46	45
query42	127	96	102	96
query43	493	472	466	466
query44	1166	733	720	720
query45	207	177	180	177
query46	1100	740	743	740
query47	1877	1777	1760	1760
query48	366	290	296	290
query49	841	406	406	406
query50	796	405	404	404
query51	6792	6627	6657	6627
query52	100	87	90	87
query53	264	182	179	179
query54	914	438	439	438
query55	75	72	74	72
query56	283	274	276	274
query57	1122	1036	1062	1036
query58	270	282	273	273
query59	2923	2681	2686	2681
query60	315	278	290	278
query61	95	94	100	94
query62	795	639	633	633
query63	208	184	178	178
query64	9472	2251	1690	1690
query65	3207	3108	3107	3107
query66	748	325	326	325
query67	15266	15183	15074	15074
query68	4519	537	548	537
query69	454	308	304	304
query70	1168	1057	1108	1057
query71	371	272	280	272
query72	6898	5523	5850	5523
query73	740	323	322	322
query74	6052	5705	5656	5656
query75	3352	2672	2673	2672
query76	2218	913	917	913
query77	439	294	293	293
query78	9642	9938	8897	8897
query79	2275	503	513	503
query80	1511	473	479	473
query81	604	219	217	217
query82	725	136	138	136
query83	282	170	173	170
query84	244	91	77	77
query85	1684	317	297	297
query86	477	315	326	315
query87	3277	3066	3070	3066
query88	3895	2453	2461	2453
query89	389	288	290	288
query90	1715	192	195	192
query91	127	99	99	99
query92	61	50	48	48
query93	2702	529	535	529
query94	801	297	284	284
query95	356	264	263	263
query96	607	277	282	277
query97	3195	3041	3026	3026
query98	231	256	195	195
query99	1658	1274	1238	1238
Total cold run time: 277111 ms
Total hot run time: 173267 ms

@doris-robot

ClickBench: Total hot run time: 30.81 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 29dadddc119b8f8a21b4bfdeb6a79104f8373165, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.22	0.06	0.05
query4	1.67	0.08	0.08
query5	0.50	0.47	0.48
query6	1.13	0.72	0.73
query7	0.01	0.01	0.02
query8	0.05	0.04	0.05
query9	0.56	0.50	0.49
query10	0.56	0.54	0.54
query11	0.15	0.11	0.12
query12	0.15	0.13	0.12
query13	0.60	0.58	0.58
query14	0.75	0.80	0.77
query15	0.86	0.82	0.81
query16	0.37	0.36	0.39
query17	1.01	0.98	0.95
query18	0.23	0.22	0.22
query19	1.79	1.66	1.69
query20	0.02	0.01	0.01
query21	15.43	0.76	0.65
query22	4.10	7.41	2.28
query23	18.26	1.30	1.33
query24	2.06	0.23	0.23
query25	0.16	0.09	0.08
query26	0.31	0.20	0.21
query27	0.46	0.23	0.23
query28	13.36	1.01	1.00
query29	12.54	3.31	3.27
query30	0.25	0.06	0.05
query31	2.86	0.39	0.39
query32	3.27	0.48	0.47
query33	2.92	2.93	2.87
query34	17.10	4.28	4.38
query35	4.41	4.40	4.38
query36	0.66	0.50	0.50
query37	0.19	0.16	0.16
query38	0.15	0.15	0.15
query39	0.05	0.04	0.03
query40	0.14	0.13	0.12
query41	0.10	0.05	0.04
query42	0.05	0.04	0.05
query43	0.05	0.03	0.03
Total cold run time: 109.63 s
Total hot run time: 30.81 s

@hubgeter hubgeter force-pushed the feature_read_hive_rename_table_parquet_orc branch from 29daddd to 41fabdf Compare July 27, 2024 15:37
@hubgeter

run buildall

@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39625 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2733ba7049991873613e3630276fff0cc0d77501, data reload: false

------ Round 1 ----------------------------------
q1	18377	4451	4309	4309
q2	2020	203	198	198
q3	10504	1206	1143	1143
q4	10144	722	710	710
q5	7605	2754	2714	2714
q6	225	141	141	141
q7	972	598	595	595
q8	9217	1936	1961	1936
q9	8899	6584	6554	6554
q10	8890	3833	3780	3780
q11	464	251	260	251
q12	404	229	226	226
q13	18760	2991	3004	2991
q14	289	229	245	229
q15	520	468	484	468
q16	524	390	381	381
q17	1005	626	717	626
q18	8176	7531	7388	7388
q19	4895	1061	1105	1061
q20	678	337	349	337
q21	4906	3302	3315	3302
q22	355	287	285	285
Total cold run time: 117829 ms
Total hot run time: 39625 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4508	4251	4291	4251
q2	366	276	269	269
q3	3014	2890	2904	2890
q4	2028	1713	1756	1713
q5	5700	5598	5571	5571
q6	224	148	140	140
q7	2231	1901	1843	1843
q8	3299	3437	3778	3437
q9	8909	8893	8907	8893
q10	4147	3807	3838	3807
q11	600	499	507	499
q12	818	662	674	662
q13	16122	3136	3214	3136
q14	308	296	287	287
q15	524	496	492	492
q16	485	452	458	452
q17	1847	1522	1510	1510
q18	8193	7994	7824	7824
q19	1799	1673	1509	1509
q20	2929	1906	1861	1861
q21	6453	5014	4750	4750
q22	667	496	490	490
Total cold run time: 75171 ms
Total hot run time: 56286 ms

@doris-robot

TPC-DS: Total hot run time: 173012 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2733ba7049991873613e3630276fff0cc0d77501, data reload: false

query1	917	377	361	361
query2	6446	2006	1819	1819
query3	6658	205	218	205
query4	28471	17550	17548	17548
query5	3695	491	493	491
query6	289	189	160	160
query7	4571	285	292	285
query8	237	196	194	194
query9	8570	2465	2434	2434
query10	447	301	271	271
query11	11744	10090	10083	10083
query12	116	85	84	84
query13	1633	377	368	368
query14	10072	7717	6961	6961
query15	224	167	169	167
query16	7076	498	469	469
query17	945	568	568	568
query18	1934	295	297	295
query19	196	144	144	144
query20	92	90	86	86
query21	215	99	103	99
query22	4180	3957	3849	3849
query23	34155	34018	33722	33722
query24	11551	2999	2911	2911
query25	637	393	388	388
query26	1244	152	158	152
query27	2416	285	283	283
query28	6915	2101	2068	2068
query29	928	431	425	425
query30	264	152	154	152
query31	971	798	761	761
query32	99	59	58	58
query33	794	347	332	332
query34	884	486	498	486
query35	909	732	751	732
query36	1116	945	931	931
query37	157	84	84	84
query38	2986	2886	2804	2804
query39	923	883	856	856
query40	212	120	120	120
query41	45	46	46	46
query42	109	110	101	101
query43	518	460	464	460
query44	1249	723	731	723
query45	207	177	176	176
query46	1091	725	738	725
query47	1831	1748	1734	1734
query48	374	294	290	290
query49	864	413	422	413
query50	790	404	400	400
query51	6784	6693	6700	6693
query52	110	85	94	85
query53	259	183	190	183
query54	905	444	439	439
query55	75	72	75	72
query56	301	276	270	270
query57	1203	1026	1045	1026
query58	255	277	273	273
query59	2828	2600	2818	2600
query60	316	282	286	282
query61	97	96	94	94
query62	807	656	659	656
query63	207	183	177	177
query64	9517	2298	1704	1704
query65	3485	3155	3142	3142
query66	971	326	323	323
query67	15412	15080	14779	14779
query68	4548	515	533	515
query69	554	323	314	314
query70	1122	1111	1106	1106
query71	441	280	278	278
query72	8435	5626	5676	5626
query73	765	336	326	326
query74	6078	5641	5714	5641
query75	3598	2678	2693	2678
query76	2561	942	973	942
query77	635	346	299	299
query78	10593	10034	9037	9037
query79	8469	535	516	516
query80	1516	509	509	509
query81	593	224	221	221
query82	1408	138	129	129
query83	343	177	178	177
query84	273	80	79	79
query85	1843	317	307	307
query86	328	329	294	294
query87	3233	3054	3057	3054
query88	4778	2485	2477	2477
query89	423	285	285	285
query90	1837	194	200	194
query91	126	103	160	103
query92	66	48	49	48
query93	5715	530	540	530
query94	700	263	282	263
query95	345	264	267	264
query96	632	281	273	273
query97	3171	2998	3044	2998
query98	223	202	193	193
query99	1591	1264	1291	1264
Total cold run time: 293442 ms
Total hot run time: 173012 ms

@doris-robot

ClickBench: Total hot run time: 30.48 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2733ba7049991873613e3630276fff0cc0d77501, data reload: false

query1	0.04	0.04	0.03
query2	0.08	0.04	0.04
query3	0.23	0.06	0.06
query4	1.66	0.08	0.08
query5	0.51	0.51	0.48
query6	1.13	0.73	0.73
query7	0.02	0.01	0.01
query8	0.06	0.04	0.04
query9	0.56	0.50	0.48
query10	0.55	0.55	0.54
query11	0.16	0.11	0.12
query12	0.15	0.11	0.12
query13	0.60	0.58	0.59
query14	0.77	0.77	0.77
query15	0.86	0.81	0.82
query16	0.38	0.37	0.37
query17	0.96	1.01	1.00
query18	0.23	0.22	0.21
query19	1.86	1.74	1.76
query20	0.01	0.01	0.01
query21	15.66	0.80	0.68
query22	4.25	8.30	1.77
query23	18.28	1.45	1.26
query24	2.17	0.23	0.22
query25	0.16	0.09	0.09
query26	0.30	0.22	0.21
query27	0.46	0.24	0.23
query28	13.23	1.02	1.00
query29	12.64	3.25	3.29
query30	0.25	0.06	0.06
query31	2.85	0.39	0.39
query32	3.27	0.48	0.47
query33	2.92	2.88	2.98
query34	17.08	4.33	4.36
query35	4.41	4.38	4.40
query36	0.66	0.47	0.49
query37	0.18	0.16	0.15
query38	0.15	0.16	0.16
query39	0.05	0.04	0.04
query40	0.15	0.12	0.12
query41	0.10	0.04	0.04
query42	0.07	0.04	0.06
query43	0.05	0.04	0.04
Total cold run time: 110.16 s
Total hot run time: 30.48 s

@hubgeter hubgeter force-pushed the feature_read_hive_rename_table_parquet_orc branch from 2733ba7 to 41fabdf Compare July 28, 2024 10:19
@hubgeter

run buildall

@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39823 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit bcf4eaa6d9442783a8ff689f5c6ee38449481871, data reload: false

------ Round 1 ----------------------------------
q1	18030	5188	4475	4475
q2	2551	205	204	204
q3	11764	1255	1156	1156
q4	10409	796	734	734
q5	7578	2730	2863	2730
q6	224	143	141	141
q7	989	606	611	606
q8	9278	1937	1959	1937
q9	8957	6627	6661	6627
q10	8723	3834	3790	3790
q11	464	242	254	242
q12	405	219	215	215
q13	17737	2985	2996	2985
q14	287	235	242	235
q15	529	483	498	483
q16	492	398	384	384
q17	978	662	726	662
q18	8209	7340	7436	7340
q19	1393	1057	989	989
q20	694	323	345	323
q21	5012	3281	3339	3281
q22	346	298	284	284
Total cold run time: 115049 ms
Total hot run time: 39823 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4315	4290	4278	4278
q2	368	270	268	268
q3	2992	2770	2742	2742
q4	1938	1648	1617	1617
q5	5310	5346	5321	5321
q6	224	128	131	128
q7	2119	1737	1764	1737
q8	3202	3374	3337	3337
q9	8452	8423	8459	8423
q10	3924	3736	3713	3713
q11	610	524	488	488
q12	747	588	589	588
q13	17434	2960	2966	2960
q14	292	268	286	268
q15	523	469	482	469
q16	479	419	423	419
q17	1806	1517	1461	1461
q18	7606	7582	7617	7582
q19	1683	1580	1444	1444
q20	1964	1775	1758	1758
q21	4964	4722	4674	4674
q22	577	478	492	478
Total cold run time: 71529 ms
Total hot run time: 54153 ms

@doris-robot

TPC-DS: Total hot run time: 172930 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit bcf4eaa6d9442783a8ff689f5c6ee38449481871, data reload: false

query1	923	374	355	355
query2	6440	1943	1861	1861
query3	6663	206	215	206
query4	28266	17713	17576	17576
query5	4203	497	476	476
query6	277	167	150	150
query7	4579	294	288	288
query8	248	221	194	194
query9	8750	2468	2438	2438
query10	454	281	264	264
query11	11861	10045	10042	10042
query12	133	85	89	85
query13	1627	380	378	378
query14	10284	7747	7700	7700
query15	227	172	167	167
query16	7836	491	486	486
query17	1597	564	549	549
query18	2010	288	283	283
query19	199	148	148	148
query20	94	88	94	88
query21	213	101	103	101
query22	4150	4252	3897	3897
query23	34015	33171	32972	32972
query24	12225	2916	2887	2887
query25	698	393	399	393
query26	1803	149	155	149
query27	2982	274	278	274
query28	7331	2053	2033	2033
query29	1122	438	429	429
query30	290	153	151	151
query31	952	751	764	751
query32	96	55	57	55
query33	792	349	349	349
query34	906	470	475	470
query35	873	730	755	730
query36	1097	901	929	901
query37	297	83	78	78
query38	2848	2736	2740	2736
query39	901	792	831	792
query40	272	119	116	116
query41	49	47	46	46
query42	119	102	110	102
query43	516	471	455	455
query44	1208	726	728	726
query45	213	180	178	178
query46	1094	785	735	735
query47	1838	1751	1744	1744
query48	378	304	299	299
query49	1209	435	430	430
query50	828	408	405	405
query51	6706	6695	6668	6668
query52	101	94	95	94
query53	260	183	186	183
query54	978	463	451	451
query55	75	75	75	75
query56	328	286	293	286
query57	1137	1036	1023	1023
query58	388	256	276	256
query59	2914	2828	2684	2684
query60	310	284	292	284
query61	94	114	103	103
query62	843	651	658	651
query63	207	191	183	183
query64	10476	2281	1758	1758
query65	3262	3102	3090	3090
query66	1370	335	340	335
query67	15624	14667	14648	14648
query68	9384	567	580	567
query69	766	413	324	324
query70	1402	1083	1080	1080
query71	549	271	265	265
query72	9166	5608	5912	5608
query73	2248	328	327	327
query74	6162	5721	5724	5721
query75	6137	2736	2674	2674
query76	5629	963	942	942
query77	792	325	304	304
query78	9602	9123	8963	8963
query79	9445	534	536	534
query80	921	502	510	502
query81	584	225	214	214
query82	283	135	133	133
query83	344	177	179	177
query84	267	77	80	77
query85	989	316	300	300
query86	359	329	300	300
query87	3314	3166	3052	3052
query88	5082	2491	2490	2490
query89	496	285	285	285
query90	2056	197	196	196
query91	126	101	101	101
query92	59	48	51	48
query93	5945	556	558	556
query94	1047	285	268	268
query95	364	265	263	263
query96	625	279	273	273
query97	3157	3056	3029	3029
query98	217	206	203	203
query99	1519	1279	1264	1264
Total cold run time: 312095 ms
Total hot run time: 172930 ms

@doris-robot

ClickBench: Total hot run time: 30.46 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit bcf4eaa6d9442783a8ff689f5c6ee38449481871, data reload: false

query1	0.05	0.04	0.04
query2	0.08	0.03	0.04
query3	0.22	0.04	0.05
query4	1.68	0.09	0.08
query5	0.51	0.49	0.48
query6	1.13	0.72	0.73
query7	0.02	0.02	0.01
query8	0.05	0.04	0.04
query9	0.56	0.50	0.50
query10	0.53	0.55	0.54
query11	0.16	0.11	0.12
query12	0.15	0.12	0.12
query13	0.61	0.58	0.59
query14	0.76	0.78	0.78
query15	0.86	0.81	0.81
query16	0.36	0.37	0.36
query17	0.97	1.03	0.95
query18	0.23	0.22	0.21
query19	1.88	1.70	1.73
query20	0.02	0.01	0.03
query21	15.40	0.78	0.66
query22	3.88	7.67	1.89
query23	18.28	1.36	1.18
query24	2.19	0.24	0.23
query25	0.16	0.08	0.10
query26	0.30	0.22	0.21
query27	0.46	0.24	0.23
query28	13.20	1.02	1.02
query29	12.64	3.30	3.25
query30	0.25	0.06	0.06
query31	2.89	0.40	0.38
query32	3.28	0.49	0.47
query33	2.92	2.88	2.92
query34	17.00	4.33	4.38
query35	4.39	4.45	4.37
query36	0.65	0.49	0.49
query37	0.20	0.16	0.17
query38	0.17	0.16	0.16
query39	0.04	0.03	0.04
query40	0.17	0.13	0.13
query41	0.10	0.05	0.05
query42	0.06	0.05	0.05
query43	0.05	0.04	0.05
Total cold run time: 109.51 s
Total hot run time: 30.46 s

@hubgeter hubgeter force-pushed the feature_read_hive_rename_table_parquet_orc branch from bcf4eaa to 5cfcedd Compare July 29, 2024 07:25
@hubgeter

run buildall

@github-actions

clang-tidy review says "All clean, LGTM! 👍"

@github-actions

PR approved by anyone and no changes requested.

@morningman morningman merged commit 1157db4 into apache:master Aug 2, 2024
hubgeter added a commit to hubgeter/doris that referenced this pull request Aug 2, 2024
…es. (apache#38432)

hubgeter added a commit to hubgeter/doris that referenced this pull request Aug 2, 2024
…es. (apache#38432)

yiguolei pushed a commit that referenced this pull request Aug 5, 2024
…es. (#38432) (#38809)

bp #38432 

hubgeter added a commit to hubgeter/doris that referenced this pull request Oct 14, 2024
…es. (apache#38432)

hubgeter added a commit to hubgeter/doris that referenced this pull request Oct 16, 2024
…es. (apache#38432)

morningman pushed a commit that referenced this pull request Apr 14, 2025
…rtition tb cause be core. (#49966)

### What problem does this PR solve?
Related PR: #38432

Problem Summary:
When you query a partitioned Hive table in Parquet format with `set
hive_parquet_use_column_names = false`, the BE may crash with:
```
*** SIGABRT unknown detail explain (@0x2f59de) received by PID 3103198 (TID 3110278 OR 0x7f51c8e63640) from PID 3103198; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_master/doris/be/src/common/signal_handler.h:421
 1# 0x00007F55DFB45520 in /lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill at ./nptl/pthread_kill.c:89
 3# raise at ../sysdeps/posix/raise.c:27
 4# abort at ./stdlib/abort.c:81
 5# __gnu_cxx::__verbose_terminate_handler() [clone .cold] at ../../../../libstdc++-v3/libsupc++/vterminate.cc:75
 6# __cxxabiv1::__terminate(void (*)()) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
 7# 0x000055C8BD4E2041 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
 8# 0x000055C8BD4E2194 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
 9# 0x000055C8BD4E2586 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
10# std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.tcc:265
11# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:586
```
The cause is an out-of-bounds access that occurs when `get_next_block`
replaces the column names.
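
One plausible way such an out-of-bounds access can arise is sketched below. This is an assumption for illustration, not the actual `ParquetReader::get_next_block` logic: the helper `rename_by_position` and the partition column `dt` are hypothetical. Hive does not store partition columns in the data files, so a partitioned table has more columns than any of its files, and a position-based rename has to check the index before using it.

```python
# Hypothetical sketch: not the actual ParquetReader::get_next_block logic.

def rename_by_position(block_columns, file_columns):
    """Map each block column to the file column at the same position, with a bounds check."""
    renamed = []
    for position, name in enumerate(block_columns):
        if position < len(file_columns):
            renamed.append(file_columns[position])
        else:
            # For example a partition column: its value comes from the partition
            # path, not from the data file, so there is nothing to map it to.
            renamed.append(name)
    return renamed


table_columns = ["new_a", "b", "dt"]  # `dt` is a hypothetical partition column
file_columns = ["a", "b"]             # physical columns stored in the Parquet file
print(rename_by_position(table_columns, file_columns))  # ['a', 'b', 'dt']
```

Without the `position < len(file_columns)` check, the lookup for `dt` would index past the end of `file_columns`.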
github-actions bot pushed a commit that referenced this pull request Apr 14, 2025
…rtition tb cause be core. (#49966)

github-actions bot pushed a commit that referenced this pull request Apr 14, 2025
…rtition tb cause be core. (#49966)

### What problem does this PR solve?
related pr : #38432

Problem Summary:
when you query hive parquet format partition table, and `set
hive_parquet_use_column_names = false`, maybe you will get :
```
*** SIGABRT unknown detail explain (@0x2f59de) received by PID 3103198 (TID 3110278 OR 0x7f51c8e63640) from PID 3103198; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_master/doris/be/src/common/signal_handler.h:421
 1# 0x00007F55DFB45520 in /lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill at ./nptl/pthread_kill.c:89
 3# raise at ../sysdeps/posix/raise.c:27
 4# abort at ./stdlib/abort.c:81
 5# __gnu_cxx::__verbose_terminate_handler() [clone .cold] at ../../../../libstdc++-v3/libsupc++/vterminate.cc:75
 6# __cxxabiv1::__terminate(void (*)()) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
 7# 0x000055C8BD4E2041 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
 8# 0x000055C8BD4E2194 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
 9# 0x000055C8BD4E2586 in /mnt/disk1/doris-clusters/doris-master/output/be/lib/doris_be
10# std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.tcc:265
11# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:586
```
The cause is an out-of-bounds access that occurs when `get_next_block`
replaces the column names.
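
The commit message names the failure (an out-of-bounds access while `get_next_block` renames columns) but not its mechanism. As an illustration only, the sketch below assumes the block holds more columns than the table-to-file mapping covers, for example appended partition columns. `rename_by_position` and its parameters are hypothetical names, not Doris code, and this is not necessarily how #49966 fixes the crash.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not Doris code: restore table column names in a block
// by ordinal position. `block_names` may be longer than `table_names`
// (assumed here: partition columns appended to the block but absent from the
// file), so the loop is bounded by the shorter of the two vectors.
std::size_t rename_by_position(std::vector<std::string>& block_names,
                               const std::vector<std::string>& table_names) {
    const std::size_t n = std::min(block_names.size(), table_names.size());
    // Invariant: every index below n is valid for both vectors.
    assert(n <= block_names.size() && n <= table_names.size());
    for (std::size_t i = 0; i < n; ++i) {
        block_names[i] = table_names[i];
    }
    return n;  // number of columns actually renamed
}
```

Indexing both vectors with a count taken from only one of them is the kind of mistake that can surface inside `std::string` assignment, as in frame 10 of the trace above.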
seawinde pushed a commit to seawinde/doris that referenced this pull request Apr 17, 2025
…rtition tb cause be core. (apache#49966)

morningman pushed a commit that referenced this pull request Apr 23, 2025
…o read files, there will be multiple threads modify same object (#50161)

### What problem does this PR solve?
Related PR: #38432

Problem Summary:
In PR #38432, if the Parquet reader reads a file by column index and a file
column name differs from the table column name, the reader modifies
`_colname_to_value_range`. However, this object is shared by multiple vfile
scanners, and modifying it from multiple threads causes a BE core dump.
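
One common way to avoid this kind of race is to give each reader its own renamed copy instead of mutating the shared map. The sketch below is a minimal illustration under that assumption. `ValueRangeMap` (with `int` standing in for the real range type) and `remap_for_file` are hypothetical names, and the actual change in #50161 may differ.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-column predicate ranges; the real
// _colname_to_value_range uses a different value type.
using ValueRangeMap = std::unordered_map<std::string, int>;

// Builds a private copy of `shared` whose keys are renamed from table column
// names to file column names. `shared` is only read, never written, so
// several scanner threads may call this on the same map concurrently.
ValueRangeMap remap_for_file(
        const ValueRangeMap& shared,
        const std::unordered_map<std::string, std::string>& table_to_file) {
    ValueRangeMap local;
    local.reserve(shared.size());
    for (const auto& kv : shared) {
        auto it = table_to_file.find(kv.first);
        // Keep the original key when no rename is recorded for this column.
        const std::string& key = (it != table_to_file.end()) ? it->second : kv.first;
        local.emplace(key, kv.second);
    }
    // Renaming can only merge keys, never add new ones.
    assert(local.size() <= shared.size());
    return local;
}
```

The copy costs one map allocation per reader, which is typically small compared with reading a Parquet file, and it removes the need for a lock around the shared object.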
hubgeter added a commit to hubgeter/doris that referenced this pull request Apr 28, 2025
…o read files, there will be multiple threads modify same object (apache#50161)

morningman pushed a commit to hubgeter/doris that referenced this pull request Apr 29, 2025
…o read files, there will be multiple threads modify same object (apache#50161)

morningman pushed a commit to hubgeter/doris that referenced this pull request May 6, 2025
…o read files, there will be multiple threads modify same object (apache#50161)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…rtition tb cause be core. (apache#49966)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…o read files, there will be multiple threads modify same object (apache#50161)

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.6-merged dev/3.0.3-merged meta-change reviewed

6 participants