@hubgeter
Contributor

Backport of #38432

Proposed changes

Add the `hive_parquet_use_column_names` and `hive_orc_use_column_names` session variables so that a Hive table can still be read correctly after one of its columns has been renamed in Hive.

These two session variables are modeled on `parquet_use_column_names` and `orc_use_column_names` in the Trino Hive connector.

Both variables default to true, in which case columns in Parquet/ORC files are matched by name. When a variable is set to false, columns in files of the corresponding format are accessed by their ordinal position in the Hive table definition.

For example:

In Hive:
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp  change column  a new_a int;
hive> insert into table tmp values(2,"4");

In Doris:
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
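
The two lookup modes in the example above can be modeled with a small Python sketch. This is only an illustration of the idea, not the Doris reader code: the in-memory "data files", `read_rows`, and `scan` are hypothetical names invented here, and details such as case sensitivity, type changes, and nested columns are ignored.

```python
# Hypothetical stand-ins for the two Parquet files written by the Hive inserts.
# Each "file" records the column names in effect when it was written, plus its rows.
old_file = (["a", "b"], [(1, "2")])      # written before the rename
new_file = (["new_a", "b"], [(2, "4")])  # written after the rename

# Current Hive table definition after `change column a new_a int`.
table_columns = ["new_a", "b"]


def read_rows(table_columns, data_file, use_column_names):
    """Project one file onto the table schema, by name or by ordinal position."""
    file_columns, rows = data_file
    result = []
    for row in rows:
        if use_column_names:
            # Name-based lookup: a table column absent from the file reads as NULL.
            by_name = dict(zip(file_columns, row))
            result.append(tuple(by_name.get(name) for name in table_columns))
        else:
            # Position-based lookup: the i-th table column reads the i-th file column.
            result.append(
                tuple(row[i] if i < len(row) else None
                      for i in range(len(table_columns)))
            )
    return result


def scan(use_column_names):
    """Read every file of the table with the chosen lookup mode."""
    rows = []
    for data_file in (old_file, new_file):
        rows.extend(read_rows(table_columns, data_file, use_column_names))
    return rows


print(scan(True))   # name-based, like hive_parquet_use_column_names=true
print(scan(False))  # position-based, like hive_parquet_use_column_names=false
```

With name-based lookup the old file has no column called `new_a`, so its row yields a NULL (`None`) in that column; with position-based lookup the first file column is used regardless of its stored name, which matches the two result sets shown above.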

In Hive 3, `set parquet.column.index.access = true/false` and `set orc.force.positional.evolution = true/false` control how such a table is read, in a similar way to these two session variables. However, when a field inside a struct column of a Parquet table is renamed, Hive and Doris behave differently.

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@hubgeter
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 40304 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 27231b3972b761be581a760fbce0148082129c6d, data reload: false

------ Round 1 ----------------------------------
q1	18229	9395	7342	7342
q2	2151	189	151	151
q3	10652	1132	1108	1108
q4	10493	740	681	681
q5	7764	2801	2759	2759
q6	230	147	143	143
q7	961	633	641	633
q8	9356	1879	1902	1879
q9	6513	6413	6363	6363
q10	6926	2331	2283	2283
q11	441	240	243	240
q12	402	206	214	206
q13	17766	2918	2942	2918
q14	244	212	205	205
q15	558	515	510	510
q16	507	415	417	415
q17	967	645	578	578
q18	7375	6609	6652	6609
q19	2357	942	890	890
q20	569	272	273	272
q21	3939	3180	3141	3141
q22	1088	1005	978	978
Total cold run time: 109488 ms
Total hot run time: 40304 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7382	7210	7187	7187
q2	317	222	221	221
q3	2860	2683	2717	2683
q4	1911	1675	1685	1675
q5	5369	5378	5410	5378
q6	217	138	137	137
q7	2092	1665	1639	1639
q8	3179	3356	3380	3356
q9	8511	8488	8441	8441
q10	3405	3381	3368	3368
q11	575	479	456	456
q12	763	569	568	568
q13	16916	2979	2960	2960
q14	280	266	255	255
q15	543	515	518	515
q16	482	451	439	439
q17	1784	1569	1556	1556
q18	7803	7434	7269	7269
q19	1648	1376	1410	1376
q20	1963	1790	1842	1790
q21	5108	4896	4981	4896
q22	1087	992	989	989
Total cold run time: 74195 ms
Total hot run time: 57154 ms

@doris-robot

TPC-DS: Total hot run time: 188362 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 27231b3972b761be581a760fbce0148082129c6d, data reload: false

query1	962	364	355	355
query2	6548	1966	1928	1928
query3	6700	208	215	208
query4	34045	23354	23396	23354
query5	4332	447	449	447
query6	253	173	151	151
query7	4611	303	300	300
query8	240	199	191	191
query9	9503	2658	2625	2625
query10	482	273	278	273
query11	18314	15228	15325	15228
query12	155	95	97	95
query13	1696	458	428	428
query14	9769	6585	7176	6585
query15	219	183	174	174
query16	7749	479	486	479
query17	1606	569	545	545
query18	1994	295	300	295
query19	220	146	145	145
query20	116	106	106	106
query21	209	104	98	98
query22	4561	4342	4126	4126
query23	34659	34111	33840	33840
query24	11479	2852	2796	2796
query25	638	368	388	368
query26	1652	154	156	154
query27	2729	295	292	292
query28	7937	2136	2117	2117
query29	972	419	406	406
query30	309	153	151	151
query31	1038	776	826	776
query32	96	51	59	51
query33	761	285	269	269
query34	1006	481	506	481
query35	858	741	722	722
query36	1101	929	935	929
query37	131	72	76	72
query38	3998	3867	3820	3820
query39	1495	1414	1443	1414
query40	291	94	93	93
query41	51	42	43	42
query42	111	94	96	94
query43	516	476	472	472
query44	1257	763	762	762
query45	192	167	163	163
query46	1104	702	680	680
query47	1902	1818	1817	1817
query48	405	355	327	327
query49	1245	372	364	364
query50	797	398	402	398
query51	7065	6839	6798	6798
query52	103	92	85	85
query53	252	187	182	182
query54	1145	453	444	444
query55	76	74	70	70
query56	246	238	237	237
query57	1271	1151	1148	1148
query58	235	229	229	229
query59	3151	2841	2730	2730
query60	294	259	263	259
query61	97	95	98	95
query62	868	651	639	639
query63	211	187	186	186
query64	5222	602	599	599
query65	3251	3159	3185	3159
query66	1121	305	306	305
query67	15786	15667	15372	15372
query68	4468	544	526	526
query69	428	291	287	287
query70	1119	1115	1129	1115
query71	332	273	283	273
query72	6510	4002	4261	4002
query73	759	338	345	338
query74	10126	8961	8941	8941
query75	3378	2626	2629	2626
query76	2645	1014	881	881
query77	410	310	300	300
query78	9937	9258	9257	9257
query79	5350	572	587	572
query80	2022	453	459	453
query81	562	226	229	226
query82	951	132	130	130
query83	245	132	134	132
query84	300	82	77	77
query85	2251	276	273	273
query86	510	283	293	283
query87	4456	4298	4253	4253
query88	4761	2352	2356	2352
query89	439	282	285	282
query90	1905	185	184	184
query91	138	103	104	103
query92	70	48	45	45
query93	5128	550	524	524
query94	940	290	288	288
query95	339	242	243	242
query96	618	284	277	277
query97	3284	3087	3116	3087
query98	224	199	198	198
query99	1850	1292	1304	1292
Total cold run time: 309863 ms
Total hot run time: 188362 ms

@hubgeter force-pushed the pick_30_feature_read_hive_rename_table_parquet_orc branch from 27231b3 to 4f8ea73 on October 16, 2024 08:26
@hubgeter
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 40271 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4f8ea7332d0fc2f70c08539909d0ea3d4ea19f01, data reload: false

------ Round 1 ----------------------------------
q1	17974	7488	7314	7314
q2	2056	153	143	143
q3	10737	1049	1142	1049
q4	10542	772	714	714
q5	7733	2769	2851	2769
q6	226	144	145	144
q7	1015	642	623	623
q8	9343	1890	1963	1890
q9	6497	6401	6400	6400
q10	7004	2323	2310	2310
q11	451	250	251	250
q12	406	218	221	218
q13	17761	2966	2979	2966
q14	242	214	207	207
q15	552	502	510	502
q16	488	411	405	405
q17	972	593	612	593
q18	7196	6612	6575	6575
q19	3709	997	934	934
q20	575	272	271	271
q21	3960	3077	3040	3040
q22	1064	1003	954	954
Total cold run time: 110503 ms
Total hot run time: 40271 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7435	7232	7210	7210
q2	325	231	230	230
q3	2991	2839	2917	2839
q4	2088	1856	1780	1780
q5	5697	5695	5702	5695
q6	220	141	143	141
q7	2178	1758	1748	1748
q8	3329	3606	3453	3453
q9	8910	8862	8770	8770
q10	3547	3509	3522	3509
q11	566	469	480	469
q12	787	611	610	610
q13	16453	3071	3161	3071
q14	317	281	277	277
q15	573	515	516	515
q16	520	453	457	453
q17	1843	1640	1617	1617
q18	8226	7600	7542	7542
q19	1682	1439	1408	1408
q20	2065	1860	1855	1855
q21	5438	5362	5345	5345
q22	1138	1016	1009	1009
Total cold run time: 76328 ms
Total hot run time: 59546 ms

@doris-robot

TPC-DS: Total hot run time: 192148 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4f8ea7332d0fc2f70c08539909d0ea3d4ea19f01, data reload: false

query1	1298	915	893	893
query2	6274	2009	1947	1947
query3	10778	3895	3810	3810
query4	66840	29303	23584	23584
query5	5516	461	452	452
query6	466	175	175	175
query7	6313	305	298	298
query8	307	201	207	201
query9	9527	2690	2631	2631
query10	488	284	265	265
query11	18243	15262	15721	15262
query12	160	96	98	96
query13	1614	451	430	430
query14	10912	6303	7413	6303
query15	212	170	174	170
query16	7251	534	509	509
query17	1044	554	547	547
query18	1792	309	303	303
query19	187	142	143	142
query20	111	106	104	104
query21	214	102	100	100
query22	4474	4397	4346	4346
query23	34197	33484	33889	33484
query24	5616	2914	2847	2847
query25	516	406	402	402
query26	677	166	162	162
query27	1707	307	292	292
query28	4062	2183	2160	2160
query29	667	429	429	429
query30	232	151	150	150
query31	982	778	810	778
query32	70	55	53	53
query33	431	300	286	286
query34	905	506	477	477
query35	858	734	717	717
query36	1028	876	910	876
query37	129	79	77	77
query38	3943	3824	3952	3824
query39	1486	1433	1409	1409
query40	205	99	97	97
query41	47	44	41	41
query42	125	98	99	98
query43	515	472	481	472
query44	1135	782	793	782
query45	197	164	158	158
query46	1137	715	721	715
query47	1893	1826	1828	1826
query48	425	351	338	338
query49	695	382	373	373
query50	830	408	405	405
query51	7076	6891	6771	6771
query52	106	86	92	86
query53	262	189	191	189
query54	579	457	466	457
query55	76	77	78	77
query56	280	265	269	265
query57	1226	1144	1171	1144
query58	230	230	245	230
query59	3063	2820	2809	2809
query60	281	280	278	278
query61	122	149	105	105
query62	777	637	677	637
query63	211	176	185	176
query64	1595	623	601	601
query65	3246	3145	3170	3145
query66	709	304	310	304
query67	15809	15264	15262	15262
query68	4046	546	543	543
query69	634	274	289	274
query70	1095	1072	1126	1072
query71	431	274	271	271
query72	7798	3877	4027	3877
query73	769	339	343	339
query74	10204	8819	8968	8819
query75	4151	2654	2652	2652
query76	3199	857	858	857
query77	762	294	295	294
query78	9974	9219	9039	9039
query79	8016	580	588	580
query80	1242	440	446	440
query81	540	222	217	217
query82	1360	127	127	127
query83	386	132	132	132
query84	291	85	81	81
query85	1661	301	293	293
query86	435	296	286	286
query87	4524	4269	4318	4269
query88	5643	2395	2415	2395
query89	422	288	285	285
query90	2130	190	179	179
query91	134	105	104	104
query92	66	44	49	44
query93	6542	551	525	525
query94	1049	277	291	277
query95	361	281	242	242
query96	641	289	280	280
query97	3304	3069	3112	3069
query98	223	190	203	190
query99	1770	1274	1264	1264
Total cold run time: 338182 ms
Total hot run time: 192148 ms

@morningman merged commit 8bec0e9 into apache:branch-3.0 on Oct 17, 2024