@suxiaogang223 suxiaogang223 commented Sep 3, 2025

What problem does this PR solve?

Problem Summary:
This pull request fixes the handling of empty-string null formats and delimiter properties for Hive external tables. When a Hive text table declares NULL DEFINED AS '', Doris now returns NULL for a stored NULL value instead of an empty string.

For a Hive text table like this:

CREATE TABLE test_empty_null_defined_text (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
NULL DEFINED AS ''
STORED AS TEXTFILE;

INSERT INTO TABLE test_empty_null_defined_text VALUES
  (1, 'Alice'),
  (2, NULL);

Query in Doris:

select * from test_empty_null_defined_text;

Before Result:

+------+-------+
| id   | name  |
+------+-------+
|    1 | Alice |
|    2 |       |
+------+-------+

After Result:

+------+-------+
| id   | name  |
+------+-------+
|    1 | Alice |
|    2 | NULL  |
+------+-------+
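
The before/after difference can be illustrated with a small Python sketch. This is not Doris code: `parse_text_row` is a hypothetical helper, and the sketch assumes that Hive stores the row `(2, NULL)` as `2<TAB>` followed by a newline, and that the old behaviour was equivalent to comparing fields against Hive's default null marker `\N` rather than the declared empty string.

```python
def parse_text_row(line, field_delim="\t", null_format="\\N"):
    """Split one text line and map fields equal to null_format to None.

    Hypothetical helper for illustration only; the default null_format
    mirrors Hive's default null marker, the two characters \\N.
    """
    fields = line.rstrip("\n").split(field_delim)
    return [None if field == null_format else field for field in fields]


# Assumed on-disk form of the two inserted rows.
rows = ["1\tAlice\n", "2\t\n"]

# Assumption: before the fix, the empty null format was effectively ignored.
before = [parse_text_row(r) for r in rows]

# After the fix, the declared null format '' is honoured.
after = [parse_text_row(r, null_format="") for r in rows]

print(before)
print(after)
```

With the empty string as the null marker, the second row's `name` field becomes `None` (shown as NULL in the result table) instead of an empty string.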

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Sep 3, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run buildall

@suxiaogang223 suxiaogang223 changed the title [fix](hive) support querying hive text table with 'serialization.format'='' [fix](hive) support querying hive text table with NULL DEFINED AS '' Sep 3, 2025
@suxiaogang223 suxiaogang223 changed the title [fix](hive) support querying hive text table with NULL DEFINED AS '' [fix](hive)fix querying hive text table with NULL DEFINED AS '' Sep 3, 2025
@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34052 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c3159aadd35e5add4e186680a461c614df94920b, data reload: false

------ Round 1 ----------------------------------
q1	17614	5228	5047	5047
q2	2013	357	239	239
q3	10185	1286	709	709
q4	10235	1017	522	522
q5	7562	2449	2333	2333
q6	183	166	136	136
q7	919	747	635	635
q8	9344	1395	1119	1119
q9	6895	5146	5365	5146
q10	6937	2366	1969	1969
q11	506	305	289	289
q12	367	372	236	236
q13	17790	3664	2998	2998
q14	244	259	224	224
q15	578	511	486	486
q16	429	427	381	381
q17	598	864	367	367
q18	7575	7164	6988	6988
q19	1610	964	588	588
q20	347	331	229	229
q21	3910	2572	2395	2395
q22	1067	1040	1016	1016
Total cold run time: 106908 ms
Total hot run time: 34052 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5192	5082	5077	5077
q2	263	335	231	231
q3	2101	2679	2284	2284
q4	1330	1755	1317	1317
q5	4165	4288	4547	4288
q6	225	201	151	151
q7	2048	2010	1930	1930
q8	2664	2733	2643	2643
q9	7302	7419	7262	7262
q10	3101	3281	2855	2855
q11	584	520	496	496
q12	678	793	664	664
q13	3544	4021	3272	3272
q14	290	307	269	269
q15	517	494	473	473
q16	540	581	441	441
q17	1147	1600	1364	1364
q18	7837	7570	7716	7570
q19	875	827	830	827
q20	2024	2051	1886	1886
q21	5056	4345	4395	4345
q22	1084	1050	1000	1000
Total cold run time: 52567 ms
Total hot run time: 50645 ms

@doris-robot

TPC-DS: Total hot run time: 186395 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c3159aadd35e5add4e186680a461c614df94920b, data reload: false

query1	1055	452	412	412
query2	6580	1739	1750	1739
query3	6765	228	221	221
query4	26146	23354	22760	22760
query5	4389	660	516	516
query6	351	247	238	238
query7	4654	528	327	327
query8	310	270	254	254
query9	8662	2910	2913	2910
query10	503	376	305	305
query11	15778	14914	14768	14768
query12	178	122	125	122
query13	1691	560	455	455
query14	8609	5798	5893	5798
query15	218	206	185	185
query16	7763	685	522	522
query17	1271	781	641	641
query18	2043	456	360	360
query19	270	199	184	184
query20	140	129	129	129
query21	215	129	117	117
query22	4325	4348	4193	4193
query23	34252	33090	33072	33072
query24	8110	2367	2432	2367
query25	575	518	455	455
query26	1087	282	166	166
query27	2700	512	363	363
query28	4334	2277	2241	2241
query29	747	613	502	502
query30	289	222	189	189
query31	903	796	710	710
query32	95	84	79	79
query33	591	397	353	353
query34	810	857	545	545
query35	825	816	778	778
query36	1008	1049	933	933
query37	140	111	90	90
query38	4089	4128	4055	4055
query39	1484	1444	1437	1437
query40	231	139	130	130
query41	65	60	61	60
query42	130	153	120	120
query43	527	507	450	450
query44	1394	863	855	855
query45	181	173	183	173
query46	862	1019	660	660
query47	1786	1907	1726	1726
query48	411	439	321	321
query49	744	504	420	420
query50	651	680	409	409
query51	4131	4198	4128	4128
query52	124	120	111	111
query53	248	278	202	202
query54	610	609	534	534
query55	99	98	87	87
query56	354	330	337	330
query57	1208	1217	1141	1141
query58	292	287	287	287
query59	2598	2777	2509	2509
query60	359	353	374	353
query61	166	155	156	155
query62	832	744	682	682
query63	228	195	194	194
query64	4040	1143	837	837
query65	4251	4183	4149	4149
query66	1029	447	349	349
query67	15569	15358	15161	15161
query68	9045	954	586	586
query69	503	337	297	297
query70	1190	1135	1159	1135
query71	573	346	333	333
query72	5771	5016	4980	4980
query73	750	613	362	362
query74	9102	9102	8627	8627
query75	4225	3122	2666	2666
query76	3624	1153	747	747
query77	827	425	336	336
query78	9632	9781	8860	8860
query79	1764	845	588	588
query80	689	597	524	524
query81	469	263	232	232
query82	202	145	116	116
query83	298	260	249	249
query84	305	108	92	92
query85	856	472	438	438
query86	358	329	321	321
query87	4347	4263	4238	4238
query88	2856	2313	2220	2220
query89	403	334	294	294
query90	2059	228	221	221
query91	161	171	129	129
query92	94	80	78	78
query93	1135	994	679	679
query94	676	405	324	324
query95	410	344	331	331
query96	482	600	280	280
query97	2604	2667	2570	2570
query98	247	230	228	228
query99	1408	1459	1290	1290
Total cold run time: 274140 ms
Total hot run time: 186395 ms

@doris-robot

ClickBench: Total hot run time: 33.09 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c3159aadd35e5add4e186680a461c614df94920b, data reload: false

query1	0.05	0.06	0.05
query2	0.10	0.05	0.05
query3	0.26	0.08	0.08
query4	1.61	0.12	0.12
query5	0.46	0.41	0.43
query6	1.20	0.63	0.68
query7	0.04	0.03	0.03
query8	0.05	0.04	0.04
query9	0.59	0.55	0.52
query10	0.59	0.59	0.57
query11	0.17	0.12	0.11
query12	0.15	0.13	0.12
query13	0.63	0.62	0.61
query14	0.79	0.85	0.83
query15	0.87	0.85	0.88
query16	0.40	0.43	0.38
query17	1.04	1.09	1.07
query18	0.23	0.20	0.20
query19	1.94	1.87	1.90
query20	0.02	0.01	0.02
query21	15.44	0.96	0.58
query22	0.76	1.20	0.74
query23	14.85	1.37	0.60
query24	6.64	1.13	1.41
query25	0.47	0.30	0.15
query26	0.64	0.14	0.13
query27	0.06	0.06	0.05
query28	10.19	0.97	0.43
query29	12.58	3.94	3.26
query30	3.06	2.98	2.95
query31	2.83	0.58	0.38
query32	3.23	0.56	0.47
query33	3.03	3.11	3.13
query34	16.11	5.42	4.82
query35	4.98	4.87	4.96
query36	0.68	0.51	0.49
query37	0.10	0.07	0.08
query38	0.06	0.05	0.04
query39	0.04	0.03	0.03
query40	0.19	0.15	0.15
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 107.3 s
Total hot run time: 33.09 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.84% (17206/33192)
Line Coverage 37.27% (157116/421568)
Region Coverage 31.89% (119901/375955)
Branch Coverage 33.29% (52662/158189)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 33.33% (1/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 70.63% (23022/32593)
Line Coverage 56.98% (240047/421301)
Region Coverage 52.26% (199270/381326)
Branch Coverage 54.04% (85960/159079)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 75.00% (3/4) 🎉
Increment coverage report
Complete coverage report

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34128 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 5d75b126bbcd54f6ab72a9bebbedacfce6b9e313, data reload: false

------ Round 1 ----------------------------------
q1	17611	5208	5072	5072
q2	2027	343	211	211
q3	10203	1330	711	711
q4	10228	1033	542	542
q5	7555	2434	2371	2371
q6	178	168	137	137
q7	950	761	616	616
q8	9343	1413	1133	1133
q9	7109	5096	5187	5096
q10	6896	2412	1986	1986
q11	474	304	282	282
q12	352	376	229	229
q13	17776	3677	3047	3047
q14	245	250	223	223
q15	557	513	491	491
q16	435	454	377	377
q17	606	872	369	369
q18	7636	7283	7074	7074
q19	1257	950	588	588
q20	351	347	228	228
q21	3828	2587	2337	2337
q22	1077	1051	1008	1008
Total cold run time: 106694 ms
Total hot run time: 34128 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5188	5112	5152	5112
q2	247	334	234	234
q3	2172	2737	2301	2301
q4	1364	1799	1389	1389
q5	4223	4400	4618	4400
q6	236	187	148	148
q7	2107	1970	1888	1888
q8	2678	2587	2648	2587
q9	7469	7542	7327	7327
q10	3090	3276	2874	2874
q11	590	542	504	504
q12	681	803	639	639
q13	3512	3909	3256	3256
q14	306	332	278	278
q15	534	473	481	473
q16	479	526	448	448
q17	1233	1597	1413	1413
q18	8196	7787	7610	7610
q19	816	814	871	814
q20	2013	2146	1964	1964
q21	5035	4720	4477	4477
q22	1094	1083	1022	1022
Total cold run time: 53263 ms
Total hot run time: 51158 ms

@doris-robot

TPC-DS: Total hot run time: 186808 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 5d75b126bbcd54f6ab72a9bebbedacfce6b9e313, data reload: false

query1	1067	439	410	410
query2	6596	1701	1742	1701
query3	6759	228	223	223
query4	26253	23126	22989	22989
query5	4397	668	523	523
query6	349	263	232	232
query7	4652	522	311	311
query8	314	273	260	260
query9	8679	2951	2993	2951
query10	526	350	303	303
query11	16011	14941	14662	14662
query12	180	130	126	126
query13	1684	590	452	452
query14	9315	5910	5862	5862
query15	219	190	179	179
query16	7760	683	518	518
query17	1241	760	625	625
query18	2060	437	337	337
query19	204	203	175	175
query20	143	124	124	124
query21	212	138	121	121
query22	4018	4211	4042	4042
query23	34300	33255	32996	32996
query24	8204	2369	2404	2369
query25	610	515	462	462
query26	1262	285	165	165
query27	2712	510	356	356
query28	4385	2298	2272	2272
query29	784	619	492	492
query30	296	220	203	203
query31	914	816	725	725
query32	92	78	79	78
query33	573	391	366	366
query34	810	881	538	538
query35	842	826	744	744
query36	977	1025	935	935
query37	128	113	94	94
query38	4040	4135	4019	4019
query39	1503	1489	1434	1434
query40	238	137	130	130
query41	63	70	60	60
query42	132	117	118	117
query43	529	553	470	470
query44	1381	883	878	878
query45	186	181	170	170
query46	857	1020	663	663
query47	1770	1805	1761	1761
query48	390	432	326	326
query49	765	511	449	449
query50	656	686	418	418
query51	4256	4144	4071	4071
query52	115	116	103	103
query53	241	276	206	206
query54	628	619	564	564
query55	102	96	94	94
query56	391	340	334	334
query57	1249	1223	1154	1154
query58	298	319	284	284
query59	2580	2683	2579	2579
query60	374	355	345	345
query61	167	158	164	158
query62	837	720	665	665
query63	233	196	201	196
query64	4540	1170	856	856
query65	4302	4255	4523	4255
query66	1101	457	373	373
query67	15541	15555	15116	15116
query68	9730	936	591	591
query69	498	331	310	310
query70	1278	1128	1109	1109
query71	564	344	329	329
query72	5663	5057	5194	5057
query73	766	640	364	364
query74	9293	9213	8726	8726
query75	4266	3082	2641	2641
query76	4385	1160	750	750
query77	1008	413	333	333
query78	9592	9883	8835	8835
query79	1920	857	593	593
query80	688	588	568	568
query81	478	278	254	254
query82	454	145	112	112
query83	290	270	256	256
query84	304	113	99	99
query85	907	475	437	437
query86	362	317	306	306
query87	4280	4247	4235	4235
query88	2859	2269	2245	2245
query89	412	329	293	293
query90	1930	237	240	237
query91	164	172	141	141
query92	91	76	75	75
query93	1440	980	667	667
query94	708	438	330	330
query95	417	342	357	342
query96	488	648	281	281
query97	2650	2720	2592	2592
query98	249	226	220	220
query99	1441	1441	1311	1311
Total cold run time: 278568 ms
Total hot run time: 186808 ms

@doris-robot

ClickBench: Total hot run time: 32.68 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5d75b126bbcd54f6ab72a9bebbedacfce6b9e313, data reload: false

query1	0.06	0.05	0.05
query2	0.10	0.06	0.05
query3	0.25	0.09	0.08
query4	1.61	0.12	0.12
query5	0.45	0.42	0.42
query6	1.19	0.64	0.64
query7	0.04	0.03	0.03
query8	0.05	0.05	0.05
query9	0.61	0.53	0.51
query10	0.58	0.59	0.58
query11	0.17	0.12	0.11
query12	0.16	0.13	0.12
query13	0.64	0.63	0.61
query14	0.80	0.84	0.84
query15	0.88	0.85	0.85
query16	0.41	0.40	0.39
query17	1.03	1.04	1.04
query18	0.22	0.21	0.22
query19	1.95	1.82	1.86
query20	0.01	0.01	0.01
query21	15.40	0.99	0.58
query22	0.81	1.10	0.71
query23	14.91	1.37	0.65
query24	7.13	0.69	0.59
query25	0.48	0.13	0.10
query26	0.75	0.17	0.14
query27	0.06	0.06	0.06
query28	9.57	0.92	0.43
query29	12.59	3.82	3.24
query30	3.10	3.03	3.01
query31	2.82	0.59	0.40
query32	3.24	0.55	0.48
query33	3.14	3.12	3.07
query34	16.29	5.53	4.91
query35	4.89	4.91	5.00
query36	0.71	0.52	0.50
query37	0.11	0.08	0.07
query38	0.07	0.05	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.14
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 107.66 s
Total hot run time: 32.68 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 25.00% (1/4) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.84% (17205/33189)
Line Coverage 37.27% (157128/421599)
Region Coverage 31.87% (119829/375976)
Branch Coverage 33.28% (52657/158201)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 33.33% (1/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 70.68% (23034/32590)
Line Coverage 56.97% (240032/421330)
Region Coverage 52.37% (199701/381347)
Branch Coverage 54.05% (85992/159091)

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 75.00% (3/4) 🎉
Increment coverage report
Complete coverage report

@github-actions
Contributor

github-actions bot commented Sep 3, 2025

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Sep 3, 2025
@github-actions
Contributor

github-actions bot commented Sep 3, 2025

PR approved by anyone and no changes requested.

@morningman morningman merged commit db955c2 into apache:master Sep 4, 2025
27 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request Sep 4, 2025
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Sep 5, 2025
…he#55626)

morrySnow pushed a commit that referenced this pull request Sep 5, 2025
… AS '' #55626 (#55661)

Cherry-picked from #55626

Co-authored-by: Socrates <suyiteng@selectdb.com>
wenzhenghu pushed a commit to wenzhenghu/doris that referenced this pull request Sep 8, 2025
…he#55626)

@morrySnow morrySnow mentioned this pull request Sep 22, 2025
@suxiaogang223 suxiaogang223 deleted the fix_empty_null_format branch September 23, 2025 03:17
Labels

approved Indicates a PR has been approved by one committer. dev/3.0.9-merged dev/3.1.1-merged reviewed

8 participants