@hubgeter
Contributor

@hubgeter hubgeter commented Mar 18, 2025

What problem does this PR solve?

Problem Summary:
Initial support for Hive org.openx.data.jsonserde.JsonSerDe (https://github.com/rcongiu/Hive-JSON-Serde).
Read behavior is similar to PR #43469.

The notes below follow the feature list in the linked README.
Supported:

  1. Querying Complex Fields
  2. Importing Malformed Data (serde prop: ignore.malformed.json)
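
To illustrate the `ignore.malformed.json` property from the supported list, here is a minimal Java sketch. It assumes the behavior described in the Hive-JSON-Serde README: with the property set to true, a malformed row is kept with every column NULL instead of failing the read. The class `MalformedRowSketch`, the method `readRow`, and the toy parser in `main` are hypothetical and are not Doris code; the JSON parser is passed in so the sketch stays independent of any particular library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not Doris code.
final class MalformedRowSketch {

    /**
     * Reads one line into a row with one value per column.
     * The parser is expected to throw a RuntimeException on malformed input.
     */
    static List<String> readRow(String line,
                                List<String> columns,
                                Function<String, Map<String, String>> parser,
                                boolean ignoreMalformed) {
        Map<String, String> fields;
        try {
            fields = parser.apply(line);
        } catch (RuntimeException e) {
            if (!ignoreMalformed) {
                throw e; // default: the bad row fails the read
            }
            fields = Collections.emptyMap(); // assumed: keep the row, all columns NULL
        }
        List<String> row = new ArrayList<>();
        for (String column : columns) {
            row.add(fields.get(column)); // a missing key becomes null
        }
        return row;
    }

    public static void main(String[] args) {
        // Toy parser for the demo only: it checks the outer braces and nothing else.
        Function<String, Map<String, String>> toyParser = line -> {
            if (!line.startsWith("{") || !line.endsWith("}")) {
                throw new IllegalArgumentException("malformed JSON: " + line);
            }
            return Collections.singletonMap("id", "1");
        };
        List<String> columns = Arrays.asList("id", "name");
        System.out.println(readRow("{\"id\": 1}", columns, toyParser, true));
        System.out.println(readRow("{\"id\": 1", columns, toyParser, true));
    }
}
```

Passing `ignoreMalformed = false` with the second input rethrows the parser's exception, which corresponds to the SerDe's default of failing on malformed data.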

Not supported; these settings are ignored and do not affect query results:

  1. dots.in.keys
  2. Case Sensitivity in mappings
  3. Mapping Hive Keywords

Not supported; the query reports an error:

  1. Using Arrays
  2. Promoting a Scalar to an Array
    Error: [DATA_QUALITY_ERROR]JSON data is array-object, strip_outer_array must be TRUE.

To let users handle JSON rows that cannot be parsed, this PR introduces a session variable: read_hive_json_in_one_column (default: false). When it is true, each whole line of JSON is read into the first column, and users can process the line themselves, for example with JSON_PARSE. The first column of the table must be of type string. The variable currently takes effect only for org.openx.data.jsonserde.JsonSerDe.
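
The effect of the session variable can be sketched as a small decision function. This is a hedged illustration, not the code in this PR: the class `OpenxReadModeSketch`, the enum `ReadMode`, and the method `chooseForJsonTable` are hypothetical names, and the sketch assumes the caller has already established that the table uses some JSON SerDe.

```java
// Hypothetical sketch, not Doris code.
final class OpenxReadModeSketch {

    static final String OPENX_JSON_SERDE = "org.openx.data.jsonserde.JsonSerDe";

    enum ReadMode {
        PARSE_JSON_FIELDS,          // map JSON keys to table columns
        WHOLE_LINE_IN_FIRST_COLUMN  // put the raw line into the first column
    }

    /**
     * Picks the read mode for a table that is already known to use a JSON SerDe.
     * The session variable only has an effect for the OpenX SerDe.
     */
    static ReadMode chooseForJsonTable(String serDeLib, boolean readHiveJsonInOneColumn) {
        if (OPENX_JSON_SERDE.equals(serDeLib) && readHiveJsonInOneColumn) {
            return ReadMode.WHOLE_LINE_IN_FIRST_COLUMN;
        }
        return ReadMode.PARSE_JSON_FIELDS;
    }

    public static void main(String[] args) {
        System.out.println(chooseForJsonTable(OPENX_JSON_SERDE, true));
        System.out.println(chooseForJsonTable(OPENX_JSON_SERDE, false));
        System.out.println(chooseForJsonTable("org.apache.hive.hcatalog.data.JsonSerDe", true));
    }
}
```

Keeping the default at false means existing queries keep parsing fields as before, and the whole-line mode is an explicit opt-in per session.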

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 18, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hubgeter
Contributor Author

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 82.88% (1075/1297)
Line Coverage: 65.85% (17752/26959)
Region Coverage: 65.18% (8743/13414)
Branch Coverage: 55.05% (4713/8562)
Coverage Report: http://coverage.selectdb-in.cc/coverage/ae559aab413b889c7187f87759ca9d1bb82309f6_ae559aab413b889c7187f87759ca9d1bb82309f6_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 32970 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ae559aab413b889c7187f87759ca9d1bb82309f6, data reload: false

------ Round 1 ----------------------------------
q1	24606	5146	5122	5122
q2	2051	340	194	194
q3	10387	1279	715	715
q4	10230	1000	544	544
q5	7565	2417	2393	2393
q6	192	173	139	139
q7	963	770	625	625
q8	9328	1338	1158	1158
q9	5048	4826	4876	4826
q10	6824	2316	1893	1893
q11	479	283	258	258
q12	356	357	223	223
q13	17782	3687	3069	3069
q14	227	224	220	220
q15	552	505	502	502
q16	617	606	593	593
q17	589	882	356	356
q18	6941	6511	6442	6442
q19	1921	975	571	571
q20	316	331	194	194
q21	2806	2245	1943	1943
q22	1091	1029	990	990
Total cold run time: 110871 ms
Total hot run time: 32970 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5358	5211	5215	5211
q2	243	340	243	243
q3	2174	2673	2248	2248
q4	1418	1802	1392	1392
q5	4246	4283	4537	4283
q6	217	174	131	131
q7	2078	1917	1794	1794
q8	2652	2640	2582	2582
q9	7224	7333	7171	7171
q10	3009	3220	2808	2808
q11	580	529	494	494
q12	706	758	602	602
q13	3489	3844	3299	3299
q14	281	286	288	286
q15	536	499	491	491
q16	647	667	671	667
q17	1173	1638	1356	1356
q18	7820	7535	7452	7452
q19	823	811	849	811
q20	2031	2076	1870	1870
q21	5464	4844	4826	4826
q22	1087	1038	1008	1008
Total cold run time: 53256 ms
Total hot run time: 51025 ms

@doris-robot

TPC-DS: Total hot run time: 185231 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ae559aab413b889c7187f87759ca9d1bb82309f6, data reload: false

query1	1011	474	464	464
query2	6570	1896	1885	1885
query3	6791	213	219	213
query4	26727	23274	23514	23274
query5	4315	673	475	475
query6	297	193	208	193
query7	4594	495	295	295
query8	292	235	214	214
query9	8596	2576	2595	2576
query10	457	306	254	254
query11	15470	15202	14974	14974
query12	163	109	103	103
query13	1652	512	396	396
query14	8600	6073	6166	6073
query15	200	179	175	175
query16	7143	624	495	495
query17	952	711	567	567
query18	1959	405	313	313
query19	193	185	158	158
query20	120	159	117	117
query21	209	125	101	101
query22	4341	4419	4391	4391
query23	34071	32903	33083	32903
query24	8383	2354	2398	2354
query25	531	470	377	377
query26	1245	272	155	155
query27	2755	488	318	318
query28	4348	2406	2396	2396
query29	760	551	422	422
query30	285	219	187	187
query31	954	861	750	750
query32	70	61	64	61
query33	572	363	300	300
query34	778	837	502	502
query35	772	815	746	746
query36	969	985	879	879
query37	113	96	69	69
query38	4184	4156	4039	4039
query39	1441	1386	1421	1386
query40	209	113	106	106
query41	53	55	49	49
query42	115	100	105	100
query43	499	495	480	480
query44	1277	769	777	769
query45	170	166	157	157
query46	836	1014	617	617
query47	1752	1820	1713	1713
query48	378	407	299	299
query49	787	509	420	420
query50	695	727	410	410
query51	4218	4262	4150	4150
query52	110	105	96	96
query53	242	257	188	188
query54	494	501	426	426
query55	84	81	82	81
query56	282	275	254	254
query57	1140	1146	1062	1062
query58	245	239	237	237
query59	2503	2708	2457	2457
query60	288	282	271	271
query61	147	150	142	142
query62	799	735	693	693
query63	219	189	193	189
query64	4413	998	681	681
query65	4449	4308	4351	4308
query66	1142	403	306	306
query67	15762	15449	15487	15449
query68	7252	870	503	503
query69	488	289	261	261
query70	1204	1100	1095	1095
query71	420	295	279	279
query72	5751	3534	3734	3534
query73	744	740	355	355
query74	9096	8892	8742	8742
query75	3260	3123	2699	2699
query76	3166	1190	739	739
query77	468	434	278	278
query78	9837	10158	9339	9339
query79	1676	866	595	595
query80	662	538	450	450
query81	501	256	220	220
query82	197	127	94	94
query83	173	173	153	153
query84	294	171	71	71
query85	786	355	312	312
query86	341	303	287	287
query87	4362	4460	4472	4460
query88	2965	2271	2284	2271
query89	392	321	296	296
query90	1814	221	222	221
query91	137	142	108	108
query92	64	63	59	59
query93	1140	1042	599	599
query94	625	416	283	283
query95	402	272	266	266
query96	491	570	276	276
query97	3259	3402	3262	3262
query98	219	204	222	204
query99	1318	1372	1260	1260
Total cold run time: 269695 ms
Total hot run time: 185231 ms

@doris-robot

ClickBench: Total hot run time: 31.05 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ae559aab413b889c7187f87759ca9d1bb82309f6, data reload: false

query1	0.04	0.04	0.03
query2	0.13	0.12	0.10
query3	0.26	0.18	0.19
query4	1.60	0.19	0.19
query5	0.59	0.59	0.60
query6	1.19	0.72	0.72
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.60	0.51	0.51
query10	0.61	0.58	0.58
query11	0.16	0.11	0.10
query12	0.14	0.12	0.11
query13	0.61	0.60	0.61
query14	2.68	2.67	2.71
query15	0.92	0.84	0.84
query16	0.38	0.37	0.37
query17	1.04	1.02	1.04
query18	0.21	0.19	0.20
query19	1.91	1.86	1.91
query20	0.01	0.01	0.01
query21	15.38	0.93	0.55
query22	0.73	1.30	0.66
query23	14.82	1.38	0.64
query24	6.97	1.79	0.65
query25	0.51	0.25	0.10
query26	0.55	0.15	0.14
query27	0.05	0.05	0.05
query28	9.32	0.90	0.43
query29	12.57	3.97	3.30
query30	0.25	0.09	0.07
query31	2.82	0.58	0.38
query32	3.23	0.55	0.47
query33	2.99	3.06	3.09
query34	15.71	5.19	4.50
query35	4.57	4.58	4.53
query36	0.67	0.49	0.48
query37	0.08	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.17	0.14	0.13
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 104.75 s
Total hot run time: 31.05 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/27) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 48.88% (13091/26781)
Line Coverage 38.43% (112826/293582)
Region Coverage 37.25% (57386/154071)
Branch Coverage 32.32% (28832/89202)

// in the data to lowercase, and use the last one as the insertion value

bool _openx_json_ignore_malformed = false;
// hive : org.openx.data.jsonserde.JsonSerDe, `ignore.malformed.json` prop.
Contributor

Move the comment before the field.

@hubgeter
Contributor Author

run buildall

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.04% (1087/1309)
Line Coverage: 66.12% (18111/27390)
Region Coverage: 65.47% (8911/13611)
Branch Coverage: 55.33% (4800/8676)
Coverage Report: http://coverage.selectdb-in.cc/coverage/55578841b4f610423e761812f2e0e9c63584d741_55578841b4f610423e761812f2e0e9c63584d741_cloud/report/index.html

@doris-robot

TPC-H: Total hot run time: 34703 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 55578841b4f610423e761812f2e0e9c63584d741, data reload: false

------ Round 1 ----------------------------------
q1	24825	5130	5040	5040
q2	2063	340	188	188
q3	10337	1289	709	709
q4	10226	1032	559	559
q5	7563	2493	2396	2396
q6	194	162	133	133
q7	945	765	623	623
q8	9338	1371	1176	1176
q9	6900	5162	5125	5125
q10	6841	2333	1917	1917
q11	496	282	268	268
q12	358	351	215	215
q13	17772	3717	3124	3124
q14	231	234	220	220
q15	540	489	489	489
q16	641	607	595	595
q17	612	884	374	374
q18	7824	7235	7176	7176
q19	1730	974	594	594
q20	340	328	206	206
q21	4540	3481	2596	2596
q22	1038	1017	980	980
Total cold run time: 115354 ms
Total hot run time: 34703 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5268	5156	5190	5156
q2	240	328	234	234
q3	2161	2680	2295	2295
q4	1453	1874	1479	1479
q5	4550	4484	4394	4394
q6	217	171	131	131
q7	2026	1938	1784	1784
q8	2721	2589	2656	2589
q9	7282	7127	6988	6988
q10	3060	3375	2739	2739
q11	577	508	493	493
q12	697	743	622	622
q13	3562	3890	3281	3281
q14	296	300	293	293
q15	550	484	482	482
q16	650	687	635	635
q17	1204	1604	1382	1382
q18	8016	7758	7459	7459
q19	872	858	970	858
q20	2003	1969	1853	1853
q21	5373	5000	4891	4891
q22	1114	1074	985	985
Total cold run time: 53892 ms
Total hot run time: 51023 ms

@doris-robot

TPC-DS: Total hot run time: 194406 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 55578841b4f610423e761812f2e0e9c63584d741, data reload: false

query1	1415	1071	1048	1048
query2	6277	1916	1955	1916
query3	11044	4555	4540	4540
query4	53503	24882	23438	23438
query5	5336	572	474	474
query6	411	198	185	185
query7	5335	497	270	270
query8	334	246	233	233
query9	7196	2602	2597	2597
query10	414	317	250	250
query11	15587	14991	14994	14991
query12	181	113	103	103
query13	1254	527	413	413
query14	11055	7747	7102	7102
query15	187	206	186	186
query16	6975	670	510	510
query17	1089	742	568	568
query18	1552	434	321	321
query19	200	200	178	178
query20	134	127	122	122
query21	211	172	102	102
query22	4409	4455	4231	4231
query23	34023	33420	33314	33314
query24	5712	2441	2421	2421
query25	458	445	388	388
query26	709	280	150	150
query27	1811	499	323	323
query28	2877	2480	2495	2480
query29	576	564	440	440
query30	272	225	190	190
query31	902	878	767	767
query32	73	60	66	60
query33	438	370	313	313
query34	772	879	514	514
query35	821	846	771	771
query36	929	1013	944	944
query37	120	106	74	74
query38	4208	4406	4238	4238
query39	1536	1422	1424	1422
query40	220	117	106	106
query41	53	53	53	53
query42	131	109	107	107
query43	502	524	482	482
query44	1349	814	827	814
query45	193	194	166	166
query46	883	1042	672	672
query47	1807	1843	1764	1764
query48	371	412	328	328
query49	726	517	419	419
query50	688	757	432	432
query51	4242	4314	4264	4264
query52	105	118	98	98
query53	216	271	180	180
query54	489	522	424	424
query55	86	79	87	79
query56	278	281	280	280
query57	1210	1191	1109	1109
query58	252	242	247	242
query59	2783	2848	2888	2848
query60	283	288	273	273
query61	128	126	128	126
query62	752	737	672	672
query63	238	195	203	195
query64	1670	1019	672	672
query65	4536	4540	4431	4431
query66	707	390	294	294
query67	15744	15429	15274	15274
query68	7238	832	503	503
query69	545	353	259	259
query70	1252	1154	1111	1111
query71	488	298	267	267
query72	5761	4935	5244	4935
query73	1074	688	347	347
query74	9006	9115	8944	8944
query75	3909	3241	2724	2724
query76	4471	1199	745	745
query77	635	364	281	281
query78	10045	10214	9271	9271
query79	2594	819	584	584
query80	733	505	431	431
query81	479	259	222	222
query82	661	123	94	94
query83	270	175	153	153
query84	288	98	75	75
query85	798	360	393	360
query86	391	306	303	303
query87	4437	4598	4272	4272
query88	3183	2229	2249	2229
query89	407	307	271	271
query90	1940	217	213	213
query91	223	139	110	110
query92	74	61	61	61
query93	1301	1040	588	588
query94	668	417	296	296
query95	351	270	256	256
query96	482	563	275	275
query97	3358	3429	3321	3321
query98	221	204	203	203
query99	1454	1386	1311	1311
Total cold run time: 299715 ms
Total hot run time: 194406 ms

@doris-robot

ClickBench: Total hot run time: 31.92 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 55578841b4f610423e761812f2e0e9c63584d741, data reload: false

query1	0.03	0.03	0.03
query2	0.14	0.11	0.12
query3	0.34	0.20	0.20
query4	1.58	0.21	0.20
query5	0.60	0.62	0.62
query6	1.18	0.72	0.73
query7	0.02	0.01	0.01
query8	0.06	0.05	0.04
query9	0.63	0.52	0.52
query10	0.58	0.58	0.58
query11	0.25	0.12	0.13
query12	0.25	0.13	0.13
query13	0.64	0.63	0.62
query14	2.68	2.70	2.68
query15	1.00	0.89	0.87
query16	0.37	0.38	0.38
query17	1.05	1.05	1.03
query18	0.19	0.19	0.18
query19	1.98	2.05	1.80
query20	0.01	0.01	0.02
query21	15.36	1.00	0.68
query22	0.92	1.03	0.79
query23	14.69	1.54	0.76
query24	5.33	0.61	0.29
query25	0.17	0.09	0.09
query26	0.57	0.22	0.18
query27	0.08	0.08	0.08
query28	11.04	1.13	0.58
query29	12.53	4.25	3.47
query30	0.28	0.09	0.06
query31	2.85	0.63	0.42
query32	3.23	0.62	0.50
query33	3.06	3.07	3.29
query34	16.46	5.13	4.50
query35	4.44	4.46	4.51
query36	0.64	0.51	0.49
query37	0.20	0.18	0.17
query38	0.17	0.15	0.16
query39	0.05	0.04	0.04
query40	0.19	0.16	0.15
query41	0.11	0.06	0.05
query42	0.07	0.05	0.06
query43	0.06	0.05	0.04
Total cold run time: 106.08 s
Total hot run time: 31.92 s

@hubgeter hubgeter marked this pull request as ready for review March 23, 2025 04:54
}

public boolean canReadHiveJsonInOneColumn() {
    return ConnectContext.get().getSessionVariable().isReadHiveJsonInOneColumn()
Contributor

There is already a sessionVariable instance in HiveScanNode. Use it instead of ConnectContext.get().getSessionVariable().
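
One way to read this suggestion is sketched below: the scan node passes the session variable it already holds to the table, rather than the table looking it up through a thread-local context. All class and method names here (`SessionVariableSketch`, `SessionVars`, `TableSketch`, `ScanNodeSketch`, `useOneColumnMode`) are hypothetical stand-ins, and the sketch does not show how the merged change resolved the comment.

```java
// Hypothetical sketch, not Doris code.
final class SessionVariableSketch {

    /** Stand-in for the session variables held by a query. */
    static final class SessionVars {
        final boolean readHiveJsonInOneColumn;

        SessionVars(boolean readHiveJsonInOneColumn) {
            this.readHiveJsonInOneColumn = readHiveJsonInOneColumn;
        }
    }

    /** Stand-in for the table object: it receives the variables as an argument. */
    static final class TableSketch {
        boolean canReadHiveJsonInOneColumn(SessionVars sessionVariable) {
            return sessionVariable.readHiveJsonInOneColumn;
        }
    }

    /** Stand-in for the scan node: it owns a SessionVars instance and reuses it. */
    static final class ScanNodeSketch {
        private final SessionVars sessionVariable;
        private final TableSketch table;

        ScanNodeSketch(SessionVars sessionVariable, TableSketch table) {
            this.sessionVariable = sessionVariable;
            this.table = table;
        }

        boolean useOneColumnMode() {
            // No global or thread-local lookup: the node's own instance is passed down.
            return table.canReadHiveJsonInOneColumn(sessionVariable);
        }
    }

    public static void main(String[] args) {
        ScanNodeSketch node = new ScanNodeSketch(new SessionVars(true), new TableSketch());
        System.out.println(node.useOneColumnMode());
    }
}
```

Passing the value explicitly also makes the method easy to exercise in a unit test, because no connection context has to be set up first.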

        || serDeLib.equals(HiveMetaStoreClientHelper.LEGACY_HIVE_JSON_SERDE)) {
    type = TFileFormatType.FORMAT_JSON;
} else if (serDeLib.equals(HiveMetaStoreClientHelper.OPENX_JSON_SERDE)) {
    if (hmsTable.canReadHiveJsonInOneColumn()) {
Contributor

I think we should return an error if READ_HIVE_JSON_IN_ONE_COLUMN is true but the first column is not a string column.
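
A minimal sketch of the check proposed in this comment follows. The class `OneColumnValidationSketch`, the method `check`, the error text, and the use of `IllegalStateException` are all assumptions made for illustration; the thread does not show whether or how the merged code performs this validation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Doris code.
final class OneColumnValidationSketch {

    /**
     * Fails fast when whole-line mode is requested but the first column
     * cannot hold the raw JSON text.
     *
     * @param columnTypes type names of the table columns, in table order
     */
    static void check(boolean readHiveJsonInOneColumn, List<String> columnTypes) {
        if (!readHiveJsonInOneColumn) {
            return; // nothing to validate in the normal field-parsing mode
        }
        boolean firstIsString = !columnTypes.isEmpty()
                && "string".equalsIgnoreCase(columnTypes.get(0));
        if (!firstIsString) {
            throw new IllegalStateException(
                    "read_hive_json_in_one_column is true, but the first column is not of type string");
        }
    }

    public static void main(String[] args) {
        check(true, Arrays.asList("string", "int")); // passes silently
        try {
            check(true, Arrays.asList("int", "string"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Raising an error at planning time gives the user a direct message, instead of silently falling back to field parsing when the table schema does not fit the whole-line mode.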

Contributor

@morningman morningman left a comment

LGTM

@hubgeter
Contributor Author

run buildall

@github-actions github-actions bot added the approved label (Indicates a PR has been approved by one committer.) Mar 24, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@doris-robot

TeamCity cloud ut coverage result:
Function Coverage: 83.04% (1087/1309)
Line Coverage: 66.09% (18102/27390)
Region Coverage: 65.46% (8910/13611)
Branch Coverage: 55.36% (4803/8676)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4d9380741bdbbbb8e9b6e9a01bd90bd7629ac19d_4d9380741bdbbbb8e9b6e9a01bd90bd7629ac19d_cloud/report/index.html

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/27) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 50.21% (13438/26761)
Line Coverage 39.66% (116397/293501)
Region Coverage 38.38% (59177/154178)
Branch Coverage 33.51% (29886/89192)

Contributor

@morningman morningman left a comment

LGTM

@morningman morningman merged commit 30d14b6 into apache:master Apr 9, 2025
29 of 33 checks passed
hello-stephen added a commit that referenced this pull request Apr 10, 2025
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…JsonSerDe (apache#49209)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
Labels

approved: Indicates a PR has been approved by one committer.
reviewed
usercase: Important user case type label

7 participants