@suxiaogang223 suxiaogang223 commented Jan 6, 2026

What problem does this PR solve?

Problem Summary:
When a query reads struct fields from an Iceberg table after schema evolution, and every queried struct field is missing from an old Parquet file, the read fails with the error:

File column name 'removed' not found in struct children

Root Cause:
When all queried struct sub-fields are missing in the old Parquet file (e.g., newly added fields after schema evolution), the code needs to find a reference column from the file schema to get repetition level (RL) and definition level (DL) information. However, if the reference column (e.g., removed) was dropped from the table schema, calling root_node->get_children_node_by_file_column_name() will fail because the column doesn't exist in root_node.

Scenario:

  1. Create a table with a struct containing: removed, rename, keep, drop_and_add
  2. Insert data (this creates a Parquet file with these fields)
  3. Perform schema evolution: DROP a_struct.removed, DROP then ADD a_struct.drop_and_add (the re-added field gets a new field ID), ADD a_struct.added
  4. Query struct_element(a_struct, 'drop_and_add') or struct_element(a_struct, 'added') on the old file
  5. The query fails because:
    • All queried fields (drop_and_add, added) are missing in the old file
    • The code tries to use removed as the reference column (it exists in the file but was dropped from the table schema)
    • Accessing removed via root_node fails because it doesn't exist in the table schema
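
The failure path in this scenario can be illustrated with a small toy model. This is a Python sketch, not the Doris C++ code: `StructNode`, `pick_reference_column`, and `lookup_reference_node` are hypothetical stand-ins, and the rule "take the first file column as the reference" is an assumption made for illustration. Only the field names and the error message text come from the description above.

```python
class StructNode:
    """Toy stand-in for a table-schema node: maps file column names to child nodes."""

    def __init__(self, children):
        self._children = dict(children)

    def get_child_by_file_column_name(self, name):
        if name not in self._children:
            raise KeyError(f"File column name '{name}' not found in struct children")
        return self._children[name]


# Sub-fields physically present in the old Parquet file.
file_fields = ["removed", "rename", "keep", "drop_and_add"]

# Table schema after evolution, keyed by the file column each field maps to.
# `removed` was dropped; the re-added `drop_and_add` and the new `added` have
# no matching column in the old file, so they do not appear here.
root_node = StructNode({"rename": "node(rename)", "keep": "node(keep)"})


def pick_reference_column(fields):
    # Assumption: any file column can supply RL/DL, and the first one is chosen.
    return fields[0]


def lookup_reference_node(node, fields):
    # Looking the reference column up in the table schema is what breaks.
    return node.get_child_by_file_column_name(pick_reference_column(fields))


try:
    lookup_reference_node(root_node, file_fields)
    error = None
except KeyError as exc:
    error = exc.args[0]

print(error)
```

In this model the lookup fails only because the chosen reference column happens to be one that the table schema no longer contains. Had `rename` or `keep` been chosen, the lookup would have succeeded.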

Solution:

Use TableSchemaChangeHelper::ConstNode::get_instance() instead of looking up from root_node for the reference column. Since the reference column is only used to get RL/DL information (not for schema mapping), using ConstNode is safe and avoids the issue where the reference column doesn't exist in root_node.
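
The idea behind the fix can be sketched in the same toy style. This is a hedged Python model under stated assumptions: the real `TableSchemaChangeHelper::ConstNode` is a C++ class whose implementation is not shown in this thread, so the `ConstNode` below only mimics the `get_instance()` call named above, and `reference_node_before` / `reference_node_after` are hypothetical names.

```python
class ConstNode:
    """Toy stand-in for a shared placeholder node that carries no schema mapping."""

    _instance = None

    @classmethod
    def get_instance(cls):
        # Lazily create and reuse a single shared instance.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


def reference_node_before(root_children, reference_column):
    # Old behaviour (modelled): look the reference column up in the table schema.
    if reference_column not in root_children:
        raise KeyError(
            f"File column name '{reference_column}' not found in struct children"
        )
    return root_children[reference_column]


def reference_node_after(root_children, reference_column):
    # New behaviour (modelled): the reference column only supplies RL/DL,
    # so the table schema is not consulted at all.
    return ConstNode.get_instance()


# Table schema after evolution: `removed` is no longer present.
root_children = {"rename": "node(rename)", "keep": "node(keep)"}

try:
    reference_node_before(root_children, "removed")
    old_ok = True
except KeyError:
    old_ok = False

node = reference_node_after(root_children, "removed")
print(old_ok, node is ConstNode.get_instance())
```

The design point the sketch tries to capture is that the new path never depends on which file column was picked as the reference, so a column dropped from the table schema can still serve that role.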

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 suxiaogang223 changed the title from "fix" to "[fix](parquet) Fix struct column reading error when all queried fields are missing after schema evolution" Jan 6, 2026
@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32200 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 43c14fccc88ddf91c76c3401bf92fbfd8afa95d4, data reload: false

------ Round 1 ----------------------------------
q1	17588	4210	4067	4067
q2	2076	370	254	254
q3	10101	1269	755	755
q4	10227	890	332	332
q5	7521	2105	1827	1827
q6	220	174	137	137
q7	951	803	683	683
q8	9288	1387	1214	1214
q9	4870	4614	4526	4526
q10	6747	1835	1407	1407
q11	527	282	287	282
q12	688	718	601	601
q13	17762	3824	3172	3172
q14	289	289	286	286
q15	567	507	496	496
q16	683	692	643	643
q17	661	781	577	577
q18	6425	6454	6877	6454
q19	1300	993	680	680
q20	427	387	264	264
q21	3292	2530	2601	2530
q22	1152	1087	1013	1013
Total cold run time: 103362 ms
Total hot run time: 32200 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4412	4268	4264	4264
q2	343	440	339	339
q3	2268	2822	2372	2372
q4	1437	1839	1498	1498
q5	4570	4269	4350	4269
q6	219	169	124	124
q7	1972	1931	1733	1733
q8	2528	2613	2375	2375
q9	7090	7172	7036	7036
q10	2446	2722	2215	2215
q11	584	468	454	454
q12	735	746	597	597
q13	3635	4053	3246	3246
q14	265	292	256	256
q15	527	481	481	481
q16	652	642	604	604
q17	1077	1224	1224	1224
q18	7674	7371	7231	7231
q19	826	797	818	797
q20	1888	1947	1804	1804
q21	4571	4297	4074	4074
q22	1093	1045	962	962
Total cold run time: 50812 ms
Total hot run time: 47955 ms

@doris-robot

TPC-DS: Total hot run time: 172719 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 43c14fccc88ddf91c76c3401bf92fbfd8afa95d4, data reload: false

query5	5868	593	431	431
query6	332	224	207	207
query7	4209	460	263	263
query8	328	249	230	230
query9	8781	2665	2671	2665
query10	534	366	325	325
query11	15318	15263	14907	14907
query12	183	116	112	112
query13	1243	506	379	379
query14	7736	2983	2746	2746
query14_1	2663	2629	2637	2629
query15	260	196	179	179
query16	1013	467	458	458
query17	1303	681	599	599
query18	2698	422	329	329
query19	291	217	188	188
query20	124	117	113	113
query21	215	141	114	114
query22	3865	4108	3928	3928
query23	16040	15534	15231	15231
query23_1	15484	15339	15428	15339
query24	7044	1586	1170	1170
query24_1	1208	1160	1207	1160
query25	535	454	388	388
query26	1218	266	154	154
query27	2709	467	291	291
query28	4559	2148	2135	2135
query29	800	570	431	431
query30	313	238	206	206
query31	764	630	548	548
query32	76	69	65	65
query33	526	333	281	281
query34	866	881	517	517
query35	746	764	717	717
query36	839	855	853	853
query37	128	92	78	78
query38	2697	2776	2685	2685
query39	758	755	740	740
query39_1	722	691	698	691
query40	212	133	116	116
query41	64	61	65	61
query42	104	103	102	102
query43	454	435	403	403
query44	1318	720	715	715
query45	186	185	178	178
query46	852	957	607	607
query47	1414	1477	1342	1342
query48	308	319	245	245
query49	603	411	322	322
query50	627	276	211	211
query51	3795	3815	3819	3815
query52	103	107	93	93
query53	292	322	272	272
query54	276	256	248	248
query55	80	75	75	75
query56	290	283	288	283
query57	1017	968	883	883
query58	263	270	262	262
query59	2109	2217	2175	2175
query60	318	314	279	279
query61	166	161	156	156
query62	399	346	340	340
query63	301	266	274	266
query64	5075	1428	1148	1148
query65	3804	3688	3638	3638
query66	1360	434	325	325
query67	15252	15920	14799	14799
query68	4021	1036	730	730
query69	499	364	323	323
query70	1059	931	857	857
query71	377	318	283	283
query72	6295	3467	3438	3438
query73	763	720	305	305
query74	8791	8849	8631	8631
query75	2824	2801	2450	2450
query76	3740	1067	639	639
query77	528	372	279	279
query78	9809	9802	9106	9106
query79	1679	924	588	588
query80	1617	547	479	479
query81	557	264	231	231
query82	410	144	110	110
query83	380	255	230	230
query84	270	125	118	118
query85	926	533	464	464
query86	486	309	284	284
query87	2903	2861	2722	2722
query88	3163	2236	2203	2203
query89	388	352	318	318
query90	1926	157	155	155
query91	169	170	142	142
query92	77	73	64	64
query93	1066	898	530	530
query94	655	320	284	284
query95	545	381	294	294
query96	574	481	203	203
query97	2303	2382	2279	2279
query98	235	209	193	193
query99	566	603	523	523
Total cold run time: 254789 ms
Total hot run time: 172719 ms

@doris-robot

ClickBench: Total hot run time: 26.98 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 43c14fccc88ddf91c76c3401bf92fbfd8afa95d4, data reload: false

query1	0.06	0.05	0.04
query2	0.10	0.04	0.04
query3	0.26	0.09	0.09
query4	1.61	0.12	0.11
query5	0.26	0.26	0.26
query6	1.15	0.64	0.65
query7	0.03	0.03	0.03
query8	0.06	0.04	0.04
query9	0.57	0.51	0.49
query10	0.57	0.55	0.56
query11	0.15	0.10	0.11
query12	0.15	0.11	0.10
query13	0.61	0.60	0.60
query14	0.95	0.95	0.96
query15	0.79	0.78	0.78
query16	0.41	0.40	0.41
query17	0.99	1.06	1.01
query18	0.24	0.22	0.21
query19	1.94	1.90	1.76
query20	0.02	0.02	0.01
query21	15.45	0.25	0.14
query22	5.44	0.05	0.05
query23	15.79	0.30	0.10
query24	1.78	0.34	0.59
query25	0.10	0.06	0.06
query26	0.14	0.14	0.13
query27	0.10	0.05	0.05
query28	4.36	1.07	0.88
query29	12.61	3.98	3.16
query30	0.28	0.14	0.12
query31	2.83	0.68	0.40
query32	3.23	0.58	0.46
query33	3.05	3.10	3.08
query34	16.83	5.13	4.49
query35	4.48	4.54	4.47
query36	0.67	0.49	0.49
query37	0.10	0.07	0.07
query38	0.06	0.04	0.04
query39	0.05	0.03	0.03
query40	0.16	0.14	0.13
query41	0.08	0.04	0.03
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 98.59 s
Total hot run time: 26.98 s

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.22% (18967/35637)
Line Coverage 39.23% (176067/448799)
Region Coverage 33.75% (136213/403632)
Branch Coverage 34.71% (58770/169312)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (1/1) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 66.36% (23118/34838)
Line Coverage 53.22% (238218/447581)
Region Coverage 48.56% (197989/407752)
Branch Coverage 49.51% (84102/169861)

This commit introduces tests for handling mixed case field names in struct schema evolution. It includes the creation of test tables with mixed case fields, schema evolution operations, and corresponding data insertions for both Parquet and ORC formats. The tests verify that case sensitivity is correctly managed during schema evolution and querying operations.
@suxiaogang223
Contributor Author

run external

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 31406 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit bd31370edb4e0d78719d98000f2613682e7a021f, data reload: false

------ Round 1 ----------------------------------
q1	17663	4204	4066	4066
q2	2048	359	240	240
q3	10166	1265	707	707
q4	10206	812	315	315
q5	7523	2052	1794	1794
q6	183	173	141	141
q7	931	784	662	662
q8	9264	1384	1122	1122
q9	4934	4564	4506	4506
q10	6795	1795	1423	1423
q11	522	297	284	284
q12	687	738	605	605
q13	17783	3757	3119	3119
q14	304	294	267	267
q15	587	503	504	503
q16	693	667	633	633
q17	681	728	594	594
q18	6615	6348	6464	6348
q19	1088	955	587	587
q20	406	365	251	251
q21	2967	2377	2284	2284
q22	1028	979	955	955
Total cold run time: 103074 ms
Total hot run time: 31406 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4071	4049	4016	4016
q2	317	390	314	314
q3	2097	2608	2180	2180
q4	1294	1758	1334	1334
q5	4085	3940	4020	3940
q6	213	173	132	132
q7	1848	1824	1705	1705
q8	2758	2491	2441	2441
q9	7384	7230	7369	7230
q10	2608	2808	2270	2270
q11	558	506	478	478
q12	747	802	654	654
q13	3633	4199	3340	3340
q14	295	318	283	283
q15	643	514	502	502
q16	824	696	634	634
q17	1163	1296	1401	1296
q18	7932	7820	7566	7566
q19	901	966	879	879
q20	2030	2016	1982	1982
q21	4891	4630	4281	4281
q22	1173	1076	970	970
Total cold run time: 51465 ms
Total hot run time: 48427 ms

@doris-robot

TPC-DS: Total hot run time: 172802 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit bd31370edb4e0d78719d98000f2613682e7a021f, data reload: false

query5	4390	573	432	432
query6	332	250	214	214
query7	4216	476	267	267
query8	336	246	238	238
query9	8773	2662	2656	2656
query10	512	362	332	332
query11	15329	15262	14941	14941
query12	172	116	116	116
query13	1269	506	376	376
query14	6162	2970	2754	2754
query14_1	2692	2689	2624	2624
query15	200	202	175	175
query16	975	479	451	451
query17	1115	677	583	583
query18	2457	441	343	343
query19	230	224	201	201
query20	131	118	120	118
query21	218	141	122	122
query22	3781	3964	3854	3854
query23	15945	15651	15370	15370
query23_1	15603	15466	15417	15417
query24	7467	1546	1169	1169
query24_1	1220	1178	1209	1178
query25	563	510	437	437
query26	1246	287	154	154
query27	2741	446	293	293
query28	4543	2138	2133	2133
query29	809	568	513	513
query30	300	238	209	209
query31	779	629	540	540
query32	79	67	66	66
query33	526	341	281	281
query34	878	879	512	512
query35	723	761	678	678
query36	868	875	802	802
query37	129	89	86	86
query38	2785	2720	2642	2642
query39	776	763	731	731
query39_1	700	720	718	718
query40	225	132	114	114
query41	64	61	62	61
query42	105	99	100	99
query43	450	467	416	416
query44	1311	715	718	715
query45	188	182	174	174
query46	848	958	587	587
query47	1403	1492	1408	1408
query48	309	344	242	242
query49	607	414	363	363
query50	626	272	195	195
query51	3739	3789	3726	3726
query52	102	105	95	95
query53	286	332	271	271
query54	280	253	250	250
query55	74	77	74	74
query56	283	275	286	275
query57	1043	988	957	957
query58	267	255	244	244
query59	2068	2224	2183	2183
query60	310	322	290	290
query61	160	155	156	155
query62	391	362	314	314
query63	301	267	277	267
query64	4922	1302	986	986
query65	3786	3788	3661	3661
query66	1429	420	298	298
query67	15360	14845	14912	14845
query68	4943	1001	725	725
query69	503	340	307	307
query70	1047	930	909	909
query71	359	297	278	278
query72	6127	3521	3448	3448
query73	765	727	313	313
query74	8856	8763	8691	8691
query75	2815	2783	2495	2495
query76	3800	1044	647	647
query77	523	362	279	279
query78	9620	9718	9185	9185
query79	1586	822	593	593
query80	636	550	462	462
query81	507	261	224	224
query82	208	145	109	109
query83	259	249	234	234
query84	257	126	102	102
query85	926	518	450	450
query86	387	325	318	318
query87	2877	2888	2752	2752
query88	3144	2243	2243	2243
query89	390	354	321	321
query90	2194	152	148	148
query91	170	160	148	148
query92	82	67	58	58
query93	1454	919	532	532
query94	579	323	253	253
query95	569	319	305	305
query96	597	471	208	208
query97	2361	2379	2285	2285
query98	232	200	212	200
query99	582	588	516	516
Total cold run time: 251135 ms
Total hot run time: 172802 ms

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved (indicates a PR has been approved by one committer) and reviewed labels Jan 9, 2026
@github-actions
Contributor

github-actions bot commented Jan 9, 2026

PR approved by anyone and no changes requested.

Contributor

@kaka11chen kaka11chen left a comment

LGTM

@morningman
Contributor

run check_coverage

@yiguolei yiguolei merged commit 99b971c into apache:master Jan 13, 2026
29 of 31 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 13, 2026
…s are missing after schema evolution (#59586)

### What problem does this PR solve?
- Related PR: #57204

yiguolei pushed a commit that referenced this pull request Jan 14, 2026
…ueried fields are missing after schema evolution #59586 (#59839)

Cherry-picked from #59586

Co-authored-by: Socrates <suyiteng@selectdb.com>
Labels: approved, dev/4.0.3-merged, reviewed