
[feat](iceberg) change OPTIMIZE TABLE to ALTER TABLE EXECUTE syntax #56638

Merged
morningman merged 2 commits into apache:master from suxiaogang223:alter_table_execute on Oct 10, 2025


@suxiaogang223
Contributor

@suxiaogang223 suxiaogang223 commented Sep 29, 2025

What problem does this PR solve?

Issue: #56002
Related: #55679

This PR replaces the existing OPTIMIZE TABLE syntax with the more standard ALTER TABLE EXECUTE action syntax. The change provides a unified interface for table actions across the different table engines in Apache Doris.

New ALTER TABLE EXECUTE Syntax

ALTER TABLE [catalog.]database.table 
  EXECUTE action("key1" = "value1", "key2" = "value2", ...) 
  [PARTITION (partition_list)]
  [WHERE condition]
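
The syntax above can be assembled programmatically. The following is a minimal Python sketch, not part of this PR: `build_execute_statement` is a hypothetical helper that only formats a string, does no escaping, and should not be fed untrusted input.

```python
def build_execute_statement(table, action, properties=None, partitions=None, where=None):
    """Render an ALTER TABLE ... EXECUTE statement (hypothetical helper, no escaping)."""
    # Properties are rendered as "key" = "value" pairs, as in the syntax summary.
    props = ", ".join(f'"{key}" = "{value}"' for key, value in (properties or {}).items())
    lines = [f"ALTER TABLE {table}", f"  EXECUTE {action}({props})"]
    if partitions:
        partition_list = ", ".join(partitions)
        lines.append(f"  PARTITION ({partition_list})")
    if where:
        lines.append(f"  WHERE {where}")
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    print(build_execute_statement(
        "iceberg_catalog.db.tbl",
        "rewrite_data_files",
        {"min-input-files": "3"},
        where="id > 10",
    ))
```

The optional PARTITION and WHERE clauses are appended only when supplied, mirroring the bracketed parts of the grammar.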

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Sep 29, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 190956 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 72c0e6fb067ead2a0dd3c6f64527ac3a39a957a7, data reload: false

query1	1071	442	412	412
query2	6572	1674	1683	1674
query3	6769	225	227	225
query4	26410	23508	23471	23471
query5	5469	659	487	487
query6	344	251	230	230
query7	4660	494	301	301
query8	317	272	263	263
query9	8745	2582	2571	2571
query10	560	344	297	297
query11	15548	15169	14895	14895
query12	191	118	120	118
query13	1702	570	454	454
query14	12535	9460	9317	9317
query15	256	196	181	181
query16	7845	693	495	495
query17	1621	861	682	682
query18	2154	463	383	383
query19	297	236	206	206
query20	146	142	134	134
query21	231	138	117	117
query22	5004	4889	4628	4628
query23	35014	33936	34001	33936
query24	8268	2514	2453	2453
query25	615	536	452	452
query26	1226	278	181	181
query27	3737	524	375	375
query28	4362	2207	2204	2204
query29	796	611	537	537
query30	318	255	214	214
query31	925	851	790	790
query32	84	73	69	69
query33	584	394	345	345
query34	827	905	577	577
query35	859	873	756	756
query36	1005	1032	972	972
query37	139	113	88	88
query38	3645	3585	3536	3536
query39	1483	1407	1415	1407
query40	219	131	114	114
query41	73	61	60	60
query42	120	115	113	113
query43	492	514	465	465
query44	1332	852	837	837
query45	180	192	176	176
query46	839	996	638	638
query47	1788	1815	1733	1733
query48	397	430	324	324
query49	773	507	422	422
query50	644	694	413	413
query51	3962	3872	3875	3872
query52	113	110	102	102
query53	230	264	240	240
query54	604	586	536	536
query55	85	90	87	87
query56	326	313	311	311
query57	1236	1184	1150	1150
query58	288	280	280	280
query59	2595	2634	2507	2507
query60	342	346	332	332
query61	160	155	153	153
query62	781	731	681	681
query63	231	192	186	186
query64	4357	1147	833	833
query65	4029	3983	3959	3959
query66	1045	426	321	321
query67	15649	15440	15202	15202
query68	9237	958	601	601
query69	529	326	296	296
query70	1386	1319	1243	1243
query71	531	345	331	331
query72	5910	4967	4826	4826
query73	724	575	364	364
query74	8879	9212	8870	8870
query75	4572	3361	2841	2841
query76	3760	1308	742	742
query77	850	484	323	323
query78	9580	9818	8965	8965
query79	2003	849	579	579
query80	688	562	491	491
query81	496	262	236	236
query82	455	164	135	135
query83	299	270	261	261
query84	309	122	89	89
query85	866	473	428	428
query86	350	328	326	326
query87	3848	3715	3662	3662
query88	2849	2243	2240	2240
query89	408	328	278	278
query90	2042	224	210	210
query91	163	161	132	132
query92	83	72	66	66
query93	1163	997	639	639
query94	692	470	334	334
query95	408	319	307	307
query96	493	557	282	282
query97	2956	2973	2871	2871
query98	240	219	216	216
query99	1447	1418	1292	1292
Total cold run time: 283440 ms
Total hot run time: 190956 ms

@doris-robot

ClickBench: Total hot run time: 30.46 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 72c0e6fb067ead2a0dd3c6f64527ac3a39a957a7, data reload: false

query1	0.06	0.04	0.04
query2	0.09	0.06	0.07
query3	0.26	0.09	0.08
query4	1.61	0.12	0.12
query5	0.28	0.27	0.26
query6	1.19	0.65	0.65
query7	0.03	0.02	0.03
query8	0.05	0.04	0.04
query9	0.63	0.53	0.52
query10	0.59	0.58	0.61
query11	0.16	0.12	0.12
query12	0.16	0.12	0.12
query13	0.63	0.64	0.62
query14	1.02	1.05	1.03
query15	0.88	0.85	0.86
query16	0.41	0.39	0.40
query17	1.05	1.05	1.06
query18	0.22	0.20	0.19
query19	1.90	1.78	1.91
query20	0.02	0.02	0.02
query21	15.43	0.97	0.59
query22	0.77	1.14	0.69
query23	14.96	1.37	0.61
query24	7.18	1.33	0.82
query25	0.51	0.12	0.17
query26	0.59	0.16	0.15
query27	0.07	0.05	0.05
query28	9.50	1.36	0.94
query29	12.56	3.99	3.26
query30	0.28	0.14	0.12
query31	2.85	0.60	0.40
query32	3.24	0.56	0.47
query33	3.10	3.07	3.12
query34	16.18	5.50	4.82
query35	4.89	4.95	4.90
query36	0.71	0.54	0.51
query37	0.11	0.08	0.07
query38	0.07	0.05	0.05
query39	0.03	0.04	0.03
query40	0.17	0.17	0.15
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 104.61 s
Total hot run time: 30.46 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 66.67% (6/9) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/9) 🎉
Increment coverage report
Complete coverage report

@suxiaogang223
Contributor Author

run external

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 190927 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 68508958e67214d030962bd9399235832ee2e991, data reload: false

query1	1073	443	413	413
query2	6579	1716	1714	1714
query3	6748	222	221	221
query4	26387	23924	23533	23533
query5	5043	673	476	476
query6	352	249	237	237
query7	4657	518	323	323
query8	332	271	249	249
query9	8753	2590	2602	2590
query10	493	355	288	288
query11	15778	15183	14869	14869
query12	213	132	141	132
query13	1677	561	426	426
query14	11994	9353	9355	9353
query15	216	194	176	176
query16	7668	707	557	557
query17	1580	749	609	609
query18	2135	480	362	362
query19	231	223	214	214
query20	133	130	127	127
query21	215	131	140	131
query22	4880	4614	4567	4567
query23	34852	34389	33960	33960
query24	8362	2522	2531	2522
query25	618	564	475	475
query26	1277	275	169	169
query27	3058	524	383	383
query28	4393	2261	2188	2188
query29	808	640	557	557
query30	305	248	219	219
query31	958	864	833	833
query32	91	70	71	70
query33	608	404	346	346
query34	842	885	522	522
query35	800	849	744	744
query36	979	1022	918	918
query37	118	110	85	85
query38	3602	3538	3509	3509
query39	1469	1485	1392	1392
query40	223	127	117	117
query41	61	63	58	58
query42	127	112	108	108
query43	476	495	450	450
query44	1367	831	820	820
query45	186	178	170	170
query46	852	1013	654	654
query47	1744	1795	1767	1767
query48	395	431	317	317
query49	772	521	411	411
query50	681	719	403	403
query51	3890	3941	3902	3902
query52	117	113	104	104
query53	246	272	204	204
query54	604	606	530	530
query55	104	82	88	82
query56	356	321	310	310
query57	1208	1201	1129	1129
query58	286	279	291	279
query59	2561	2617	2642	2617
query60	358	364	333	333
query61	204	188	183	183
query62	824	743	678	678
query63	242	203	198	198
query64	4493	1154	809	809
query65	4105	3947	3981	3947
query66	1086	434	357	357
query67	15435	15365	15193	15193
query68	8002	894	592	592
query69	494	327	291	291
query70	1368	1345	1265	1265
query71	462	344	318	318
query72	5744	5014	4939	4939
query73	656	617	370	370
query74	8855	9108	8723	8723
query75	3916	3365	2806	2806
query76	3535	1157	731	731
query77	832	443	319	319
query78	9551	9706	8850	8850
query79	2515	832	597	597
query80	680	557	493	493
query81	502	266	227	227
query82	228	162	138	138
query83	278	279	247	247
query84	262	123	103	103
query85	891	479	419	419
query86	386	301	297	297
query87	3789	3785	3693	3693
query88	3922	2290	2267	2267
query89	400	325	296	296
query90	2022	220	223	220
query91	162	170	131	131
query92	90	67	64	64
query93	2194	975	648	648
query94	662	483	330	330
query95	408	330	322	322
query96	485	581	284	284
query97	2919	3004	2869	2869
query98	246	210	212	210
query99	1370	1438	1279	1279
Total cold run time: 281106 ms
Total hot run time: 190927 ms

@doris-robot

ClickBench: Total hot run time: 30.62 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 68508958e67214d030962bd9399235832ee2e991, data reload: false

query1	0.05	0.05	0.05
query2	0.10	0.05	0.05
query3	0.26	0.09	0.08
query4	1.60	0.12	0.12
query5	0.29	0.26	0.26
query6	1.16	0.67	0.64
query7	0.04	0.03	0.02
query8	0.06	0.04	0.05
query9	0.63	0.55	0.52
query10	0.61	0.58	0.58
query11	0.16	0.12	0.11
query12	0.16	0.12	0.12
query13	0.64	0.64	0.63
query14	1.02	1.02	1.03
query15	0.87	0.87	0.85
query16	0.43	0.40	0.39
query17	1.07	1.07	1.06
query18	0.22	0.19	0.20
query19	1.93	1.86	1.90
query20	0.01	0.02	0.02
query21	15.44	0.93	0.59
query22	0.76	0.97	0.75
query23	15.03	1.39	0.63
query24	6.84	0.90	1.36
query25	0.41	0.10	0.21
query26	0.66	0.16	0.14
query27	0.08	0.05	0.05
query28	9.77	1.39	0.94
query29	12.66	3.95	3.28
query30	0.28	0.14	0.11
query31	2.82	0.61	0.39
query32	3.25	0.56	0.48
query33	3.15	3.10	3.04
query34	16.19	5.51	4.86
query35	4.99	4.88	4.95
query36	0.72	0.53	0.50
query37	0.10	0.08	0.08
query38	0.07	0.05	0.05
query39	0.04	0.02	0.03
query40	0.18	0.15	0.15
query41	0.08	0.04	0.02
query42	0.04	0.03	0.02
query43	0.05	0.03	0.03
Total cold run time: 104.92 s
Total hot run time: 30.62 s

morningman previously approved these changes Sep 30, 2025
@github-actions github-actions bot added the "approved" label ("Indicates a PR has been approved by one committer.") on Sep 30, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/9) 🎉
Increment coverage report
Complete coverage report

@suxiaogang223
Contributor Author

run external

@suxiaogang223
Contributor Author

run fe-ut

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 66.67% (6/9) 🎉
Increment coverage report
Complete coverage report

fix build

add partition

fix regression-test

fix

fix
@github-actions github-actions bot removed the "approved" label ("Indicates a PR has been approved by one committer.") on Oct 9, 2025
@suxiaogang223
Contributor Author

run buildall

@github-actions
Contributor

github-actions bot commented Oct 9, 2025

PR approved by at least one committer and no changes requested.

@doris-robot

TPC-DS: Total hot run time: 189086 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ff08d2e6bb766df1a7cea027cd32bb5ebe990fe5, data reload: false

query1	1076	451	418	418
query2	6581	1705	1705	1705
query3	6758	222	216	216
query4	26603	23917	23521	23521
query5	5874	652	491	491
query6	329	249	225	225
query7	4673	487	300	300
query8	309	251	248	248
query9	8731	2588	2566	2566
query10	522	352	304	304
query11	15252	15203	14778	14778
query12	181	114	121	114
query13	1675	564	425	425
query14	12354	9277	9308	9277
query15	239	191	182	182
query16	7749	670	466	466
query17	1578	760	745	745
query18	2074	456	381	381
query19	286	236	197	197
query20	146	140	135	135
query21	232	146	130	130
query22	4786	4719	4630	4630
query23	34995	34174	33914	33914
query24	8606	2450	2539	2450
query25	548	531	468	468
query26	1843	279	162	162
query27	2824	528	380	380
query28	4552	2280	2250	2250
query29	807	646	519	519
query30	342	242	207	207
query31	926	882	839	839
query32	79	71	68	68
query33	621	384	342	342
query34	832	889	543	543
query35	817	917	795	795
query36	1008	1053	945	945
query37	127	123	93	93
query38	3643	3726	3661	3661
query39	1584	1484	1413	1413
query40	222	124	118	118
query41	65	60	57	57
query42	126	114	112	112
query43	476	504	474	474
query44	1326	846	835	835
query45	187	193	179	179
query46	827	984	650	650
query47	1738	1815	1769	1769
query48	393	447	323	323
query49	803	502	409	409
query50	630	676	420	420
query51	3870	3958	3926	3926
query52	107	111	101	101
query53	241	266	194	194
query54	598	581	533	533
query55	94	81	89	81
query56	331	312	311	311
query57	1169	1189	1131	1131
query58	283	276	280	276
query59	2495	2579	2574	2574
query60	355	335	333	333
query61	154	166	153	153
query62	800	729	650	650
query63	234	202	202	202
query64	4347	1134	881	881
query65	4062	3957	3977	3957
query66	1035	451	400	400
query67	15566	15538	15440	15440
query68	7328	942	594	594
query69	500	335	280	280
query70	1438	1338	1337	1337
query71	504	361	327	327
query72	5777	4934	2606	2606
query73	563	566	363	363
query74	9169	9122	8618	8618
query75	3952	3321	2824	2824
query76	3203	1129	729	729
query77	794	405	323	323
query78	9609	9714	8889	8889
query79	2952	851	590	590
query80	694	550	490	490
query81	538	258	232	232
query82	477	164	130	130
query83	294	269	246	246
query84	307	117	94	94
query85	869	454	446	446
query86	357	321	317	317
query87	3800	3693	3721	3693
query88	3320	2210	2253	2210
query89	407	331	312	312
query90	2076	215	213	213
query91	167	168	140	140
query92	78	73	68	68
query93	1761	966	642	642
query94	689	440	339	339
query95	407	323	311	311
query96	489	647	278	278
query97	2938	3009	2939	2939
query98	240	214	214	214
query99	1456	1399	1300	1300
Total cold run time: 281849 ms
Total hot run time: 189086 ms

@doris-robot

ClickBench: Total hot run time: 30.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ff08d2e6bb766df1a7cea027cd32bb5ebe990fe5, data reload: false

query1	0.06	0.05	0.05
query2	0.08	0.06	0.05
query3	0.25	0.08	0.09
query4	1.60	0.12	0.12
query5	0.27	0.26	0.26
query6	1.19	0.64	0.64
query7	0.03	0.03	0.02
query8	0.05	0.04	0.04
query9	0.63	0.54	0.51
query10	0.58	0.58	0.57
query11	0.16	0.10	0.11
query12	0.16	0.12	0.13
query13	0.63	0.61	0.61
query14	1.03	1.03	1.03
query15	0.88	0.84	0.90
query16	0.40	0.40	0.41
query17	1.03	1.08	1.09
query18	0.22	0.21	0.20
query19	1.97	1.85	1.84
query20	0.02	0.01	0.02
query21	15.44	0.90	0.57
query22	0.75	1.13	0.76
query23	14.87	1.38	0.70
query24	6.68	1.98	0.55
query25	0.47	0.26	0.13
query26	0.64	0.16	0.13
query27	0.07	0.06	0.05
query28	9.65	1.36	0.92
query29	12.55	3.96	3.24
query30	0.27	0.14	0.12
query31	2.82	0.61	0.39
query32	3.24	0.57	0.48
query33	3.00	3.21	3.08
query34	16.08	5.46	4.82
query35	5.00	5.02	4.91
query36	0.69	0.52	0.50
query37	0.10	0.08	0.07
query38	0.06	0.05	0.04
query39	0.04	0.04	0.03
query40	0.18	0.15	0.14
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.05	0.03	0.03
Total cold run time: 104.02 s
Total hot run time: 30.29 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 66.67% (6/9) 🎉
Increment coverage report
Complete coverage report

@morningman morningman merged commit 759ed47 into apache:master Oct 10, 2025
28 of 29 checks passed
morningman pushed a commit to apache/doris-website that referenced this pull request Nov 10, 2025
morningman pushed a commit that referenced this pull request Nov 10, 2025
…le optimization and compaction (#56413)

### What problem does this PR solve?

**Issue Number:** #56002

**Related PR:** #55679 #56638

This PR implements the `rewrite_data_files` action for Apache Iceberg
tables in Doris, providing comprehensive table optimization and data
file compaction capabilities. This feature allows users to reorganize
data files to improve query performance, optimize storage efficiency,
and maintain delete files according to Iceberg's official specification.

---

## Feature Description

The operation follows Iceberg's official `RewriteDataFiles`
specification and provides the following core capabilities:

1. **Data File Compaction**: Merges multiple small files into larger
files, reducing file count and improving query performance
2. **Storage Efficiency Optimization**: Reduces storage overhead through
file reorganization and optimizes data distribution
3. **Delete File Management**: Properly handles and maintains delete
files, reducing filtering overhead during queries
4. **WHERE Condition Support**: Supports rewriting specific data ranges
through WHERE conditions, including various data types (BIGINT, STRING,
INT, DOUBLE, BOOLEAN, DATE, TIMESTAMP, DECIMAL) and complex conditional
expressions
5. **Concurrent Execution**: Supports concurrent execution of multiple
rewrite tasks for improved processing efficiency

After execution, detailed statistics are returned, including:
- `rewritten_data_files_count`: Number of data files that were rewritten
- `added_data_files_count`: Number of new data files generated
- `rewritten_bytes_count`: Number of bytes rewritten
- `removed_delete_files_count`: Number of delete files removed

---

## Usage Example

### Basic Usage

```sql
-- Rewrite data files with default parameters
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files();
```

### Custom Parameters

```sql
-- Specify target file size and minimum input files
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3"
);
```

### Rewrite with WHERE Conditions

```sql
-- Rewrite only rows that match a date and status filter
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3",
    "delete-ratio-threshold" = "0.2"
) WHERE created_date >= '2024-01-01' AND status = 'active';

-- Rewrite data satisfying complex conditions
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "536870912"
) WHERE age > 25 AND salary > 50000.0 AND is_active = true;
```

### Rewrite All Files

```sql
-- Ignore file size limits and rewrite all files
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files("rewrite-all" = "true");
```

### Handle Delete Files

```sql
-- Trigger rewrite when delete file count or ratio exceeds threshold
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "delete-file-threshold" = "10",
    "delete-ratio-threshold" = "0.3"
);
```

---

## Parameter List

### File Size Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `target-file-size-bytes` | Long | 536870912 (512MB) | Target size in bytes for output files |
| `min-file-size-bytes` | Long | 0 (auto-calculated as 75% of target) | Minimum file size in bytes for files to be rewritten |
| `max-file-size-bytes` | Long | 0 (auto-calculated as 180% of target) | Maximum file size in bytes for files to be rewritten |

### Input Files Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `min-input-files` | Int | 5 | Minimum number of input files to rewrite together |
| `rewrite-all` | Boolean | false | Whether to rewrite all files regardless of size |
| `max-file-group-size-bytes` | Long | 107374182400 (100GB) | Maximum size in bytes for a file group to be rewritten |

### Delete Files Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `delete-file-threshold` | Int | Integer.MAX_VALUE | Minimum number of delete files to trigger rewrite |
| `delete-ratio-threshold` | Double | 0.3 | Minimum ratio of delete records to total records to trigger rewrite (0.0-1.0) |
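
One way to read the two delete thresholds is sketched below in Python. This is an illustration under stated assumptions, not the Doris implementation: `FileStats` and `needs_rewrite_for_deletes` are hypothetical names, and the sketch assumes both thresholds are inclusive minimums ("at least"), with either one being sufficient to trigger a rewrite.

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1  # stands in for Java's Integer.MAX_VALUE


@dataclass
class FileStats:
    """Hypothetical per-data-file statistics used only for this sketch."""
    record_count: int
    delete_file_count: int
    deleted_record_count: int


def needs_rewrite_for_deletes(stats, delete_file_threshold=INT_MAX, delete_ratio_threshold=0.3):
    """Return True when either delete threshold is reached (assumed inclusive)."""
    too_many_delete_files = stats.delete_file_count >= delete_file_threshold
    # Guard against empty files to avoid dividing by zero.
    ratio = stats.deleted_record_count / stats.record_count if stats.record_count > 0 else 0.0
    return too_many_delete_files or ratio >= delete_ratio_threshold


if __name__ == "__main__":
    print(needs_rewrite_for_deletes(FileStats(1000, 1, 100)))
    print(needs_rewrite_for_deletes(FileStats(1000, 1, 400)))
```

With the default `delete-file-threshold` of `Integer.MAX_VALUE`, the count check is effectively disabled, so only the ratio can trigger a rewrite unless the threshold is lowered.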

### Output Specification Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `output-spec-id` | Long | 2 | Partition specification ID for output files |

### Parameter Notes

- If `min-file-size-bytes` is not specified, default value is
`target-file-size-bytes * 0.75`
- If `max-file-size-bytes` is not specified, default value is
`target-file-size-bytes * 1.8`
- File groups are only rewritten when they meet the `min-input-files`
condition
- `delete-file-threshold` and `delete-ratio-threshold` are used to
determine if rewrite is needed to handle delete files
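
The default-size rules in the notes above amount to simple arithmetic. A minimal Python sketch follows; `resolve_size_bounds` is a hypothetical name, a value of 0 is treated as "not specified" as in the parameter table, and integer arithmetic with truncation is an assumption made here to keep the result exact.

```python
def resolve_size_bounds(target_file_size_bytes, min_file_size_bytes=0, max_file_size_bytes=0):
    """Fill in min/max file sizes from the target when they are left at 0."""
    if min_file_size_bytes == 0:
        # 75% of the target, computed with integers (assumed rounding: truncate).
        min_file_size_bytes = target_file_size_bytes * 3 // 4
    if max_file_size_bytes == 0:
        # 180% of the target, computed with integers (assumed rounding: truncate).
        max_file_size_bytes = target_file_size_bytes * 9 // 5
    return min_file_size_bytes, max_file_size_bytes


if __name__ == "__main__":
    # Default 512MB target.
    print(resolve_size_bounds(536870912))
    # 100MB target, as used in the usage examples.
    print(resolve_size_bounds(104857600))
```

For the default 512MB target this gives a lower bound of 384MB (402653184 bytes) and an upper bound of roughly 921.6MB.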

---

## Execution Flow

### Overall Process

```
1. Parameter Validation and Table Retrieval
   ├─ Validate rewrite parameters
   ├─ Get Iceberg table reference
   └─ Check if table has data snapshots

2. File Planning and Grouping
   ├─ Use RewriteDataFilePlanner to plan file scan tasks
   ├─ Filter file scan tasks based on WHERE conditions
   ├─ Organize file groups by partition and size constraints
   └─ Filter file groups that don't meet rewrite conditions

3. Concurrent Rewrite Execution
   ├─ Create RewriteDataFileExecutor
   ├─ Execute multiple file group rewrite tasks concurrently
   ├─ Each task executes INSERT-SELECT statements
   └─ Wait for all tasks to complete

4. Transaction Commit and Result Return
   ├─ Commit transaction and create new snapshot
   ├─ Update table metadata
   └─ Return detailed execution result statistics
```

### Detailed Steps

#### Step 1: Parameter Validation and Table Retrieval
- Validate the type and value range of every parameter
- If the table has no snapshots, return an empty result immediately
- Compute default values for `min-file-size-bytes` and
`max-file-size-bytes` from `target-file-size-bytes` when they are not
specified

#### Step 2: File Planning and Grouping (RewriteDataFilePlanner)
- **File Scanning**: Build `TableScan` based on WHERE conditions to get
qualified `FileScanTask`
- **File Filtering**: Filter files based on `min-file-size-bytes`,
`max-file-size-bytes`, and `rewrite-all` parameters
- **Partition Grouping**: Group files into `RewriteDataGroup` by
partition specification
- **Size Constraints**: Ensure each file group doesn't exceed
`max-file-group-size-bytes`
- **Delete File Check**: Determine if rewrite is needed based on
`delete-file-threshold` and `delete-ratio-threshold`
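
The planning steps above can be sketched roughly as follows. This is a simplified Python illustration, not the `RewriteDataFilePlanner` code: `DataFile` and `plan_groups` are hypothetical, the sketch assumes that files smaller than the minimum or larger than the maximum size are the rewrite candidates, it packs files greedily in input order, and it leaves out the WHERE filter and the delete-file check.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a planned file scan task."""
    path: str
    partition: str
    size_bytes: int


def plan_groups(files, min_size, max_size, max_group_bytes, min_input_files, rewrite_all=False):
    """Filter candidate files, group them per partition, and drop small groups."""
    # File filtering: keep files outside [min_size, max_size], or everything with rewrite_all.
    candidates = [f for f in files
                  if rewrite_all or f.size_bytes < min_size or f.size_bytes > max_size]

    # Partition grouping: files from different partitions never share a group.
    by_partition = defaultdict(list)
    for f in candidates:
        by_partition[f.partition].append(f)

    groups = []
    for partition in sorted(by_partition):
        current, current_bytes = [], 0
        for f in by_partition[partition]:
            # Size constraint: start a new group before exceeding max_group_bytes.
            if current and current_bytes + f.size_bytes > max_group_bytes:
                groups.append(current)
                current, current_bytes = [], 0
            current.append(f)
            current_bytes += f.size_bytes
        if current:
            groups.append(current)

    # Group filtering: require at least min_input_files unless rewrite_all is set.
    return [g for g in groups if rewrite_all or len(g) >= min_input_files]
```

A single file larger than `max_group_bytes` still ends up in a group of its own in this sketch; whether the real planner does the same is not stated in the description.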

#### Step 3: Concurrent Rewrite Execution (RewriteDataFileExecutor)
- **Task Creation**: Create `RewriteGroupTask` for each
`RewriteDataGroup`
- **Concurrent Execution**: Use thread pool to execute multiple rewrite
tasks concurrently
- **Data Writing**: Each task executes `INSERT INTO ... SELECT FROM ...`
statements to write data to new files
- **Progress Tracking**: Use atomic counters and `CountDownLatch` to
track task completion
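
The concurrency pattern described above can be sketched in Python. The description mentions a thread pool with atomic counters and a `CountDownLatch`; the sketch below substitutes Python's `concurrent.futures`, where waiting on the futures plays the role of the latch. `run_rewrite_tasks` and the `rewrite_group` callback are hypothetical; in the real feature each task would run an `INSERT INTO ... SELECT` statement.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_rewrite_tasks(groups, rewrite_group, max_workers=4):
    """Run one rewrite task per file group on a thread pool and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Task creation: one future per file group.
        futures = [pool.submit(rewrite_group, group) for group in groups]
        # Progress tracking: iterate futures as they finish.
        for future in as_completed(futures):
            # result() re-raises any exception thrown inside the task.
            results.append(future.result())
    return results


if __name__ == "__main__":
    # Stand-in task that only reports how many files the group holds.
    print(sorted(run_rewrite_tasks([["a"], ["b", "c"], ["d", "e", "f"]], len)))
```

Results arrive in completion order, not submission order, so callers that need a stable order should sort or key the results themselves.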

#### Step 4: Transaction Commit and Result Return
- **Transaction Management**: Use `IcebergTransaction` to manage
transactions, ensuring atomicity
- **Metadata Update**: Commit transaction to create new snapshot and
update table metadata
- **Result Statistics**: Aggregate execution results from all tasks and
return statistics
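
The result aggregation step can be illustrated with a short Python sketch. The four field names come from the statistics listed in this description; the `RewriteResult` class and `aggregate_results` function are hypothetical and only show the summing of per-task counters.

```python
from dataclasses import dataclass, fields


@dataclass
class RewriteResult:
    """Per-task or total statistics, using the field names from the description."""
    rewritten_data_files_count: int = 0
    added_data_files_count: int = 0
    rewritten_bytes_count: int = 0
    removed_delete_files_count: int = 0


def aggregate_results(results):
    """Sum each counter across all per-task results."""
    total = RewriteResult()
    for result in results:
        for field in fields(RewriteResult):
            current = getattr(total, field.name)
            setattr(total, field.name, current + getattr(result, field.name))
    return total


if __name__ == "__main__":
    print(aggregate_results([RewriteResult(3, 1, 300, 0), RewriteResult(5, 2, 700, 4)]))
```
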
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 10, 2025
…le optimization and compaction (apache#56413)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 11, 2025
…pache#56638)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 11, 2025
…le optimization and compaction (apache#56413)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 12, 2025
…pache#56638)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 12, 2025
…le optimization and compaction (apache#56413)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 13, 2025
…pache#56638)

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 13, 2025
…le optimization and compaction (apache#56413)

wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Nov 18, 2025
…le optimization and compaction (apache#56413)

### What problem does this PR solve?

**Issue Number:** apache#56002

**Related PR:** apache#55679 apache#56638

This PR implements the `rewrite_data_files` action for Apache Iceberg
tables in Doris, providing comprehensive table optimization and data
file compaction capabilities. This feature allows users to reorganize
data files to improve query performance, optimize storage efficiency,
and maintain delete files according to Iceberg's official specification.

---

## Feature Description

This PR implements the `rewrite_data_files` operation for Iceberg
tables, providing table optimization and data file compaction
capabilities. The feature follows Iceberg's official `RewriteDataFiles`
specification and provides the following core capabilities:

1. **Data File Compaction**: Merges multiple small files into larger
files, reducing file count and improving query performance
2. **Storage Efficiency Optimization**: Reduces storage overhead through
file reorganization and optimizes data distribution
3. **Delete File Management**: Properly handles and maintains delete
files, reducing filtering overhead during queries
4. **WHERE Condition Support**: Supports rewriting specific data ranges
through WHERE conditions, including various data types (BIGINT, STRING,
INT, DOUBLE, BOOLEAN, DATE, TIMESTAMP, DECIMAL) and complex conditional
expressions
5. **Concurrent Execution**: Supports concurrent execution of multiple
rewrite tasks for improved processing efficiency

After execution, detailed statistics are returned, including:
- `rewritten_data_files_count`: Number of data files that were rewritten
- `added_data_files_count`: Number of new data files generated
- `rewritten_bytes_count`: Number of bytes rewritten
- `removed_delete_files_count`: Number of delete files removed

---

## Usage Example

### Basic Usage

```sql
-- Rewrite data files with default parameters
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files();
```

### Custom Parameters

```sql
-- Specify target file size and minimum input files
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3"
);
```

### Rewrite with WHERE Conditions

```sql
-- Rewrite only rows in a date range that also match a status filter
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3",
    "delete-ratio-threshold" = "0.2"
) WHERE created_date >= '2024-01-01' AND status = 'active';

-- Rewrite data satisfying complex conditions
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "536870912"
) WHERE age > 25 AND salary > 50000.0 AND is_active = true;
```

### Rewrite All Files

```sql
-- Ignore file size limits and rewrite all files
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files("rewrite-all" = "true");
```

### Handle Delete Files

```sql
-- Trigger rewrite when delete file count or ratio exceeds threshold
ALTER TABLE iceberg_catalog.db.table 
EXECUTE rewrite_data_files(
    "delete-file-threshold" = "10",
    "delete-ratio-threshold" = "0.3"
);
```

---

## Parameter List

### File Size Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `target-file-size-bytes` | Long | 536870912 (512MB) | Target size in bytes for output files |
| `min-file-size-bytes` | Long | 0 (auto-calculated as 75% of target) | Minimum file size in bytes for files to be rewritten |
| `max-file-size-bytes` | Long | 0 (auto-calculated as 180% of target) | Maximum file size in bytes for files to be rewritten |

### Input Files Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `min-input-files` | Int | 5 | Minimum number of input files to rewrite together |
| `rewrite-all` | Boolean | false | Whether to rewrite all files regardless of size |
| `max-file-group-size-bytes` | Long | 107374182400 (100GB) | Maximum size in bytes for a file group to be rewritten |

### Delete Files Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `delete-file-threshold` | Int | Integer.MAX_VALUE | Minimum number of delete files to trigger rewrite |
| `delete-ratio-threshold` | Double | 0.3 | Minimum ratio of delete records to total records to trigger rewrite (0.0-1.0) |

### Output Specification Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `output-spec-id` | Long | 2 | Partition specification ID for output files |

### Parameter Notes

- If `min-file-size-bytes` is not specified, it defaults to
`target-file-size-bytes * 0.75`
- If `max-file-size-bytes` is not specified, it defaults to
`target-file-size-bytes * 1.8`
- A file group is rewritten only when it meets the `min-input-files`
condition
- `delete-file-threshold` and `delete-ratio-threshold` determine whether
a rewrite is needed to handle delete files
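
The default rule for the two size bounds can be sketched as follows. This is an illustration in Python, not the Doris implementation: the function name `resolve_size_bounds` is hypothetical, and using integer arithmetic that truncates toward zero is an assumption. The value `0` is treated as "not specified", as in the File Size Parameters table.

```python
def resolve_size_bounds(target_file_size_bytes=536870912,
                        min_file_size_bytes=0,
                        max_file_size_bytes=0):
    """Return (min, max) file-size bounds, deriving unset (0) values from the target."""
    if min_file_size_bytes == 0:
        # 75% of the target, computed with integers to avoid float rounding
        min_file_size_bytes = target_file_size_bytes * 3 // 4
    if max_file_size_bytes == 0:
        # 180% of the target (truncation is an assumption of this sketch)
        max_file_size_bytes = target_file_size_bytes * 9 // 5
    return min_file_size_bytes, max_file_size_bytes
```

With the default 512 MB target, files smaller than 402653184 bytes or larger than 966367641 bytes fall outside the bounds under this sketch.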

---

## Execution Flow

### Overall Process

```
1. Parameter Validation and Table Retrieval
   ├─ Validate rewrite parameters
   ├─ Get Iceberg table reference
   └─ Check if table has data snapshots

2. File Planning and Grouping
   ├─ Use RewriteDataFilePlanner to plan file scan tasks
   ├─ Filter file scan tasks based on WHERE conditions
   ├─ Organize file groups by partition and size constraints
   └─ Filter file groups that don't meet rewrite conditions

3. Concurrent Rewrite Execution
   ├─ Create RewriteDataFileExecutor
   ├─ Execute multiple file group rewrite tasks concurrently
   ├─ Each task executes INSERT-SELECT statements
   └─ Wait for all tasks to complete

4. Transaction Commit and Result Return
   ├─ Commit transaction and create new snapshot
   ├─ Update table metadata
   └─ Return detailed execution result statistics
```

### Detailed Steps

#### Step 1: Parameter Validation and Table Retrieval
- Check that every parameter is well-formed and within its allowed value
range
- If the table has no snapshots, return an empty result immediately
- Calculate default values for `min-file-size-bytes` and
`max-file-size-bytes` from `target-file-size-bytes` when they are not
specified
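
A rough Python sketch of the parameter handling in this step, under stated assumptions: `parse_params` and `DEFAULTS` are hypothetical names, the defaults are copied from the parameter tables, and only the `0.0-1.0` range for `delete-ratio-threshold` comes from this description. The positive-target check is an assumption, and `output-spec-id` is left out for brevity.

```python
DEFAULTS = {
    "target-file-size-bytes": 536870912,
    "min-file-size-bytes": 0,
    "max-file-size-bytes": 0,
    "min-input-files": 5,
    "rewrite-all": False,
    "max-file-group-size-bytes": 107374182400,
    "delete-file-threshold": 2**31 - 1,  # Integer.MAX_VALUE
    "delete-ratio-threshold": 0.3,
}


def parse_params(raw):
    """Turn the string properties from the SQL statement into typed values."""
    params = dict(DEFAULTS)
    for key, text in raw.items():
        if key not in DEFAULTS:
            raise ValueError(f"unknown parameter: {key}")
        default = DEFAULTS[key]
        # bool must be checked before int because bool is a subclass of int
        if isinstance(default, bool):
            if text.lower() not in ("true", "false"):
                raise ValueError(f"{key} must be true or false")
            params[key] = text.lower() == "true"
        elif isinstance(default, int):
            params[key] = int(text)  # raises ValueError on malformed input
        else:
            params[key] = float(text)
    if not 0.0 <= params["delete-ratio-threshold"] <= 1.0:
        raise ValueError("delete-ratio-threshold must be within 0.0-1.0")
    if params["target-file-size-bytes"] <= 0:  # assumed constraint
        raise ValueError("target-file-size-bytes must be positive")
    return params
```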

#### Step 2: File Planning and Grouping (RewriteDataFilePlanner)
- **File Scanning**: Build a `TableScan` from the WHERE conditions to
obtain the matching `FileScanTask` entries
- **File Filtering**: Select files using the `min-file-size-bytes`,
`max-file-size-bytes`, and `rewrite-all` parameters
- **Partition Grouping**: Group files into `RewriteDataGroup` instances
by partition specification
- **Size Constraints**: Ensure no file group exceeds
`max-file-group-size-bytes`
- **Delete File Check**: Decide whether a rewrite is needed based on
`delete-file-threshold` and `delete-ratio-threshold`
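
The planning bullets above might look roughly like this in Python. It is a simplified sketch, not `RewriteDataFilePlanner` itself: the `DataFile` record, the helper names, the per-file delete statistics, and the exact comparison operators are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:  # hypothetical stand-in for a scanned data file
    path: str
    partition: str
    size_bytes: int
    delete_file_count: int = 0
    delete_ratio: float = 0.0


def has_excess_deletes(f, params):
    return (f.delete_file_count >= params["delete-file-threshold"]
            or f.delete_ratio >= params["delete-ratio-threshold"])


def needs_rewrite(f, params):
    if params["rewrite-all"]:
        return True
    outside_bounds = (f.size_bytes < params["min-file-size-bytes"]
                      or f.size_bytes > params["max-file-size-bytes"])
    return outside_bounds or has_excess_deletes(f, params)


def plan_groups(files, params):
    # File filtering, then grouping by partition
    by_partition = defaultdict(list)
    for f in files:
        if needs_rewrite(f, params):
            by_partition[f.partition].append(f)

    # Split each partition so no group exceeds max-file-group-size-bytes
    groups = []
    for partition in sorted(by_partition):
        current, current_size = [], 0
        for f in by_partition[partition]:
            if current and current_size + f.size_bytes > params["max-file-group-size-bytes"]:
                groups.append(current)
                current, current_size = [], 0
            current.append(f)
            current_size += f.size_bytes
        if current:
            groups.append(current)

    # Drop groups that do not meet the rewrite conditions
    return [g for g in groups
            if params["rewrite-all"]
            or len(g) >= params["min-input-files"]
            or any(has_excess_deletes(f, params) for f in g)]
```

In this sketch a single small file in a partition is left alone unless `rewrite-all` is set or its delete statistics cross a threshold.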

#### Step 3: Concurrent Rewrite Execution (RewriteDataFileExecutor)
- **Task Creation**: Create one `RewriteGroupTask` per
`RewriteDataGroup`
- **Concurrent Execution**: Run the rewrite tasks concurrently on a
thread pool
- **Data Writing**: Each task executes `INSERT INTO ... SELECT FROM ...`
statements to write data to new files
- **Progress Tracking**: Track task completion with atomic counters and
a `CountDownLatch`
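
The concurrency pattern described here can be approximated in Python as below. The sketch is only an analogy to `RewriteDataFileExecutor`: `run_rewrite_tasks` and the `rewrite_group` callback are hypothetical, a lock-protected counter stands in for the atomic counters, and waiting on the futures stands in for the `CountDownLatch`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def run_rewrite_tasks(groups, rewrite_group, max_workers=4):
    """Run rewrite_group(group) for every group on a thread pool and wait for all."""
    completed = 0
    failed = 0
    results = []
    lock = threading.Lock()

    def run_one(group):
        nonlocal completed, failed
        try:
            result = rewrite_group(group)
        except Exception:
            with lock:
                failed += 1
            return
        with lock:
            completed += 1
            results.append(result)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, g) for g in groups]
        wait(futures)  # block until every task has finished
    return results, completed, failed
```

Results arrive in completion order, so a caller that needs a stable order should sort or key them by group.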

#### Step 4: Transaction Commit and Result Return
- **Transaction Management**: Use `IcebergTransaction` to manage the
transaction and keep the rewrite atomic
- **Metadata Update**: Commit the transaction to create a new snapshot
and update the table metadata
- **Result Statistics**: Aggregate the execution results of all tasks
and return the statistics
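
As a small illustration of the result aggregation, the per-task results can be summed field by field. The four field names come from the statistics listed in the feature description; representing each task result as a dictionary, and the name `aggregate_results`, are assumptions of this sketch.

```python
STAT_FIELDS = (
    "rewritten_data_files_count",
    "added_data_files_count",
    "rewritten_bytes_count",
    "removed_delete_files_count",
)


def aggregate_results(task_results):
    """Sum the per-task statistics into one result row."""
    total = {field: 0 for field in STAT_FIELDS}
    for result in task_results:
        for field in STAT_FIELDS:
            total[field] += result.get(field, 0)
    return total
```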
yiguolei pushed a commit that referenced this pull request Nov 22, 2025
@yiguolei yiguolei mentioned this pull request Dec 2, 2025
@suxiaogang223 suxiaogang223 deleted the alter_table_execute branch January 17, 2026 15:55
Labels

approved Indicates a PR has been approved by one committer. dev/4.0.2-merged reviewed
