[feat](iceberg) Implement Iceberg rewrite_data_files action for table optimization and compaction #56413

Merged: morningman merged 4 commits into apache:master from suxiaogang223:impl_iceberg_rewrite on Nov 10, 2025

[feat](iceberg) Implement Iceberg rewrite_data_files action for table optimization and compaction#56413
morningman merged 4 commits intoapache:masterfrom
suxiaogang223:impl_iceberg_rewrite

Conversation

@suxiaogang223 (Contributor) commented Sep 24, 2025

What problem does this PR solve?

Issue Number: #56002

Related PR: #55679 #56638

This PR implements the rewrite_data_files action for Apache Iceberg tables in Doris, providing comprehensive table optimization and data file compaction capabilities. This feature allows users to reorganize data files to improve query performance, optimize storage efficiency, and maintain delete files according to Iceberg's official specification.


Feature Description

The rewrite_data_files operation follows Iceberg's official RewriteDataFiles specification and provides the following core capabilities:

  1. Data File Compaction: Merges multiple small files into larger files, reducing file count and improving query performance
  2. Storage Efficiency Optimization: Reduces storage overhead through file reorganization and optimizes data distribution
  3. Delete File Management: Properly handles and maintains delete files, reducing filtering overhead during queries
  4. WHERE Condition Support: Supports rewriting specific data ranges through WHERE conditions, including various data types (BIGINT, STRING, INT, DOUBLE, BOOLEAN, DATE, TIMESTAMP, DECIMAL) and complex conditional expressions
  5. Concurrent Execution: Supports concurrent execution of multiple rewrite tasks for improved processing efficiency

After execution, detailed statistics are returned, including the following fields (see the sketch after this list):

  • `rewritten_data_files_count`: Number of data files that were rewritten
  • `added_data_files_count`: Number of new data files generated
  • `rewritten_bytes_count`: Number of bytes rewritten
  • `removed_delete_files_count`: Number of delete files removed
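
To show how these counters might be aggregated across the concurrent rewrite tasks, here is a minimal sketch; the class and field names are illustrative placeholders, not the PR's actual types.

```java
// Hypothetical holder mirroring the statistics fields listed above;
// not the PR's actual result class.
public final class RewriteStats {
    long rewrittenDataFilesCount;
    long addedDataFilesCount;
    long rewrittenBytesCount;
    long removedDeleteFilesCount;

    // Merge the per-task numbers into the overall result reported to the user.
    void add(RewriteStats other) {
        rewrittenDataFilesCount += other.rewrittenDataFilesCount;
        addedDataFilesCount += other.addedDataFilesCount;
        rewrittenBytesCount += other.rewrittenBytesCount;
        removedDeleteFilesCount += other.removedDeleteFilesCount;
    }
}
```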

Usage Example

Basic Usage

```sql
-- Rewrite data files with default parameters
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files();
```

Custom Parameters

```sql
-- Specify target file size and minimum input files
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3"
);
```

Rewrite with WHERE Conditions

```sql
-- Rewrite only data within specific date range
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3",
    "delete-ratio-threshold" = "0.2"
) WHERE created_date >= '2024-01-01' AND status = 'active';

-- Rewrite data satisfying complex conditions
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "536870912"
) WHERE age > 25 AND salary > 50000.0 AND is_active = true;
```

Rewrite All Files

```sql
-- Ignore file size limits and rewrite all files
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files("rewrite-all" = "true");
```

Handle Delete Files

```sql
-- Trigger rewrite when delete file count or ratio exceeds threshold
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "delete-file-threshold" = "10",
    "delete-ratio-threshold" = "0.3"
);
```

Parameter List

File Size Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `target-file-size-bytes` | Long | 536870912 (512MB) | Target size in bytes for output files |
| `min-file-size-bytes` | Long | 0 (auto-calculated as 75% of target) | Minimum file size in bytes for files to be rewritten |
| `max-file-size-bytes` | Long | 0 (auto-calculated as 180% of target) | Maximum file size in bytes for files to be rewritten |

Input Files Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `min-input-files` | Int | 5 | Minimum number of input files to rewrite together |
| `rewrite-all` | Boolean | false | Whether to rewrite all files regardless of size |
| `max-file-group-size-bytes` | Long | 107374182400 (100GB) | Maximum size in bytes for a file group to be rewritten |

Delete Files Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `delete-file-threshold` | Int | Integer.MAX_VALUE | Minimum number of delete files to trigger rewrite |
| `delete-ratio-threshold` | Double | 0.3 | Minimum ratio of delete records to total records to trigger rewrite (0.0-1.0) |

Output Specification Parameters

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `output-spec-id` | Long | 2 | Partition specification ID for output files |

Parameter Notes

  • If `min-file-size-bytes` is not specified, the default is `target-file-size-bytes * 0.75`
  • If `max-file-size-bytes` is not specified, the default is `target-file-size-bytes * 1.8` (see the sketch after this list)
  • File groups are only rewritten when they meet the `min-input-files` condition
  • `delete-file-threshold` and `delete-ratio-threshold` are used to determine whether a rewrite is needed to handle delete files
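
To make the notes above concrete, a small sketch of how the derived defaults and the min-input-files check could be computed; the method and variable names are hypothetical and do not come from the PR.

```java
// Hypothetical helper illustrating the parameter notes above.
// A value of 0 means "not specified" for the two size bounds, as in the tables.
public final class RewriteParamDefaults {

    static long resolveMinFileSize(long minFileSizeBytes, long targetFileSizeBytes) {
        // Default: 75% of the target size when not set explicitly.
        return minFileSizeBytes > 0 ? minFileSizeBytes : (long) (targetFileSizeBytes * 0.75);
    }

    static long resolveMaxFileSize(long maxFileSizeBytes, long targetFileSizeBytes) {
        // Default: 180% of the target size when not set explicitly.
        return maxFileSizeBytes > 0 ? maxFileSizeBytes : (long) (targetFileSizeBytes * 1.8);
    }

    static boolean groupQualifies(int inputFileCount, int minInputFiles) {
        // A file group is only rewritten when it has at least min-input-files files.
        return inputFileCount >= minInputFiles;
    }

    public static void main(String[] args) {
        long target = 536870912L;                            // 512MB default
        System.out.println(resolveMinFileSize(0, target));   // 402653184 (75% of target)
        System.out.println(resolveMaxFileSize(0, target));   // 966367641 (180% of target, truncated)
        System.out.println(groupQualifies(3, 5));            // false: fewer than min-input-files
    }
}
```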

Execution Flow

Overall Process

```
1. Parameter Validation and Table Retrieval
   ├─ Validate rewrite parameters
   ├─ Get Iceberg table reference
   └─ Check if table has data snapshots

2. File Planning and Grouping
   ├─ Use RewriteDataFilePlanner to plan file scan tasks
   ├─ Filter file scan tasks based on WHERE conditions
   ├─ Organize file groups by partition and size constraints
   └─ Filter file groups that don't meet rewrite conditions

3. Concurrent Rewrite Execution
   ├─ Create RewriteDataFileExecutor
   ├─ Execute multiple file group rewrite tasks concurrently
   ├─ Each task executes INSERT-SELECT statements
   └─ Wait for all tasks to complete

4. Transaction Commit and Result Return
   ├─ Commit transaction and create new snapshot
   ├─ Update table metadata
   └─ Return detailed execution result statistics
```

Detailed Steps

Step 1: Parameter Validation and Table Retrieval

  • Validate all parameters for validity and value ranges
  • If table has no snapshots, return empty result directly
  • Calculate default values for min-file-size-bytes and max-file-size-bytes based on parameters

Step 2: File Planning and Grouping (RewriteDataFilePlanner) - see the sketch after this list

  • File Scanning: Build TableScan based on WHERE conditions to get qualified FileScanTask
  • File Filtering: Filter files based on min-file-size-bytes, max-file-size-bytes, and rewrite-all parameters
  • Partition Grouping: Group files into RewriteDataGroup by partition specification
  • Size Constraints: Ensure each file group doesn't exceed max-file-group-size-bytes
  • Delete File Check: Determine if rewrite is needed based on delete-file-threshold and delete-ratio-threshold
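
A condensed sketch of this planning step using the public Iceberg scan API; the grouping is simplified (string partition keys, no per-group delete-ratio math) and the class and method names are hypothetical, not the PR's RewriteDataFilePlanner.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.io.CloseableIterable;

// Hypothetical, condensed version of the planning step described above.
public final class SimpleRewritePlanner {

    // Scan with the pushed-down WHERE expression, pick candidate files, group them
    // by partition, and split groups that exceed the max group size.
    static List<List<FileScanTask>> plan(Table table, Expression where,
                                         long minFileSize, long maxFileSize,
                                         long maxGroupSize, boolean rewriteAll) throws IOException {
        Map<String, List<FileScanTask>> byPartition = new HashMap<>();
        try (CloseableIterable<FileScanTask> tasks = table.newScan().filter(where).planFiles()) {
            for (FileScanTask task : tasks) {
                long size = task.file().fileSizeInBytes();
                // Candidates: everything when rewrite-all is set, otherwise files outside
                // the size window or files carrying deletes (the real planner also applies
                // delete-file-threshold / delete-ratio-threshold here).
                boolean outsideSizeWindow = size < minFileSize || size > maxFileSize;
                if (rewriteAll || outsideSizeWindow || !task.deletes().isEmpty()) {
                    String key = task.spec().specId() + "/" + task.file().partition();
                    byPartition.computeIfAbsent(key, k -> new ArrayList<>()).add(task);
                }
            }
        }

        // Split each partition's candidates into groups no larger than maxGroupSize bytes.
        List<List<FileScanTask>> groups = new ArrayList<>();
        for (List<FileScanTask> partitionTasks : byPartition.values()) {
            List<FileScanTask> current = new ArrayList<>();
            long currentBytes = 0;
            for (FileScanTask task : partitionTasks) {
                long fileBytes = task.file().fileSizeInBytes();
                if (!current.isEmpty() && currentBytes + fileBytes > maxGroupSize) {
                    groups.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(task);
                currentBytes += fileBytes;
            }
            if (!current.isEmpty()) {
                groups.add(current);
            }
        }
        return groups;
    }
}
```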

Step 3: Concurrent Rewrite Execution (RewriteDataFileExecutor) - see the sketch after this list

  • Task Creation: Create RewriteGroupTask for each RewriteDataGroup
  • Concurrent Execution: Use thread pool to execute multiple rewrite tasks concurrently
  • Data Writing: Each task executes INSERT INTO ... SELECT FROM ... statements to write data to new files
  • Progress Tracking: Use atomic counters and CountDownLatch to track task completion
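
A minimal sketch of the concurrency pattern described above (fixed thread pool, atomic counters, CountDownLatch); the Runnable body stands in for the INSERT INTO ... SELECT each real task issues, and the class name is hypothetical.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the executor pattern described above.
public final class SimpleRewriteExecutor {

    public static void runGroups(List<Runnable> groupTasks, int parallelism) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        CountDownLatch done = new CountDownLatch(groupTasks.size());
        AtomicLong succeeded = new AtomicLong();
        AtomicLong failed = new AtomicLong();

        for (Runnable task : groupTasks) {
            pool.submit(() -> {
                try {
                    // In the real executor each task runs an
                    // INSERT INTO ... SELECT FROM ... over its file group.
                    task.run();
                    succeeded.incrementAndGet();
                } catch (RuntimeException e) {
                    failed.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();    // wait for all file-group rewrites to finish
        pool.shutdown();
        System.out.printf("rewrite groups finished: %d ok, %d failed%n", succeeded.get(), failed.get());
    }
}
```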

Step 4: Transaction Commit and Result Return - see the sketch after this list

  • Transaction Management: Use IcebergTransaction to manage transactions, ensuring atomicity
  • Metadata Update: Commit transaction to create new snapshot and update table metadata
  • Result Statistics: Aggregate execution results from all tasks and return statistics
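
For orientation, a minimal sketch of the commit step in terms of the upstream Iceberg Transaction and RewriteFiles APIs; the PR goes through Doris's own IcebergTransaction wrapper, so this only approximates the underlying mechanics.

```java
import java.util.Set;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.RewriteFiles;
import org.apache.iceberg.Table;
import org.apache.iceberg.Transaction;

// Approximate shape of the commit step using the upstream Iceberg API.
public final class SimpleRewriteCommit {

    // Atomically swap the rewritten files for the newly written ones,
    // producing a single new snapshot and updated table metadata.
    public static void commitRewrite(Table table,
                                     Set<DataFile> rewrittenFiles,
                                     Set<DataFile> newFiles) {
        Transaction txn = table.newTransaction();

        RewriteFiles rewrite = txn.newRewrite();
        rewrite.rewriteFiles(rewrittenFiles, newFiles);
        rewrite.commit();          // stages the change inside the transaction

        txn.commitTransaction();   // creates the new snapshot
    }
}
```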

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@suxiaogang223 marked this pull request as draft September 24, 2025 15:18
@suxiaogang223 force-pushed the impl_iceberg_rewrite branch 4 times, most recently from 962ad3d to d46f90b on October 15, 2025 07:33
@suxiaogang223 marked this pull request as ready for review October 16, 2025 08:18
@suxiaogang223 force-pushed the impl_iceberg_rewrite branch 3 times, most recently from a6525d4 to 08dd72f on October 22, 2025 09:32
@suxiaogang223 changed the title from "[feat](iceberg) impl IcebergRewriteDataFilesAction" to "[feat](iceberg) Implement Iceberg rewrite_data_files action for table optimization and compaction" on Oct 23, 2025
@suxiaogang223 (Contributor Author):

run buildall

@doris-robot:

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.64% (1649/2045)
Line Coverage 66.98% (29094/43437)
Region Coverage 67.34% (14422/21418)
Branch Coverage 57.67% (7666/13292)

@doris-robot:

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.75% (18058/34231)
Line Coverage 37.99% (163735/431029)
Region Coverage 32.34% (124714/385646)
Branch Coverage 33.73% (54563/161775)

@suxiaogang223 (Contributor Author):

run buildall

@doris-robot:

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.64% (1649/2045)
Line Coverage 66.98% (29093/43437)
Region Coverage 67.33% (14420/21418)
Branch Coverage 57.68% (7667/13292)

@doris-robot:

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.75% (18058/34231)
Line Coverage 37.99% (163734/431029)
Region Coverage 32.36% (124793/385646)
Branch Coverage 33.73% (54566/161775)

@hello-stephen (Contributor):

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 58.41% (19590/33536)
Line Coverage 43.84% (188881/430837)
Region Coverage 38.37% (149820/390411)
Branch Coverage 39.48% (64174/162558)

@suxiaogang223 (Contributor Author):

run buildall

@hello-stephen (Contributor):

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.66% (1656/2053)
Line Coverage 66.97% (29215/43625)
Region Coverage 67.35% (14494/21521)
Branch Coverage 57.66% (7706/13364)

@doris-robot:

ClickBench: Total hot run time: 28.9 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit cdb2ba1b565b4c1353ba6b7d5b40b422e4e5569d, data reload: false

query1	0.05	0.05	0.05
query2	0.12	0.08	0.07
query3	0.31	0.08	0.07
query4	1.60	0.08	0.08
query5	0.27	0.25	0.26
query6	1.18	0.65	0.66
query7	0.03	0.03	0.02
query8	0.07	0.06	0.07
query9	0.68	0.55	0.54
query10	0.60	0.60	0.59
query11	0.27	0.14	0.14
query12	0.30	0.15	0.15
query13	0.66	0.63	0.63
query14	1.02	1.04	1.06
query15	0.96	0.86	0.86
query16	0.39	0.40	0.39
query17	1.09	1.07	1.04
query18	0.24	0.23	0.23
query19	2.03	1.86	1.84
query20	0.02	0.01	0.01
query21	15.39	0.29	0.25
query22	5.02	0.12	0.10
query23	15.37	0.40	0.24
query24	2.77	0.54	0.35
query25	0.11	0.09	0.10
query26	0.19	0.18	0.18
query27	0.09	0.09	0.09
query28	3.70	1.29	1.08
query29	12.55	4.08	3.47
query30	0.33	0.12	0.10
query31	2.85	0.65	0.44
query32	3.24	0.62	0.51
query33	3.13	3.09	3.10
query34	16.55	5.13	4.63
query35	4.60	4.59	4.54
query36	0.66	0.54	0.52
query37	0.24	0.09	0.09
query38	0.20	0.06	0.07
query39	0.06	0.05	0.05
query40	0.22	0.18	0.16
query41	0.11	0.07	0.06
query42	0.07	0.04	0.05
query43	0.07	0.06	0.05
Total cold run time: 99.41 s
Total hot run time: 28.9 s

@doris-robot:

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.76% (18065/34243)
Line Coverage 37.99% (163843/431229)
Region Coverage 32.36% (124846/385852)
Branch Coverage 33.73% (54614/161933)

@hello-stephen (Contributor):

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 58.50% (19627/33548)
Line Coverage 43.97% (189533/431041)
Region Coverage 38.56% (150640/390614)
Branch Coverage 39.59% (64426/162716)

@suxiaogang223 (Contributor Author):

run buildall

@hello-stephen (Contributor):

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 80.66% (1656/2053)
Line Coverage 66.99% (29224/43625)
Region Coverage 67.39% (14504/21521)
Branch Coverage 57.68% (7709/13364)

@doris-robot:

ClickBench: Total hot run time: 28.87 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 5476cdf3396bf557e343f29cfa872ed52dbbadc2, data reload: false

query1	0.06	0.04	0.04
query2	0.12	0.07	0.08
query3	0.31	0.07	0.07
query4	1.61	0.08	0.08
query5	0.28	0.26	0.26
query6	1.15	0.66	0.67
query7	0.03	0.02	0.02
query8	0.08	0.06	0.06
query9	0.66	0.56	0.54
query10	0.60	0.62	0.59
query11	0.27	0.14	0.14
query12	0.27	0.15	0.15
query13	0.66	0.63	0.63
query14	1.03	1.05	1.05
query15	0.95	0.87	0.87
query16	0.41	0.39	0.41
query17	1.07	1.07	1.11
query18	0.25	0.23	0.23
query19	2.03	1.84	1.89
query20	0.02	0.01	0.02
query21	15.39	0.29	0.25
query22	5.00	0.10	0.11
query23	15.36	0.39	0.24
query24	2.91	0.51	0.34
query25	0.10	0.09	0.09
query26	0.20	0.18	0.18
query27	0.09	0.09	0.10
query28	3.67	1.28	1.06
query29	12.57	4.22	3.46
query30	0.35	0.14	0.11
query31	2.85	0.64	0.44
query32	3.23	0.61	0.52
query33	3.10	3.08	3.11
query34	16.88	5.17	4.54
query35	4.60	4.55	4.58
query36	0.65	0.53	0.52
query37	0.23	0.08	0.08
query38	0.20	0.06	0.07
query39	0.06	0.06	0.05
query40	0.21	0.20	0.18
query41	0.12	0.07	0.06
query42	0.06	0.05	0.05
query43	0.07	0.06	0.05
Total cold run time: 99.76 s
Total hot run time: 28.87 s

@doris-robot:

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.75% (18066/34250)
Line Coverage 37.98% (163838/431352)
Region Coverage 32.36% (124956/386181)
Branch Coverage 33.72% (54618/161993)

@suxiaogang223 (Contributor Author):

run external

@doris-robot:

ClickBench: Total hot run time: 28.16 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7b96b1219a128d2c6186350f9edaaa981d626728, data reload: false

query1	0.05	0.04	0.04
query2	0.12	0.06	0.06
query3	0.29	0.07	0.07
query4	1.61	0.09	0.09
query5	0.26	0.26	0.25
query6	1.16	0.66	0.65
query7	0.04	0.03	0.03
query8	0.07	0.06	0.07
query9	0.65	0.54	0.53
query10	0.58	0.57	0.58
query11	0.26	0.15	0.13
query12	0.26	0.14	0.14
query13	0.64	0.63	0.62
query14	1.04	1.03	1.04
query15	0.94	0.86	0.85
query16	0.40	0.39	0.38
query17	1.03	1.03	1.04
query18	0.23	0.22	0.23
query19	1.97	1.86	1.78
query20	0.02	0.01	0.02
query21	15.40	0.28	0.24
query22	4.98	0.09	0.10
query23	15.37	0.38	0.23
query24	2.84	0.49	0.31
query25	0.10	0.08	0.09
query26	0.19	0.18	0.18
query27	0.10	0.09	0.09
query28	3.64	1.26	1.07
query29	12.54	4.06	3.32
query30	0.33	0.12	0.10
query31	2.83	0.63	0.44
query32	3.23	0.60	0.51
query33	3.12	3.03	3.15
query34	16.35	5.21	4.44
query35	4.63	4.50	4.59
query36	0.63	0.52	0.50
query37	0.22	0.09	0.09
query38	0.19	0.05	0.06
query39	0.06	0.04	0.04
query40	0.21	0.18	0.16
query41	0.11	0.07	0.06
query42	0.07	0.04	0.05
query43	0.06	0.05	0.05
Total cold run time: 98.82 s
Total hot run time: 28.16 s

@hello-stephen (Contributor):

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.78% (18226/34533)
Line Coverage 38.13% (165769/434735)
Region Coverage 33.13% (128955/389199)
Branch Coverage 33.87% (55322/163332)

@hello-stephen (Contributor):

BE Regression && UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.49% (24270/33947)
Line Coverage 58.00% (252563/435455)
Region Coverage 53.46% (211001/394720)
Branch Coverage 54.70% (89937/164415)

@hello-stephen (Contributor):

FE Regression Coverage Report

Increment line coverage 60.30% (486/806) 🎉
Increment coverage report
Complete coverage report

morningman pushed a commit to apache/doris-website that referenced this pull request Nov 10, 2025
@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer) Nov 10, 2025
@github-actions:

PR approved by at least one committer and no changes requested.

@github-actions:

PR approved by anyone and no changes requested.

@morningman merged commit b985364 into apache:master Nov 10, 2025 (29 of 31 checks passed)
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 10, 2025
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 11, 2025
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 12, 2025
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Nov 13, 2025
wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Nov 18, 2025
yiguolei pushed a commit that referenced this pull request Nov 22, 2025
@yiguolei mentioned this pull request Dec 2, 2025
@suxiaogang223 deleted the impl_iceberg_rewrite branch January 17, 2026 16:02
@morningman mentioned this pull request Jan 19, 2026 (74 tasks)
Labels: approved (Indicates a PR has been approved by one committer), dev/4.0.2-merged, reviewed

8 participants