branch-4.0: [feat](iceberg) Implement Iceberg rewrite_data_files action for table optimization and compaction (#56413 #56638) #57871

Merged: yiguolei merged 5 commits into apache:branch-4.0 from suxiaogang223:rewrite_data_files_4.0 on Nov 22, 2025

@suxiaogang223 (Contributor) commented Nov 10, 2025

@Thearas (Contributor) commented Nov 10, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 (Contributor, Author)

run buildall

@suxiaogang223 reopened this Nov 11, 2025
@suxiaogang223 changed the title from "branch-4.0: [feat](iceberg) Implement Iceberg rewrite_data_files action for table optimization and compaction (#56413)" to "branch-4.0: [feat](iceberg) Implement Iceberg rewrite_data_files action for table optimization and compaction (#56413 #56638)" Nov 11, 2025

…pache#56638)

Issue: apache#56002
Related: apache#55679

This PR replaces the existing `OPTIMIZE TABLE` syntax with the more
standard `ALTER TABLE ... EXECUTE action` syntax. The change gives table
actions a unified interface across the table engines in Apache Doris.

```sql
ALTER TABLE [catalog.]database.table
  EXECUTE action("key1" = "value1", "key2" = "value2", ...)
  [PARTITION (partition_list)]
  [WHERE condition]
```
…le optimization and compaction (apache#56413)

**Issue Number:** apache#56002

**Related PR:** apache#55679 apache#56638

This PR implements the `rewrite_data_files` action for Apache Iceberg
tables in Doris. The action compacts and reorganizes data files to
improve query performance and storage efficiency, and it maintains
delete files in line with Iceberg's official specification.

---

The `rewrite_data_files` operation follows Iceberg's official
`RewriteDataFiles` specification and provides the following core
capabilities:

1. **Data File Compaction**: Merges multiple small files into larger
files, reducing file count and improving query performance
2. **Storage Efficiency Optimization**: Reduces storage overhead through
file reorganization and optimizes data distribution
3. **Delete File Management**: Properly handles and maintains delete
files, reducing filtering overhead during queries
4. **WHERE Condition Support**: Supports rewriting specific data ranges
through WHERE conditions, including various data types (BIGINT, STRING,
INT, DOUBLE, BOOLEAN, DATE, TIMESTAMP, DECIMAL) and complex conditional
expressions
5. **Concurrent Execution**: Supports concurrent execution of multiple
rewrite tasks for improved processing efficiency

After execution, detailed statistics are returned, including:
- `rewritten_data_files_count`: Number of data files that were rewritten
- `added_data_files_count`: Number of new data files generated
- `rewritten_bytes_count`: Number of bytes rewritten
- `removed_delete_files_count`: Number of delete files removed

---

```sql
-- Rewrite data files with default parameters
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files();
```

```sql
-- Specify target file size and minimum input files
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3"
);
```

```sql
-- Rewrite only data within specific date range
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "104857600",
    "min-input-files" = "3",
    "delete-ratio-threshold" = "0.2"
) WHERE created_date >= '2024-01-01' AND status = 'active';

-- Rewrite data satisfying complex conditions
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "target-file-size-bytes" = "536870912"
) WHERE age > 25 AND salary > 50000.0 AND is_active = true;
```

```sql
-- Ignore file size limits and rewrite all files
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files("rewrite-all" = "true");
```

```sql
-- Trigger rewrite when delete file count or ratio exceeds threshold
ALTER TABLE iceberg_catalog.db.table
EXECUTE rewrite_data_files(
    "delete-file-threshold" = "10",
    "delete-ratio-threshold" = "0.3"
);
```
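The statements above all share one shape, so they can be generated from a property map. The following Python sketch is illustrative only: `build_execute_statement` is a hypothetical helper, not part of Doris. It omits the optional `PARTITION` clause and does no escaping beyond rejecting double quotes.

```python
from typing import Mapping, Optional


def build_execute_statement(
    table: str,
    action: str,
    properties: Optional[Mapping[str, object]] = None,
    where: Optional[str] = None,
) -> str:
    """Render an ALTER TABLE ... EXECUTE statement (hypothetical helper)."""
    rendered = []
    for key, value in (properties or {}).items():
        # Booleans are written in lowercase, as in the examples above.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        if '"' in key or '"' in text:
            raise ValueError(f"double quotes are not supported: {key!r}")
        rendered.append(f'"{key}" = "{text}"')
    args = ", ".join(rendered)
    statement = f"ALTER TABLE {table} EXECUTE {action}({args})"
    if where:
        statement += f" WHERE {where}"
    return statement + ";"


if __name__ == "__main__":
    print(build_execute_statement(
        "iceberg_catalog.db.table",
        "rewrite_data_files",
        {"target-file-size-bytes": 100 * 1024 * 1024, "min-input-files": 3},
        where="created_date >= '2024-01-01'",
    ))
```

Every value is rendered as a quoted string because the examples above pass all properties that way.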

---

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `target-file-size-bytes` | Long | 536870912 (512 MB) | Target size in bytes for output files |
| `min-file-size-bytes` | Long | 0 (auto-calculated as 75% of target) | Minimum file size in bytes for files to be rewritten |
| `max-file-size-bytes` | Long | 0 (auto-calculated as 180% of target) | Maximum file size in bytes for files to be rewritten |

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `min-input-files` | Int | 5 | Minimum number of input files to rewrite together |
| `rewrite-all` | Boolean | false | Whether to rewrite all files regardless of size |
| `max-file-group-size-bytes` | Long | 107374182400 (100 GB) | Maximum size in bytes for a file group to be rewritten |

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `delete-file-threshold` | Int | Integer.MAX_VALUE | Minimum number of delete files to trigger rewrite |
| `delete-ratio-threshold` | Double | 0.3 | Minimum ratio of delete records to total records to trigger rewrite (0.0-1.0) |

| Parameter Name | Type | Default Value | Description |
|----------------|------|---------------|-------------|
| `output-spec-id` | Long | 2 | Partition specification ID for output files |

- If `min-file-size-bytes` is not specified, it defaults to
`target-file-size-bytes * 0.75`
- If `max-file-size-bytes` is not specified, it defaults to
`target-file-size-bytes * 1.8`
- A file group is rewritten only when it meets the `min-input-files`
condition
- `delete-file-threshold` and `delete-ratio-threshold` determine whether
a rewrite is needed to handle delete files
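
The notes above can be sketched in Python. `RewriteOptions`, `DataFile`, and `should_rewrite_file` are hypothetical names, not classes from this PR. The sketch assumes, following Iceberg's documented bin-pack behavior, that a file becomes a rewrite candidate when its size falls outside the `[min, max]` window or when its delete files reach either threshold. The exact comparisons in the Doris code may differ.

```python
from dataclasses import dataclass

INT_MAX = 2**31 - 1  # stands in for Java's Integer.MAX_VALUE


@dataclass(frozen=True)
class RewriteOptions:
    target_file_size_bytes: int = 536870912  # 512 MB
    min_file_size_bytes: int = 0             # 0 means "derive from target"
    max_file_size_bytes: int = 0             # 0 means "derive from target"
    rewrite_all: bool = False
    delete_file_threshold: int = INT_MAX
    delete_ratio_threshold: float = 0.3

    def effective_min(self) -> int:
        # Default: 75% of the target size.
        return self.min_file_size_bytes or self.target_file_size_bytes * 3 // 4

    def effective_max(self) -> int:
        # Default: 180% of the target size.
        return self.max_file_size_bytes or self.target_file_size_bytes * 9 // 5


@dataclass(frozen=True)
class DataFile:
    size_bytes: int
    record_count: int
    delete_file_count: int = 0
    deleted_record_count: int = 0


def should_rewrite_file(f: DataFile, opts: RewriteOptions) -> bool:
    """Return True if the file is a rewrite candidate (assumed semantics)."""
    if opts.rewrite_all:
        return True
    outside_window = (f.size_bytes < opts.effective_min()
                      or f.size_bytes > opts.effective_max())
    too_many_delete_files = f.delete_file_count >= opts.delete_file_threshold
    ratio = f.deleted_record_count / f.record_count if f.record_count else 0.0
    high_delete_ratio = ratio >= opts.delete_ratio_threshold
    return outside_window or too_many_delete_files or high_delete_ratio
```

With the default 512 MB target, the derived window is 402653184 to 966367641 bytes, so a 10 MB file is selected while a 512 MB file with no deletes is left alone.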

---

```
1. Parameter Validation and Table Retrieval
   ├─ Validate rewrite parameters
   ├─ Get Iceberg table reference
   └─ Check if table has data snapshots

2. File Planning and Grouping
   ├─ Use RewriteDataFilePlanner to plan file scan tasks
   ├─ Filter file scan tasks based on WHERE conditions
   ├─ Organize file groups by partition and size constraints
   └─ Filter file groups that don't meet rewrite conditions

3. Concurrent Rewrite Execution
   ├─ Create RewriteDataFileExecutor
   ├─ Execute multiple file group rewrite tasks concurrently
   ├─ Each task executes INSERT-SELECT statements
   └─ Wait for all tasks to complete

4. Transaction Commit and Result Return
   ├─ Commit transaction and create new snapshot
   ├─ Update table metadata
   └─ Return detailed execution result statistics
```

**1. Parameter Validation and Table Retrieval**

- Validate every parameter's type and value range
- If the table has no snapshots, return an empty result immediately
- Calculate default values for `min-file-size-bytes` and
`max-file-size-bytes` from the supplied parameters

**2. File Planning and Grouping**

- **File Scanning**: Build a `TableScan` from the WHERE conditions to get
the qualifying `FileScanTask` objects
- **File Filtering**: Filter files based on `min-file-size-bytes`,
`max-file-size-bytes`, and `rewrite-all` parameters
- **Partition Grouping**: Group files into `RewriteDataGroup` by
partition specification
- **Size Constraints**: Ensure each file group doesn't exceed
`max-file-group-size-bytes`
- **Delete File Check**: Determine if rewrite is needed based on
`delete-file-threshold` and `delete-ratio-threshold`

**3. Concurrent Rewrite Execution**

- **Task Creation**: Create a `RewriteGroupTask` for each
`RewriteDataGroup`
- **Concurrent Execution**: Use a thread pool to execute multiple rewrite
tasks concurrently
- **Data Writing**: Each task executes `INSERT INTO ... SELECT FROM ...`
statements to write data to new files
- **Progress Tracking**: Use atomic counters and `CountDownLatch` to
track task completion

**4. Transaction Commit and Result Return**

- **Transaction Management**: Use `IcebergTransaction` to manage
transactions, ensuring atomicity
- **Metadata Update**: Commit the transaction to create a new snapshot
and update table metadata
- **Result Statistics**: Aggregate execution results from all tasks and
return statistics
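
A minimal Python sketch of steps 2 to 4 of the workflow, for illustration only. The PR itself is Java and uses a thread pool with `CountDownLatch`; here `concurrent.futures` plays that role. `plan_groups`, `rewrite_group`, and `run_rewrite` are hypothetical names, the greedy packing is an assumed strategy, and the rewrite task is a stub that returns statistics instead of running `INSERT INTO ... SELECT`. The `removed_delete_files_count` statistic and the transaction commit are left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

File = Tuple[str, int]  # (partition key, size in bytes); a simplification


def plan_groups(files: List[File], max_group_bytes: int,
                min_input_files: int) -> List[List[File]]:
    """Group files by partition, cap group size, drop groups that are too small."""
    by_partition: Dict[str, List[File]] = defaultdict(list)
    for f in files:
        by_partition[f[0]].append(f)

    groups: List[List[File]] = []
    for part_files in by_partition.values():
        current: List[File] = []
        current_bytes = 0
        for f in part_files:
            # Start a new group when adding this file would exceed the cap.
            if current and current_bytes + f[1] > max_group_bytes:
                groups.append(current)
                current, current_bytes = [], 0
            current.append(f)
            current_bytes += f[1]
        if current:
            groups.append(current)
    return [g for g in groups if len(g) >= min_input_files]


def rewrite_group(group: List[File]) -> Dict[str, int]:
    """Stub for one rewrite task; a real task would write new data files."""
    return {
        "rewritten_data_files_count": len(group),
        "added_data_files_count": 1,  # assumption: one output file per group
        "rewritten_bytes_count": sum(size for _, size in group),
    }


def run_rewrite(groups: List[List[File]], workers: int = 4) -> Dict[str, int]:
    """Run all group tasks concurrently and sum their statistics."""
    totals = {
        "rewritten_data_files_count": 0,
        "added_data_files_count": 0,
        "rewritten_bytes_count": 0,
    }
    if not groups:
        return totals  # nothing to rewrite, return an empty result
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Iterating over map() waits for each task's result in turn.
        for stats in pool.map(rewrite_group, groups):
            for key, value in stats.items():
                totals[key] += value
    return totals
```

Summing the per-task dictionaries after all tasks finish keeps the workers free of shared mutable state, which is one simple alternative to atomic counters.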

@yiguolei previously approved these changes Nov 13, 2025
@github-actions (bot) added the `approved` label ("Indicates a PR has been approved by one committer.") Nov 13, 2025

@github-actions (Contributor)

PR approved by at least one committer and no changes requested.

@github-actions (Contributor)

PR approved by anyone and no changes requested.

@hello-stephen (Contributor)

FE UT Coverage Report

Increment line coverage 27.89% (227/814) 🎉

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

| Category | Coverage |
|----------|----------|
| Function Coverage | 52.83% (18155/34364) |
| Line Coverage | 38.16% (165237/433014) |
| Region Coverage | 33.25% (128531/386559) |
| Branch Coverage | 33.99% (55383/162918) |


@github-actions (bot) removed the `approved` label ("Indicates a PR has been approved by one committer.") Nov 18, 2025

@doris-robot

Cloud UT Coverage Report

Increment line coverage 🎉

| Category | Coverage |
|----------|----------|
| Function Coverage | 83.49% (1573/1884) |
| Line Coverage | 67.66% (28053/41459) |
| Region Coverage | 68.15% (13809/20264) |
| Branch Coverage | 58.39% (7363/12610) |

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/7) 🎉

| Category | Coverage |
|----------|----------|
| Function Coverage | 52.80% (18158/34389) |
| Line Coverage | 38.13% (165311/433497) |
| Region Coverage | 33.16% (128459/387335) |
| Branch Coverage | 33.95% (55385/163114) |

@yiguolei merged commit 66a9203 into apache:branch-4.0 Nov 22, 2025
22 of 25 checks passed
@suxiaogang223 deleted the rewrite_data_files_4.0 branch January 17, 2026 16:00