# Add ParquetFileMerger for efficient row-group level file merging #14435
## Why this change?
This implementation provides significant performance improvements for Parquet
file merging operations by eliminating serialization/deserialization overhead.
Benchmark results show **13x faster** file merging compared to traditional
read-rewrite approaches.
The change leverages existing Parquet library capabilities (ParquetFileWriter
appendFile API) to perform zero-copy row-group merging, making it ideal for
compaction and maintenance operations on large Iceberg tables.
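For illustration, here is a minimal sketch of the zero-copy append approach using parquet-mr's ParquetFileWriter; the class and variable names are placeholders, not the PR's actual code:

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.schema.MessageType;

class RowGroupMergeSketch {
  // Copies the row groups of each input file into a single output file without
  // decoding or re-encoding records; all inputs must share the exact same schema.
  static void merge(List<InputFile> inputs, OutputFile mergedOutput, MessageType schema)
      throws IOException {
    ParquetFileWriter writer =
        new ParquetFileWriter(
            mergedOutput,
            schema,
            ParquetFileWriter.Mode.CREATE,
            ParquetWriter.DEFAULT_BLOCK_SIZE,
            ParquetWriter.MAX_PADDING_SIZE_DEFAULT);
    writer.start();
    for (InputFile input : inputs) {
      writer.appendFile(input); // zero-copy row-group append
    }
    writer.end(Collections.emptyMap());
  }
}
```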
TODO: 1) Encrypted tables are not supported yet. 2) Schema evolution is not handled yet
## What changed?
- Added ParquetFileMerger class for row-group level file merging
- Performs zero-copy merging using ParquetFileWriter.appendFile()
- Validates schema compatibility across all input files
- Supports merging multiple Parquet files into a single output file
- Reuses existing Apache Parquet library functionality instead of custom implementation
- Strict schema validation ensures data integrity during merge operations
- Added comprehensive error handling for schema mismatches
## Testing
- Validated in staging test environment
- Verified schema compatibility checks work correctly
- Confirmed 13x performance improvement over traditional approach
- Tested with various file sizes and row group configurations
Thanks @shangxinli for the PR! At a high level, leveraging Parquet’s appendFile for row‑group merging is the right approach and a performance win. Making it opt‑in via an action option and a table property is appropriate. A couple of areas I’d like to discuss:
I’d also like to get others’ opinions. @pvary @amogh-jahagirdar @nastra @singhpk234
I have a few concerns here:
Thanks @huaxingao for the review and feedback! I've addressed both of your points.
Additionally, I've addressed the architectural feedback about planning vs. execution:
Let me know if there are any other concerns or improvements you'd like to see!
Thanks @pvary for the detailed feedback! I've addressed your points:
parquetOutputFile,
schema,
ParquetFileWriter.Mode.CREATE,
ParquetWriter.DEFAULT_BLOCK_SIZE,
I'm not very sure about this part. Should it be hard-coded to ParquetWriter.DEFAULT_BLOCK_SIZE, or should it be tied to the table property write.parquet.row-group-size-bytes?
I feel using the property gives more flexibility for the rewriter, but I'd like to hear others' thoughts.
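For reference, if the merger honored the table property, the lookup could look roughly like this (a sketch using Iceberg's PropertyUtil, falling back to the parquet-mr default; not code from this PR):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.util.PropertyUtil;
import org.apache.parquet.hadoop.ParquetWriter;

class RowGroupSizeSketch {
  // Reads write.parquet.row-group-size-bytes from the table, falling back to the
  // parquet-mr default block size when the property is not set.
  static long targetRowGroupSize(Table table) {
    return PropertyUtil.propertyAsLong(
        table.properties(),
        TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
        ParquetWriter.DEFAULT_BLOCK_SIZE);
  }
}
```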
Thanks for the PR! I have a question about lineage: if the merging is performed only at the Parquet layer, will the row lineage information of a v3 table be disrupted?
Good question! The lineage information for v3 tables is preserved in two ways:
1. Field IDs are preserved because we strictly enforce identical schemas across all files being merged. In ParquetFileMerger.java:130-136, we validate that all input files have exactly the same Parquet MessageType schema (if (!schema.equals(currentSchema)) { ... }). Field IDs are stored directly in the Parquet schema structure itself (via Type.getId()), so when we copy row groups using ParquetFileWriter.appendFile() with the validated schema, all field IDs are preserved.
2. Row IDs are automatically assigned by Iceberg's commit framework, so we don't need special handling in the merger.
This is the same mechanism used by all Iceberg write operations, so row lineage is fully preserved for v3 tables.
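Roughly, the schema-identity validation described above amounts to the following sketch (assuming the parquet-mr reader API; not the PR's exact code):

```java
import java.io.IOException;
import java.util.List;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.schema.MessageType;

class SchemaCheckSketch {
  // Returns the shared schema if every input file has an identical Parquet schema;
  // identical schemas also mean identical field IDs, which preserves lineage mapping.
  static MessageType requireIdenticalSchemas(List<InputFile> inputs) throws IOException {
    MessageType schema = null;
    for (InputFile input : inputs) {
      try (ParquetFileReader reader = ParquetFileReader.open(input)) {
        MessageType current = reader.getFileMetaData().getSchema();
        if (schema == null) {
          schema = current;
        } else if (!schema.equals(current)) {
          throw new IllegalArgumentException("Input files have mismatched Parquet schemas");
        }
      }
    }
    return schema;
  }
}
```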
// Merge using ParquetFileMerger - should handle all partitions correctly
RewriteDataFiles.Result result =
    basicRewrite(table)
        .option(RewriteDataFiles.USE_PARQUET_ROW_GROUP_MERGE, "true")
Please think through all of the cases where we explicitly set USE_PARQUET_ROW_GROUP_MERGE. Should we set it based on useParquetFileMerger? If we don't want to set it based on useParquetFileMerger, do we need to run the tests 2 times with useParquetFileMerger true and false as well?
Agreed, let's use useParquetFileMerger.
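Something along these lines, where USE_PARQUET_ROW_GROUP_MERGE is the option added by this PR and useParquetFileMerger is the test's parameter (a test fragment, not the actual code):

```java
// Drive the PR's new option from the parameterized flag so the same test exercises
// both the row-group merge path (true) and the traditional rewrite path (false).
RewriteDataFiles.Result result =
    basicRewrite(table)
        .option(
            RewriteDataFiles.USE_PARQUET_ROW_GROUP_MERGE,
            String.valueOf(useParquetFileMerger))
        .execute();
```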
// This validation would normally be done in SparkParquetFileMergeRunner.canMergeAndGetSchema
// but we're testing the sort order check that happens before calling ParquetFileMerger
// Since table has sort order, validation should fail early
if (table.sortOrder().isSorted()) {
  // Should fail due to sort order
  assertThat(true).isTrue();
} else {
  // If we got here, the sort order check didn't work
  assertThat(false).isTrue();
}
I don't get this.
What do we test here?
removed.
@shangxinli: Back from the holidays! I'm fairly happy with the production part of the code. Left a few small comments. I'm happy that @SourabhBadhya is checking the Spark side. There are still a few questions regarding the tests. Please check them if you have time. Thanks, and happy new year 🎉!
1. Remove Spark-specific javadoc constraints from ParquetFileMerger
- Removed "Files must not have associated delete files" constraint
- Removed "Table must not have a sort order" constraint
- These validations are only enforced in SparkParquetFileMergeRunner,
not in the ParquetFileMerger class itself
2. Fix code style in TestParquetFileMerger
- Replace 'var' with explicit types (Parquet.DataWriteBuilder, DataWriter<Record>)
- Add newlines after for loop and try-catch blocks for better readability
- Remove unused Parquet import
3. Optimize test execution in TestRewriteDataFilesAction
- Add assumeThat for comparison tests to run once instead of twice (see the sketch after this list)
- Use String.valueOf(useParquetFileMerger) for regular tests to test both approaches
- Remove redundant testParquetFileMergerExplicitlyEnabledAndDisabled test
4. Fix TestSparkParquetFileMergeRunner to actually call canMergeAndGetSchema
- Changed canMergeAndGetSchema from private to package-private in SparkParquetFileMergeRunner
- Updated all tests to create runner instance and call canMergeAndGetSchema()
- Removed 4 trivial tests (description, inheritance, validOptions, init)
- All remaining tests now validate actual canMergeAndGetSchema behavior
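As referenced in item 3 above, the assumption guard is roughly this fragment (assuming AssertJ's Assumptions; useParquetFileMerger is the test's parameter):

```java
import static org.assertj.core.api.Assumptions.assumeThat;

// Skip one of the two parameterized values so the binPack-vs-merger comparison
// test runs only once instead of twice per test matrix.
assumeThat(useParquetFileMerger).isTrue();
```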
private static MessageType readSchema(InputFile inputFile) throws IOException {
  try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(inputFile))) {
    return reader.getFooter().getFileMetaData().getSchema();
nit: use reader.getFileMetaData().getSchema().
sure
try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(inputFile))) {
  // Read metadata from the first file
  if (extraMetadata == null) {
    extraMetadata = reader.getFooter().getFileMetaData().getKeyValueMetaData();
nit: use reader.getFileMetaData().getKeyValueMetaData()
sure
shouldHaveFiles(table, 4);

// Test that binPack() respects the configuration option
// When enabled, should use SparkParquetFileMergeRunner
How do we check that the correct merger is used?
Good point! I've added verification for this.
// Add more data to create additional files
writeRecords(2, SCALE);
shouldHaveFiles(table, dataFilesAfterFirstMerge.size() + 2);

long countBefore = currentData().size();

// Second merge: should preserve physical row IDs via binary copy
Is this testing a different feature? If so, this could be a separate test.
The first test is for generating the _row_ids, the second is copying the _row_ids.
Do I understand the intention correctly?
Yeah, the test was covering two different features together. I've split them.
- testParquetFileMergerGeneratesPhysicalRowIds()
- testParquetFileMergerPreservesPhysicalRowIds()
@TestTemplate
public void testRowLineageWithPartitionedTable() throws IOException {
  // Test that row lineage preservation works correctly with partitioned tables
These are descriptions of what we are testing. They should be a comment on the test method, not a comment in the first line of the test method.
Please check and fix the other cases as well.
Good catch! I've moved all test descriptions from inline comments to proper JavaDoc comments above the @TestTemplate annotations. Fixed in:
- testParquetFileMergerProduceConsistentRowLineageWithBinPackMerger()
- testParquetFileMergerGeneratesPhysicalRowIds()
- testParquetFileMergerPreservesPhysicalRowIds()
- testRowLineageWithPartitionedTable()
@TestTemplate
public void testRowLineageWithLargerScale() throws IOException {
  // Test row lineage preservation with larger number of files
Why is this test important in a unit test?
Do we have multiple groups, and we are checking that we are able to handle multiple groups correctly?
If so, we should check that we actually had multiple groups
You're right. Let me remove it.
@TestTemplate
public void testRowLineageConsistencyAcrossMultipleMerges() throws IOException {
  // Test that row lineage (row IDs) are preserved across multiple merge operations
I think it is enough to test:
- Generation of the _row_id and related columns
- Copy of the _row_id and related columns
- Copy of the data columns if _row_id is not needed
This kind of test seems like a duplication which uses time and resources for limited coverage gain
Removed
when(parquetFile1.format()).thenReturn(FileFormat.PARQUET);
when(parquetFile1.specId()).thenReturn(0);
when(parquetFile1.fileSizeInBytes()).thenReturn(100L);
when(parquetFile1.path()).thenReturn(tableLocation + "/data/file1.parquet");
when(group.rewrittenFiles()).thenReturn(Sets.newHashSet(parquetFile1));
when(group.expectedOutputFiles()).thenReturn(1);
when(group.maxOutputFileSize()).thenReturn(Long.MAX_VALUE);
when(group.fileScanTasks()).thenReturn(Collections.emptyList());
This is more like a question:
- In other tests we create data files and tasks instead of mocks. We have had several occasions where mocks caused issues, so we try to avoid them, or at least limit where they are used as much as possible.
Maybe something like FileGenerationUtil.generateDataFile, or TestBase.FILE_A for files?
Also, for the tasks we might be able to use MockFileScanTask.mockTask?
We could also create real RewriteFileGroup objects instead of mocks.
Good point! I've refactored the tests to use real objects instead of extensive mocks (see the sketch after this list):
- Using FileGenerationUtil.generateDataFile() for DataFile objects
- Using new MockFileScanTask() for FileScanTask objects
- Using real RewriteFileGroup instances with ImmutableRewriteDataFiles.FileGroupInfo
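As a rough illustration of that direction (assuming the iceberg-core test helpers named above behave as described; the partition argument and variable names are placeholders):

```java
// Build a concrete DataFile via the test fixture generator instead of a Mockito mock,
// then wrap it in a mock scan task with no delete files attached.
DataFile dataFile = FileGenerationUtil.generateDataFile(table, partition);
FileScanTask task = new MockFileScanTask(dataFile);
```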
}

@Test
public void testCanMergeAndGetSchemaReturnsFalseForSortedTable() {
Should this be testCanMergeAndGetSchemaReturnsNullForSortedTable? Notice Null instead of False.
I've renamed all three test methods to use "ReturnsNull" instead of "ReturnsFalse" for accuracy.
}

@Test
public void testCanMergeAndGetSchemaReturnsFalseForFilesWithDeleteFiles() {
Should this be testCanMergeAndGetSchemaReturnsNullForFilesWithDeleteFiles ? Notice Null instead of False.
Renamed to testCanMergeAndGetSchemaReturnsNullForFilesWithDeleteFiles() to accurately reflect that the method returns MessageType (null) rather than a boolean.
}

@Test
public void testCanMergeAndGetSchemaReturnsFalseForTableWithMultipleColumnSort() {
Should this be testCanMergeAndGetSchemaReturnsNullForTableWithMultipleColumnSort ? Notice Null instead of False.
Renamed to testCanMergeAndGetSchemaReturnsNullForTableWithMultipleColumnSort() to accurately reflect that the method returns MessageType (null) rather than a boolean.
// Note: ParquetFileMerger.canMergeAndGetSchema would return null because
// mock files don't exist on disk, but the initial validation checks all pass
Then we need another check, or we should create the files in the tests
Removed it. The remaining tests already cover the initial validation checks while also testing specific failure conditions.
}

@Test
public void testCanMergeReturnsFalseForNonParquetFile() throws IOException {
Rename the tests, so ReturnsNull instead of ReturnsFalse
Renamed all 6 test methods from "ReturnsFalse" to "ReturnsNull" to accurately reflect that the method returns MessageType (null) rather than a boolean.
@Test
public void testMergeFilesWithoutRowLineage() throws IOException {
  // Test that merging without firstRowIds/dataSequenceNumbers works (no row lineage columns)
These comments should be a method comment instead of a "random" comment at the beginning of the method.
Fixed
}

@Test
public void testCanMergeReturnsFalseForPhysicalRowLineageWithNulls() throws IOException {
Could we move this test next to the other testCanMerge tests? And don't forget to change False to Null
Moved testCanMergeReturnsNullForPhysicalRowLineageWithNulls() to be grouped with the other testCanMerge* validation tests at the beginning of the file (after testCanMergeReturnsTrueForIdenticalSchemas()). The test was already renamed from "ReturnsFalse" to "ReturnsNull" in an earlier fix.
}

@Test
public void testCanMergeReturnsFalseForSameFieldNameDifferentType() throws IOException {
Again move this test to the other testCanMergeReturns tests
Moved testCanMergeReturnsNullForSameFieldNameDifferentType() to be grouped with the other testCanMerge* validation tests at the beginning of the file.
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class TestParquetFileMerger {
nit: The test classes don't have to be public. Make all of them package private
Changed both test classes from public to package-private as test classes don't need public access.
}

@Test
public void testCanMergeThrowsForEmptyList() {
Please organize the test methods a bit.
Move every testCanMerge to one place, and every testMerge after.
Reorganized all test methods - all testCanMerge* validation tests are now grouped together at the beginning (10 tests), followed by all testMerge* operation tests (7 tests).
if (group.expectedOutputFiles() != 1) {
  return null;
}

// Check if table has a sort order
if (table().sortOrder().isSorted()) {
  return null;
}

// Check for delete files
boolean hasDeletes =
    group.fileScanTasks().stream().anyMatch(task -> !task.deletes().isEmpty());
if (hasDeletes) {
  return null;
}

// Validate Parquet-specific requirements and get schema
return ParquetFileMerger.canMergeAndGetSchema(
    Lists.newArrayList(group.rewrittenFiles()), table().io(), group.maxOutputFileSize());
I remember there was originally a check here to verify whether the files belong to the same partition, but I don’t see it now. Is this check still needed, or did I miss something?
group.rewrittenFiles().stream().anyMatch(file -> file.specId() != table.spec().specId())
@shangxinli @pvary
The spec ID consistency check is still happening, but it's been moved into ParquetFileMerger.canMergeAndGetSchema() rather than being in SparkParquetFileMergeRunner.canMergeAndGetSchema().
In ParquetFileMerger.canMergeAndGetSchema():
int firstSpecId = dataFiles.get(0).specId();
for (DataFile dataFile : dataFiles) {
  if (dataFile.specId() != firstSpecId) {
    return null; // Reject if any file has different spec
  }
  // ...
}
for (DataFile dataFile : dataFiles) {
  if (dataFile.specId() != firstSpecId) {
    return null;
  }
I think we should add the check group.rewrittenFiles().stream().anyMatch(file -> file.specId() != group.outputSpecId()); on the Spark side.
For this situation, see TestDataFileRewriteRunner.testPartitionSpecChange()
(line 99 in 15a72dc):
void testPartitionSpecChange() throws Exception {
When the table's partition spec has changed, the data files still carry the old partition. If we only compare the partitions of the data files, they will be merged into one file, but these files should be assigned to different partitions.
Good question! The spec consistency check is still happening, but it validates that all files have the same spec as each other rather than requiring they match the current table spec.
In ParquetFileMerger.canMergeAndGetSchema():
int firstSpecId = dataFiles.get(0).specId();
for (DataFile dataFile : dataFiles) {
  if (dataFile.specId() != firstSpecId) {
    return null; // Reject if files have different specs
  }
}
Here we only compare the specIds of the data files with each other. But when the table's partition spec has changed, the data files still carry the old spec, so we should also add a comparison between group.outputSpecId() and each dataFile.specId().
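Concretely, the suggested guard (based on the snippet above; where it should live is still being discussed) would be something like:

```java
// Reject the group if any input file was written under a different partition spec
// than the spec the rewritten output is expected to use.
boolean specMismatch =
    group.rewrittenFiles().stream().anyMatch(file -> file.specId() != group.outputSpecId());
if (specMismatch) {
  return null;
}
```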
- Use getFileMetaData() directly instead of getFooter().getFileMetaData()
- Add runnerDescription() method for test verification
- Split test for generating vs preserving physical row IDs
- Convert inline test comments to JavaDoc format
- Remove redundant tests
- Replace mocks with real test objects using FileGenerationUtil
- Rename test methods from ReturnsFalse to ReturnsNull for accuracy
- Make test classes package-private
- Reorganize tests: group all testCanMerge* validation tests together, followed by testMerge* operation tests