CORE: [PARQUET] Log corrupted parquet filenames to trace bad nodes that may have written them. #13108

dhruv-pratap · 2025-05-20T21:15:35Z

We should intercept the ParquetDecodingException and log the corrupted Parquet file so that it is easy to find. Knowing the exact Parquet file that the executor failed on is essential for identifying bad nodes in a cluster that could be producing corrupt data, and eventually taking them out of service.

23/12/10 13:28:04 WARN task-result-getter-0 TaskSetManager: Lost task 55.0 in stage 0.0 (TID 55) (ip-100-74-15-132.ec2.internal executor 28): org.apache.iceberg.bdp.shaded.org.apache.parquet.io.ParquetDecodingException: could not decompress page
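The idea described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code merged in this PR: `DecodingException`, `withFileContext`, the in-memory `LOGGED` list, and the example S3 path are all hypothetical stand-ins (the real change catches `ParquetDecodingException` and logs through the class's logger). Rethrowing after logging is also an assumption here, made so the task still fails as before.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CorruptFileLoggingSketch {
  /** Hypothetical stand-in for ParquetDecodingException; not the parquet-mr class. */
  static class DecodingException extends RuntimeException {
    DecodingException(String message) {
      super(message);
    }
  }

  /** Collects log lines in memory so the sketch needs no logging library. */
  static final List<String> LOGGED = new ArrayList<>();

  /** Wraps an iterator; on a decoding failure it records the file name, then rethrows. */
  static <T> Iterator<T> withFileContext(Iterator<T> delegate, String file) {
    return new Iterator<T>() {
      @Override
      public boolean hasNext() {
        return delegate.hasNext();
      }

      @Override
      public T next() {
        try {
          return delegate.next();
        } catch (DecodingException e) {
          // The exception alone does not say which file was being read, so add it here.
          LOGGED.add("Error decoding Parquet file " + file + ": " + e.getMessage());
          throw e;
        }
      }
    };
  }

  public static void main(String[] args) {
    // A reader that always fails, simulating a corrupt page.
    Iterator<String> failing = new Iterator<String>() {
      @Override
      public boolean hasNext() {
        return true;
      }

      @Override
      public String next() {
        throw new DecodingException("could not decompress page");
      }
    };

    // Hypothetical file location, used only for illustration.
    Iterator<String> rows = withFileContext(failing, "s3://bucket/data/part-00055.parquet");
    try {
      rows.next();
    } catch (DecodingException e) {
      System.out.println(LOGGED.get(0));
    }
  }
}
```

With the file name in the log line, an operator can look up which writer produced that file, which the bare stack trace above does not allow.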


bryanck · 2025-05-20T21:25:23Z

LGTM, though looks like you need to run spotlessApply.

ebyhr · 2025-05-20T22:10:30Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java

+
+        return last;
+      } catch (ParquetDecodingException e) {
+        if (reader != null) {


Is reader really nullable? This condition looks redundant.

It does seem plausible, since org.apache.parquet.hadoop.ParquetFileReader, from which ParquetReader gets its reader, also has null checks for reader.

ebyhr · 2025-05-20T22:21:05Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java

+        if (reader != null) {
+          // Knowing the exact parquet file is essential for tracing bad nodes
+          // that produced the corrupt file, parquet lib doesn't do this today.
+          LOG.error("Error decoding Parquet file {}", reader.getFile(), e);


ParquetInputFile doesn't implement toString method, right? Does this log print human-readable path?

public String getFile() { return file.toString(); }

org.apache.iceberg.io.InputFile doesn't, but all its implementations do. For example, see S3InputFile below:

Could we just print the location?

location is a method on the org.apache.iceberg.io.OutputFile interface, and not org.apache.iceberg.io.InputFile.

Is it?

iceberg/api/src/main/java/org/apache/iceberg/io/InputFile.java

Line 53 in 6550486

String location();

My bad, it does indeed. Unfortunately, though, org.apache.parquet.hadoop.ParquetFileReader encapsulates the InputFile and only exposes getFile() to get the Parquet file location. There is also getPath(), but that has been marked as deprecated.
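The concern in this thread, whether a logged file object prints a readable path, can be illustrated with a small sketch. The types below (`SimpleInputFile`, `PlainFile`, `DescriptiveFile`) and the example path are hypothetical stand-ins, not Iceberg or Parquet classes. The sketch only shows the general Java behavior: an object without a `toString()` override prints as a class name plus hash code, while an explicit `location()` accessor does not depend on that override.

```java
public class InputFileNameSketch {
  /** Hypothetical stand-in for an input-file interface that exposes location(). */
  interface SimpleInputFile {
    String location();
  }

  /** Implementation that does not override toString(). */
  static class PlainFile implements SimpleInputFile {
    private final String location;

    PlainFile(String location) {
      this.location = location;
    }

    @Override
    public String location() {
      return location;
    }
  }

  /** Implementation that overrides toString() to return the location. */
  static class DescriptiveFile extends PlainFile {
    DescriptiveFile(String location) {
      super(location);
    }

    @Override
    public String toString() {
      return location();
    }
  }

  public static void main(String[] args) {
    SimpleInputFile plain = new PlainFile("s3://bucket/a.parquet");
    SimpleInputFile descriptive = new DescriptiveFile("s3://bucket/a.parquet");

    System.out.println(plain);            // default Object.toString(): class name plus hash code
    System.out.println(descriptive);      // the location, because toString() is overridden
    System.out.println(plain.location()); // the location, regardless of toString()
  }
}
```

When only a string derived from `toString()` is available, as with `getFile()` in the discussion above, the readability of the log line depends on every implementation providing that override.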


pvary · 2025-05-22T11:41:59Z

Merged to main.
Thanks for the PR, @dhruv-pratap, and thanks to @ebyhr and @bryanck for the review!

CORE: [PARQUET] Log corrupted parquet filenames to trace bad nodes that may have written them.

f1343e4

github-actions bot added the parquet label May 20, 2025

bryanck approved these changes May 20, 2025


Spotless apply.

4c91031

ebyhr reviewed May 20, 2025


pvary reviewed May 21, 2025

parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java

pvary reviewed May 21, 2025

parquet/src/main/java/org/apache/iceberg/parquet/ParquetReader.java

Following code guidelines, adding newline after conditional block.

d6cd8c2

pvary approved these changes May 22, 2025


pvary merged commit 91dff98 into apache:main May 22, 2025
42 checks passed
