Skip to content

Conversation

@dhruv-pratap
Copy link
Contributor

We should intercept the ParquetDecodingException and log the corrupted parquet file to make this easily discoverable. Knowing the exact parquet file that the executor failed on is essential for identifying bad nodes in a cluster that could be producing corrupt data, and eventually take them out.

23/12/10 13:28:04 WARN task-result-getter-0 TaskSetManager: Lost task 55.0 in stage 0.0 (TID 55) (ip-100-74-15-132.ec2.internal executor 28): org.apache.iceberg.bdp.shaded.org.apache.parquet.io.ParquetDecodingException: could not decompress page

@bryanck
Copy link
Contributor

bryanck commented May 20, 2025

LGTM, though looks like you need to run spotlessApply.


return last;
} catch (ParquetDecodingException e) {
if (reader != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is reader really nullable? This condition looks redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem plausible since org.apache.parquet.hadoop.ParquetFileReader, from where the value of reader is being inherited in ParquetReader, has nullable checks for reader as well.

if (reader != null) {
// Knowing the exact parquet file is essential for tracing bad nodes
// that produced the corrupt file, parquet lib doesn't do this today.
LOG.error("Error decoding Parquet file {}", reader.getFile(), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParquetInputFile doesn't implement toString method, right? Does this log print human-readable path?

  public String getFile() {
    return file.toString();
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org.apache.iceberg.io.InputFile doesn't but all its implementations do. For example see S3InputFile below:

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just print the location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

location is a method on the org.apache.iceberg.io.OutputFile interface, and not org.apache.iceberg.io.InputFile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, it indeed does. Unfortunately though org.apache.parquet.hadoop.ParquetFileReader encapsulates the InputFile, and only exposes getFile() to get the parquet file location. There is also getPath() but that has been marked as deprecated.

@pvary pvary merged commit 91dff98 into apache:main May 22, 2025
42 checks passed
@pvary
Copy link
Contributor

pvary commented May 22, 2025

Merged to main.
Thanks for the PR @dhruv-pratap, @ebyhr, @bryanck for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants