CORE: [PARQUET] Log corrupted parquet filenames to trace bad nodes that may have written them. #13108
LGTM, though looks like you need to run
      return last;
    } catch (ParquetDecodingException e) {
      if (reader != null) {
Is reader really nullable? This condition looks redundant.
It does seem plausible, since org.apache.parquet.hadoop.ParquetFileReader, from which ParquetReader inherits the value of reader, has null checks for reader as well.
      if (reader != null) {
        // Knowing the exact parquet file is essential for tracing bad nodes
        // that produced the corrupt file, parquet lib doesn't do this today.
        LOG.error("Error decoding Parquet file {}", reader.getFile(), e);
ParquetInputFile doesn't implement a toString method, right? Does this log print a human-readable path?
public String getFile() {
  return file.toString();
}
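To make the concern concrete: when getFile() delegates to toString(), the logged text is only as readable as the wrapped object's toString(). The sketch below uses two hypothetical stand-in classes (OpaqueFile and DescriptiveFile, neither of which exists in Parquet or Iceberg) and a made-up path. It makes no claim about what ParquetInputFile actually does.

```java
public class ToStringLoggingSketch {

  // Hypothetical file handle that does NOT override toString().
  public static class OpaqueFile {
    private final String location;

    public OpaqueFile(String location) {
      this.location = location;
    }

    public String location() {
      return location;
    }
  }

  // Hypothetical file handle that overrides toString() to return its location.
  public static class DescriptiveFile {
    private final String location;

    public DescriptiveFile(String location) {
      this.location = location;
    }

    @Override
    public String toString() {
      return location;
    }
  }

  // Mirrors the quoted getFile(): it simply delegates to toString().
  public static String getFile(Object file) {
    return file.toString();
  }

  public static void main(String[] args) {
    String path = "s3://bucket/data/part-0.parquet"; // made-up location

    // Falls back to Object.toString(): class name plus an identity hash.
    System.out.println(getFile(new OpaqueFile(path)));

    // Prints the location itself because toString() is overridden.
    System.out.println(getFile(new DescriptiveFile(path)));
  }
}
```

If the wrapped file type turns out not to override toString(), logging a location accessor instead would be the safer choice.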
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just print the location?
location is a method on the org.apache.iceberg.io.OutputFile interface, and not org.apache.iceberg.io.InputFile.
Is it?
String location();
My bad, it indeed does. Unfortunately, though, org.apache.parquet.hadoop.ParquetFileReader encapsulates the InputFile and only exposes getFile() to get the Parquet file location. There is also getPath(), but that has been marked as deprecated.
Merged to main.

We should intercept the ParquetDecodingException and log the corrupted Parquet file to make it easily discoverable. Knowing the exact Parquet file that the executor failed on is essential for identifying bad nodes in a cluster that could be producing corrupt data, and eventually taking them out.
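The pattern described above can be sketched without the Parquet dependency. This is a minimal illustration, not the code from this PR: DecodingException stands in for ParquetDecodingException, readWithFileContext and the LOGGED list are hypothetical helpers, java.util.logging replaces the project's logger, and rethrowing after logging is an assumption about the surrounding code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class CorruptFileLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(CorruptFileLoggingSketch.class.getName());

  // Hypothetical stand-in for ParquetDecodingException.
  public static class DecodingException extends RuntimeException {
    public DecodingException(String message) {
      super(message);
    }
  }

  // Messages recorded so the sketch can be checked without a log appender.
  public static final List<String> LOGGED = new ArrayList<>();

  // Runs a decode step and, on a decoding failure, logs which file was being
  // read before rethrowing. The file may be null if no reader was opened yet.
  public static <T> T readWithFileContext(Supplier<T> decode, String file) {
    try {
      return decode.get();
    } catch (DecodingException e) {
      if (file != null) {
        String message = "Error decoding Parquet file " + file;
        LOGGED.add(message);
        LOG.log(Level.SEVERE, message, e);
      }
      // Assumption: the failure still propagates to the caller.
      throw e;
    }
  }

  public static void main(String[] args) {
    String file = "s3://bucket/data/part-0.parquet"; // made-up location
    try {
      readWithFileContext(
          () -> {
            throw new DecodingException("could not read page");
          },
          file);
    } catch (DecodingException e) {
      System.out.println("Recorded messages: " + LOGGED);
    }
  }
}
```

Logging at the point where the file name is still in scope keeps the original exception intact while attaching the one detail needed to trace the file back to the node that wrote it.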