-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19916][SQL] simplify bad file handling #17253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
| } catch { | ||
| // Throw FileNotFoundException even `ignoreCorruptFiles` is true | ||
| case e: java.io.FileNotFoundException => throw e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: FileNotFoundException will be thrown anyway, do wee need this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FileNotFoundException extends IOException
|
LGTM with minor comments. |
| start: Long, | ||
| length: Long, | ||
| locations: Array[String] = Array.empty) { | ||
| @transient locations: Array[String] = Array.empty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to mark it as transient? filePartitions: Seq[FilePartition]) is already transient in FileScanRDD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not for FileScanRDD.filePartitions, this is for FilePartitions that sent by scheduler. The location is only useful during planning, we should not send it to executors.
|
Test build #74367 has finished for PR 17253 at commit
|
| filters: Seq[Filter], | ||
| options: Map[String, String], | ||
| hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = { | ||
| // TODO: Remove this default implementation when the other formats have been ported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more TODO here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we don't need to implement this method in all sub-classes. Some FileFormat may override buildReaderWithPartitionValues directly(parquet), Some FileFormat may not be used in read path(HiveFileFormat)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I am wondering why the buildReader method is now marked as protected? Maybe you can comment here: https://issues.apache.org/jira/browse/SPARK-27751
|
LGTM |
gatorsmile
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| // Throw FileNotFoundException even `ignoreCorruptFiles` is true | ||
| case e: java.io.FileNotFoundException => throw e | ||
| case e @ (_: RuntimeException | _: IOException) => | ||
| logWarning(s"Skipped the rest content in the corrupted file: $currentFile", e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better English: "Skipped the rest of the content in the corrupted file:"
| try { | ||
| readFunction(currentFile) | ||
| } catch { | ||
| case e: java.io.FileNotFoundException => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not import the class? same below
|
Test build #74420 has finished for PR 17253 at commit
|
|
Thanks! Merging to master. |
What changes were proposed in this pull request?
We should only have one centre place to try catch the exception for corrupted files.
How was this patch tested?
existing test