[SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader #50765
Changes from all commits: 7d9ead8, 7f5889a, a88f032, 45a6e7c, f2f0da9, a5646c9, e712ca3, 035742d, 9d2b2dd, f043c69, 9f3bd92
```diff
@@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.datasources
 import java.io.{Closeable, FileNotFoundException}
 import java.net.URI
 
-import org.apache.hadoop.fs.Path
+import org.apache.hadoop.fs.{FileStatus, Path}
 import org.apache.hadoop.hdfs.BlockMissingException
 import org.apache.hadoop.security.AccessControlException
```
```diff
@@ -47,26 +47,25 @@ import org.apache.spark.util.NextIterator
  * that need to be prepended to each row.
  *
  * @param partitionValues value of partition columns to be prepended to each row.
- * @param filePath URI of the file to read
  * @param start the beginning offset (in bytes) of the block.
  * @param length number of bytes to read.
- * @param modificationTime The modification time of the input file, in milliseconds.
- * @param fileSize The length of the input file (not the block), in bytes.
+ * @param fileStatus The FileStatus instance of the file to read.
  * @param otherConstantMetadataColumnValues The values of any additional constant metadata columns.
  */
 case class PartitionedFile(
     partitionValues: InternalRow,
-    filePath: SparkPath,
     start: Long,
     length: Long,
+    fileStatus: FileStatus,
```
Inline review comments on the new `fileStatus` parameter:

Contributor: Similarly, due to the addition of …

Contributor: In addition, since …

Member (Author): The `fileStatus` should occupy a little more memory, but I haven't seen any OOM issues during the rollout of this change to our online cluster.

Contributor: @cloud-fan Are there also risks of breaking internal APIs with modifications similar to those made here and in …

Contributor: What's the cost of serializing the file status?

Member (Author): @cloud-fan I think …

Contributor: Is it possible to have a custom serde for it and only send the path string? This reminds me of …

Member (Author): Seems not feasible, because …

Member (Author): @cloud-fan The change basically moves the RPC cost from executor => storage service to driver => executors. In my environment (HDFS with RBF), the latter is much cheaper than the former. I don't have a cloud environment, so I can't give numbers for object storage services like S3.

Contributor: Hmm, then this may cause a regression for short queries?

Member: Hmm, not sure how much difference this will make in terms of driver memory usage. Is it easy to make the … It seems that in Parquet Java the file status is only used in one case: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java#L109-L132. Mostly we just need the file path and length, but this one use case seems critical to avoid a duplicated NameNode call to get the file status again.

Member (Author): @sunchao Thanks for your suggestion. After an offline discussion with @cloud-fan, I understand his concerns about the overhead of … so I'm going to split this PR into two parts.
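To make the "cost of serializing the file status" question concrete, here is a minimal, hypothetical probe using plain Java serialization. It assumes a Hadoop version in which `FileStatus` implements `java.io.Serializable`; the actual per-task cost depends on Spark's task serializer and on richer subclasses such as `LocatedFileStatus`, so treat the number as a rough upper bound, not a measurement of this PR.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import org.apache.hadoop.fs.{FileStatus, Path}

object FileStatusSizeProbe {
  // Serialize with plain java.io and report the resulting byte count.
  def javaSerializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.size()
  }

  def main(args: Array[String]): Unit = {
    // A representative HDFS file entry (all values are made up for illustration).
    val status = new FileStatus(
      134217728L,     // length: 128 MiB
      false,          // isdir
      3,              // block replication
      134217728L,     // block size
      1714000000000L, // modification time, ms
      new Path("hdfs://nn/warehouse/t/part-00000.parquet"))

    println(s"FileStatus serializes to ~${javaSerializedSize(status)} bytes")
  }
}
```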
```diff
     @transient locations: Array[String] = Array.empty,
-    modificationTime: Long = 0L,
-    fileSize: Long = 0L,
     otherConstantMetadataColumnValues: Map[String, Any] = Map.empty) {
 
+  @transient lazy val filePath: SparkPath = SparkPath.fromFileStatus(fileStatus)
```
Member (Author): If …
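For context on the `@transient lazy val` above: the field is excluded from serialization and recomputed on first access in whichever JVM deserializes the instance, so each executor rebuilds the `SparkPath` locally from the shipped `FileStatus` instead of carrying a second copy of the path over the wire. A minimal stand-in illustration of the pattern (not Spark code):

```scala
// Stand-in illustration of the @transient lazy val pattern from the diff.
case class Split(name: String) {
  // Skipped during serialization; recomputed lazily after deserialization
  // from the fields that were actually shipped.
  @transient lazy val normalized: String = name.trim.toLowerCase
}

// Each deserialized copy pays a cheap local recomputation instead of
// transferring the derived value.
val s = Split("  Part-00000.PARQUET ")
assert(s.normalized == "part-00000.parquet")
```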
```diff
   def pathUri: URI = filePath.toUri
   def toPath: Path = filePath.toPath
   def urlEncodedPath: String = filePath.urlEncoded
+  def modificationTime: Long = fileStatus.getModificationTime
+  def fileSize: Long = fileStatus.getLen
 
   override def toString: String = {
     s"path: $filePath, range: $start-${start + length}, partition values: $partitionValues"
```
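With the new shape, a call site builds the split from the `FileStatus` the driver already obtained while listing the table, and the size/timestamp accessors are derived from it rather than stored separately. A rough sketch of such a call site (the path and values are hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

val path = new Path("hdfs://nn/warehouse/t/part-00000.parquet") // hypothetical
val fs = path.getFileSystem(new Configuration())
val status: FileStatus = fs.getFileStatus(path) // one GetFileInfo RPC, on the driver

// The split now carries the whole FileStatus; path, size, and modification
// time are all derived from it instead of being passed in separately.
val split = PartitionedFile(
  partitionValues = InternalRow.empty,
  start = 0L,
  length = status.getLen,
  fileStatus = status)

assert(split.fileSize == status.getLen)
assert(split.modificationTime == status.getModificationTime)
```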
This constructor internally calls `HadoopInputFile.fromPath(file, configuration)`, which produces an unnecessary `GetFileInfo` RPC.
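The fix follows from parquet-java's two `HadoopInputFile` factory methods: `fromPath` resolves the file status itself, while `fromStatus` reuses a status the caller already holds. A minimal sketch of the difference (the wrapper functions are hypothetical; the two factory methods are real parquet-java API):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.parquet.hadoop.util.HadoopInputFile

// fromPath() has to look the file up itself: it calls fs.getFileStatus(path),
// which issues a GetFileInfo RPC to the NameNode for every file opened.
def openByPath(path: Path, conf: Configuration): HadoopInputFile =
  HadoopInputFile.fromPath(path, conf)

// fromStatus() reuses the FileStatus the driver collected while listing the
// table, so opening the file costs no extra NameNode round trip.
def openByStatus(status: FileStatus, conf: Configuration): HadoopInputFile =
  HadoopInputFile.fromStatus(status, conf)
```

This is exactly the trade-off discussed in the thread above: carrying the `FileStatus` in `PartitionedFile` makes the task payload slightly larger, but lets executors open Parquet files via `fromStatus` and skip the per-file NameNode lookup.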