add cloud reading for orc #2828
Conversation
Signed-off-by: Bobby Wang <wbo4958@gmail.com>
build
 * @param conf configuration
 * @return cloud reading PartitionReader
 */
def buildBaseColumnarReaderForCloud(files: Array[PartitionedFile], conf: Configuration):
NIT:
def buildBaseColumnarReaderForCloud(
    files: Array[PartitionedFile],
    conf: Configuration): PartitionReader[ColumnarBatch]
 * @param conf the configuration
 * @return coalescing reading PartitionReader
 */
def buildBaseColumnarReaderForCoalescing(files: Array[PartitionedFile], conf: Configuration):
NIT:
def buildBaseColumnarReaderForCoalescing(
    files: Array[PartitionedFile],
    conf: Configuration): PartitionReader[ColumnarBatch]
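For context, here is a minimal sketch of the two entry points under review, written in the multi-line parameter style the NITs suggest. The trait name is hypothetical; the plugin's real base class is MultiFilePartitionReaderFactoryBase and has more members than shown, and the comments reflect the cloud-vs-coalescing split this PR extends to ORC.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical trait name, sketching the abstraction being reviewed.
trait MultiFileReaderBuilderSketch {
  // Multithreaded ("cloud") reading: each file is fetched concurrently,
  // which hides the high per-file latency of object stores like S3.
  def buildBaseColumnarReaderForCloud(
      files: Array[PartitionedFile],
      conf: Configuration): PartitionReader[ColumnarBatch]

  // Coalescing reading: many small files are combined into fewer, larger
  // batches before GPU decode, which suits fast local storage.
  def buildBaseColumnarReaderForCoalescing(
      files: Array[PartitionedFile],
      conf: Configuration): PartitionReader[ColumnarBatch]
}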
@@ -549,6 +554,7 @@ case class GpuFileSourceScanExec(
      None,
      queryUsesInputFile)(rapidsConf)
  }
Unnecessary change?
        allMetrics,
        queryUsesInputFile)

    val factory = fsRelation.fileFormat match {
Seems the creation code is exactly the same for both Parquet and ORC, so it can be simplified as:
val factory = fsRelation.fileFormat match {
  case _: ParquetFileFormat | _: OrcFileFormat =>
    GpuParquetMultiFilePartitionReaderFactory(
      sqlConf,
      broadcastedHadoopConf,
      relation.dataSchema,
      requiredSchema,
      relation.partitionSchema,
      pushedDownFilters.toArray,
      rapidsConf,
      allMetrics,
      queryUsesInputFile)
  case _ =>
    // never reach here
    throw new RuntimeException(s"File format ${fsRelation.fileFormat} is not supported yet")
}
It's different; the other one is GpuOrcMultiFilePartitionReaderFactory.
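To make the resolution concrete, here is a sketch of the dispatch with the two distinct factories. The GpuOrcMultiFilePartitionReaderFactory argument list is an assumption mirroring the Parquet call quoted above, not necessarily what the PR ships.

val factory = fsRelation.fileFormat match {
  case _: ParquetFileFormat =>
    GpuParquetMultiFilePartitionReaderFactory(
      sqlConf, broadcastedHadoopConf, relation.dataSchema, requiredSchema,
      relation.partitionSchema, pushedDownFilters.toArray, rapidsConf,
      allMetrics, queryUsesInputFile)
  case _: OrcFileFormat =>
    // A different factory with (assumed) the same argument shape.
    GpuOrcMultiFilePartitionReaderFactory(
      sqlConf, broadcastedHadoopConf, relation.dataSchema, requiredSchema,
      relation.partitionSchema, pushedDownFilters.toArray, rapidsConf,
      allMetrics, queryUsesInputFile)
  case other =>
    throw new RuntimeException(s"File format $other is not supported yet")
}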
private val isParquetFileFormat: Boolean = relation.fileFormat.isInstanceOf[ParquetFileFormat]
private val isPerFileReadEnabled = rapidsConf.isParquetPerFileReadEnabled || !isParquetFileFormat
// CSV should always use the PERFILE read type
val isPerFileReadEnabled = rapidsConf.isParquetPerFileReadEnabled ||
?
private val isPerFileReadEnabled = rapidsConf.isParquetPerFileReadEnabled ||
thx
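A hedged sketch of what the flag might look like once ORC joins Parquet, with the suggested private modifier applied. The isOrcPerFileReadEnabled accessor is an assumption mirroring isParquetPerFileReadEnabled; the real condition in the PR may differ.

// Sketch only: formats without a multi-file reader (e.g. CSV) always fall
// back to PERFILE; Parquet and ORC follow their own config flags.
private val isParquetFileFormat = relation.fileFormat.isInstanceOf[ParquetFileFormat]
private val isOrcFileFormat = relation.fileFormat.isInstanceOf[OrcFileFormat]

private val isPerFileReadEnabled =
  (isParquetFileFormat && rapidsConf.isParquetPerFileReadEnabled) ||
  (isOrcFileFormat && rapidsConf.isOrcPerFileReadEnabled) ||   // assumed accessor
  (!isParquetFileFormat && !isOrcFileFormat)                   // e.g. CSV: always PERFILE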
@@ -82,6 +83,16 @@ abstract class GpuOrcScanBase(
      new SerializableConfiguration(hadoopConf))
    GpuOrcPartitionReaderFactory(sparkSession.sessionState.conf, broadcastedConf,
Forgot to remove lines 84 - 85?
Good catch.
    queryUsesInputFile: Boolean)
  extends MultiFilePartitionReaderFactoryBase(sqlConf, broadcastedConf, rapidsConf) {

  private val fileHandler = GpuOrcFileFilterHandler(sqlConf, broadcastedConf, filters)
It would be better to mark members as @transient if they will not be used on executors, to avoid unnecessary and unexpected serialization.
The fileHandler will be used on the executor side.
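A minimal sketch of the serialization rule of thumb from this exchange, with hypothetical member names: driver-only state is marked @transient, while anything the reader needs on executors (like fileHandler here) stays serializable.

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.util.SerializableConfiguration

// Hypothetical factory, for illustration only.
class ExampleReaderFactory(
    // Driver-only member: @transient keeps it out of the serialized task
    // (it deserializes as null on executors, so never touch it there).
    @transient private val sqlConf: SQLConf,
    // Needed on executors: shipped via broadcast and unwrapped per task.
    private val broadcastedConf: Broadcast[SerializableConfiguration])
  extends Serializable {

  // Plain values computed on the driver travel with the factory; members
  // used while building readers on executors must stay non-transient.
  private val caseSensitive = sqlConf.caseSensitiveAnalysis

  def hadoopConf = broadcastedConf.value.value // safe on the executor side
}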
 *
 * @return the file format short name
 */
override def getFileFormatShortName: String = "ORC"
NIT:
override final def getFileFormatShortName: String = "ORC"
Done
  }
}
Empty change?
} else {
  val table = readToTable(currentStripes)
  try {
    table.map(GpuColumnVector.from(_, readDataSchema.toArray.map(_.dataType)))
NIT:
There is already an API, extractTypes, to convert the schema to an array of DataType:
table.map(GpuColumnVector.from(_, extractTypes(readDataSchema)))
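For reference, a stand-in showing the assumed behavior of the extractTypes helper; if it behaves like this, the suggested line is equivalent to the original readDataSchema.toArray.map(_.dataType) expression.

import org.apache.spark.sql.types.{DataType, StructType}

// Assumed behavior of extractTypes: pull each field's DataType out of
// the read schema. StructType is a Seq[StructField], so both forms
// produce the same Array[DataType].
def extractTypes(schema: StructType): Array[DataType] =
  schema.fields.map(_.dataType)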
build
Resolved review threads:
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala (outdated)
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala (outdated)
Did you also validate that the data read was the same as PERFILE? Meaning, we read it all correctly.
You mean I should validate the results of PERFILE and cloud reading when doing the performance test? The performance test is based on non-partitioned files. I will run it again based on partitioned files.
Hi @tgravescs, I did another round of performance tests on a total of 153 partitioned ORC files, 1.3G in total, with file paths like:
I also compared the results collected back to the driver for both PERFILE and MULTITHREADED reading. Due to the driver's memory limitation, I only tested on 41 partitioned ORC files, 316M in total, and the locally sorted results are the same. I also compared the results of CPU reading and MULTITHREADED reading, and the locally sorted results are the same.
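A hedged sketch of how such a comparison could be scripted from spark-shell. The spark.rapids.sql.format.orc.reader.type key and its PERFILE/MULTITHREADED values are assumptions mirroring the plugin's Parquet reader-type config, and the input path and sort key are hypothetical.

// Read the same dataset with a given reader type and return sorted rows.
def readSorted(readerType: String): Array[org.apache.spark.sql.Row] = {
  spark.conf.set("spark.rapids.sql.format.orc.reader.type", readerType) // assumed key
  spark.read.orc("/data/partitioned-orc") // hypothetical path
    .orderBy("id")                        // hypothetical sort key
    .collect()
}

val perFile = readSorted("PERFILE")
val multiThreaded = readSorted("MULTITHREADED")
assert(perFile.sameElements(multiThreaded), "readers returned different rows")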
build |
Sorry for my delay, I was OOO. Yes, I mean make sure the results coming back are the same as the CPU-side results and that we didn't drop or corrupt data. It sounds like your latest results verify this on a smaller set of data. You could also have written it out to files and then validated, but it sounds like what you did is sufficient.
Thx Tom
This PR adds the cloud reading logic for the ORC file format; the implementation is quite similar to what we have done for the Parquet file format.
I have done a round of performance tests on a total of 100 non-partitioned ORC files, 1.3G in total. Cloud reading is about 3x faster than PERFILE on these 100 ORC files.
I can't compare the performance on more ORC files because of #2850. This PR doesn't fix #2850; I will fix it in another PR.