# [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query #14690
In `StaticSources.scala`:
```diff
@@ -26,7 +26,7 @@ private[spark] object StaticSources {
   * The set of all static sources. These sources may be reported to from any class, including
   * static classes, without requiring reference to a SparkEnv.
   */
-  val allSources = Seq(CodegenMetrics)
+  val allSources = Seq(CodegenMetrics, HiveCatalogMetrics)
 }
```
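Each `Source` here wraps a Dropwizard (Codahale) `MetricRegistry`, so once `HiveCatalogMetrics` is in `allSources` it is reported like any other static source. A minimal sketch of dumping these registries with a Codahale `ConsoleReporter` — the reporter wiring is illustrative, not part of this patch, and since `StaticSources` is `private[spark]` this would have to live in Spark-internal code or tests:

```scala
import com.codahale.metrics.ConsoleReporter
import org.apache.spark.metrics.source.StaticSources

// Illustrative only: one-shot dump of every static source's metrics to stdout.
StaticSources.allSources.foreach { source =>
  val reporter = ConsoleReporter.forRegistry(source.metricRegistry).build()
  reporter.report() // prints the counters/histograms in this registry
}
```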
```diff
@@ -60,3 +60,35 @@ object CodegenMetrics extends Source {
   val METRIC_GENERATED_METHOD_BYTECODE_SIZE =
     metricRegistry.histogram(MetricRegistry.name("generatedMethodSize"))
 }
+
+/**
+ * :: Experimental ::
+ * Metrics for access to the hive external catalog.
+ */
+@Experimental
+object HiveCatalogMetrics extends Source {
+  override val sourceName: String = "HiveExternalCatalog"
+  override val metricRegistry: MetricRegistry = new MetricRegistry()
+
+  /**
+   * Tracks the total number of partition metadata entries fetched via the client api.
+   */
+  val METRIC_PARTITIONS_FETCHED = metricRegistry.counter(MetricRegistry.name("partitionsFetched"))
+
+  /**
+   * Tracks the total number of files discovered off of the filesystem by ListingFileCatalog.
+   */
+  val METRIC_FILES_DISCOVERED = metricRegistry.counter(MetricRegistry.name("filesDiscovered"))
+
+  /**
+   * Resets the values of all metrics to zero. This is useful in tests.
+   */
+  def reset(): Unit = {
```
> **Contributor:** should we mention that this is for testing only?
>
> **Contributor:** Done in VideoAmp#6
```diff
+    METRIC_PARTITIONS_FETCHED.dec(METRIC_PARTITIONS_FETCHED.getCount())
+    METRIC_FILES_DISCOVERED.dec(METRIC_FILES_DISCOVERED.getCount())
+  }
+
+  // clients can use these to avoid classloader issues with the codahale classes
```
> **Contributor:** I don't quite understand this comment. What issue do these two methods address?
>
> **Contributor:** I don't quite understand the issue, but if you reference the `Counter` object directly from the caller sites then you get …
>
> **Contributor:** Hm, maybe this is a load order issue.
```diff
+  def incrementFetchedPartitions(n: Int): Unit = METRIC_PARTITIONS_FETCHED.inc(n)
+  def incrementFilesDiscovered(n: Int): Unit = METRIC_FILES_DISCOVERED.inc(n)
+}
```
In `Dataset.scala`:
```diff
@@ -43,7 +43,7 @@ import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.util.usePrettyExpression
 import org.apache.spark.sql.execution.{FileRelation, LogicalRDD, QueryExecution, SQLExecution}
 import org.apache.spark.sql.execution.command.{CreateViewCommand, ExplainCommand, GlobalTempView, LocalTempView}
-import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.execution.datasources.{FileCatalog, HadoopFsRelation, LogicalRelation}
 import org.apache.spark.sql.execution.datasources.json.JacksonGenerator
 import org.apache.spark.sql.execution.python.EvaluatePython
 import org.apache.spark.sql.streaming.{DataStreamWriter, StreamingQuery}
```
```diff
@@ -2602,7 +2602,7 @@ class Dataset[T] private[sql](
    * @since 2.0.0
    */
   def inputFiles: Array[String] = {
-    val files: Seq[String] = logicalPlan.collect {
+    val files: Seq[String] = queryExecution.optimizedPlan.collect {
```
> **Contributor:** why this change?
>
> **Contributor:** We only determine the partitions read after optimization, so it's necessary to read it from the optimized plan instead of the logical plan.
```diff
       case LogicalRelation(fsBasedRelation: FileRelation, _, _) =>
         fsBasedRelation.inputFiles
       case fr: FileRelation =>
```
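To make the motivation concrete: with this patch, partition pruning during optimization determines which files a scan touches, so only the optimized plan can report them. A hedged illustration — the table and partition values are hypothetical:

```scala
// Hypothetical table with partitions part=1 and part=2.
val df = spark.table("partitioned_table").where("part = 1")

// The logical plan still describes the unpruned relation; after optimization
// the relation reflects only the surviving partition, so inputFiles should
// list files under part=1 rather than files from every partition.
df.inputFiles.foreach(println)
```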
In `FileSourceScanExec`:
```diff
@@ -225,13 +225,27 @@ case class FileSourceScanExec(
   }
 
   // These metadata values make scan plans uniquely identifiable for equality checking.
-  override val metadata: Map[String, String] = Map(
-    "Format" -> relation.fileFormat.toString,
-    "ReadSchema" -> outputSchema.catalogString,
-    "Batched" -> supportsBatch.toString,
-    "PartitionFilters" -> partitionFilters.mkString("[", ", ", "]"),
-    "PushedFilters" -> dataFilters.mkString("[", ", ", "]"),
-    "InputPaths" -> relation.location.paths.mkString(", "))
+  override val metadata: Map[String, String] = {
+    def seqToString(seq: Seq[Any]) = seq.mkString("[", ", ", "]")
+    val location = relation.location
+    val locationDesc =
+      location.getClass.getSimpleName + seqToString(location.rootPaths)
```
> **Contributor:** Should they be separated by a space?
>
> **Contributor (Author):** This style emulates the way relations are shown in a query plan, e.g. … I'm in favor of keeping this as-is.
```diff
+    val metadata =
+      Map(
+        "Format" -> relation.fileFormat.toString,
+        "ReadSchema" -> outputSchema.catalogString,
+        "Batched" -> supportsBatch.toString,
+        "PartitionFilters" -> seqToString(partitionFilters),
+        "PushedFilters" -> seqToString(dataFilters),
+        "Location" -> locationDesc)
+    val withOptPartitionCount =
+      relation.partitionSchemaOption.map { _ =>
+        metadata + ("PartitionCount" -> selectedPartitions.size.toString)
```
> **Contributor:** +1
```diff
+      } getOrElse {
+        metadata
+      }
+    withOptPartitionCount
+  }
 
   private lazy val inputRDD: RDD[InternalRow] = {
     val readFile: (PartitionedFile) => Iterator[InternalRow] =
```
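For a sense of how this surfaces to users, a hedged sketch — the table is hypothetical, and the commented output shows only the shape implied by the keys built above, not verbatim Spark output:

```scala
// Illustrative: after this change the scan node's metadata shows a
// class-qualified location plus a partition count, roughly:
//
//   Location: ListingFileCatalog[file:/tmp/partitioned_table]
//   PartitionCount: 1
//   PartitionFilters: [(part#2 = 1)]
//
// instead of the old flat "InputPaths: ..." listing.
spark.table("partitioned_table").where("part = 1").explain()
```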
Further review comments on `StaticSources.scala`:

> **Contributor:** should we move it to the sql module?
>
> **Contributor:** nvm, codegen is here too
>
> **Contributor:** should we update `StaticSources.allSources`?
>
> **Contributor:** Ah, the reason you can't is because to register this source it needs to be in the list above. This made me realize I forgot to add it to the list, actually: VideoAmp#6