[SPARK-24478][SQL][followup] Move projection and filter push down to physical conversion #21574
Closed
Changes from all commits (3 commits)
DataSourceV2Strategy.scala:

```diff
@@ -17,51 +17,115 @@
 package org.apache.spark.sql.execution.datasources.v2
 
-import org.apache.spark.sql.{execution, Strategy}
-import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet}
+import scala.collection.mutable
+
+import org.apache.spark.sql.{sources, Strategy}
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, AttributeSet, Expression}
 import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
-import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
 import org.apache.spark.sql.execution.streaming.continuous.{WriteToContinuousDataSource, WriteToContinuousDataSourceExec}
+import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
 
 object DataSourceV2Strategy extends Strategy {
-  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
-    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
-      val projectSet = AttributeSet(project.flatMap(_.references))
-      val filterSet = AttributeSet(filters.flatMap(_.references))
-
-      val projection = if (filterSet.subsetOf(projectSet) &&
-          AttributeSet(relation.output) == projectSet) {
-        // When the required projection contains all of the filter columns and column pruning alone
-        // can produce the required projection, push the required projection.
-        // A final projection may still be needed if the data source produces a different column
-        // order or if it cannot prune all of the nested columns.
-        relation.output
-      } else {
-        // When there are filter columns not already in the required projection or when the required
-        // projection is more complicated than column pruning, base column pruning on the set of
-        // all columns needed by both.
-        (projectSet ++ filterSet).toSeq
-      }
-
-      val reader = relation.newReader
+
+  /**
+   * Pushes down filters to the data source reader.
+   *
+   * @return pushed filters and post-scan filters.
+   */
+  private def pushFilters(
```
**Contributor:** +1 for moving these functions. I considered it in the other commit, but decided to go with fewer changes. I like them here.
```diff
+      reader: DataSourceReader,
+      filters: Seq[Expression]): (Seq[Expression], Seq[Expression]) = {
+    reader match {
+      case r: SupportsPushDownCatalystFilters =>
+        val postScanFilters = r.pushCatalystFilters(filters.toArray)
+        val pushedFilters = r.pushedCatalystFilters()
+        (pushedFilters, postScanFilters)
+
+      case r: SupportsPushDownFilters =>
+        // A map from translated data source filters to original catalyst filter expressions.
+        val translatedFilterToExpr = mutable.HashMap.empty[sources.Filter, Expression]
+        // Catalyst filter expressions that can't be translated to data source filters.
+        val untranslatableExprs = mutable.ArrayBuffer.empty[Expression]
+
+        for (filterExpr <- filters) {
+          val translated = DataSourceStrategy.translateFilter(filterExpr)
+          if (translated.isDefined) {
+            translatedFilterToExpr(translated.get) = filterExpr
+          } else {
+            untranslatableExprs += filterExpr
+          }
+        }
+
+        // Data source filters that need to be evaluated again after scanning, which means
+        // the data source cannot guarantee the rows returned can pass these filters.
+        // As a result we must return them so Spark can plan an extra filter operator.
+        val postScanFilters = r.pushFilters(translatedFilterToExpr.keys.toArray)
+          .map(translatedFilterToExpr)
+        // The filters which are marked as pushed to this data source.
+        val pushedFilters = r.pushedFilters().map(translatedFilterToExpr)
+        (pushedFilters, untranslatableExprs ++ postScanFilters)
+
+      case _ => (Nil, filters)
+    }
+  }
```
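For context on the reader side of this handshake: a source opts in by implementing `SupportsPushDownFilters`. Below is a minimal sketch, assuming a hypothetical source that can natively evaluate only `GreaterThan` predicates; the class name and that policy are illustrative, and the class is left abstract so the unrelated scan-planning members stay out of scope.

```scala
import org.apache.spark.sql.sources.{Filter, GreaterThan}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownFilters}

// Hypothetical reader that can natively evaluate only GreaterThan filters.
abstract class ExampleFilterReader extends DataSourceReader with SupportsPushDownFilters {
  private var pushed = Array.empty[Filter]

  // Spark passes in the successfully translated filters. The source keeps what
  // it can evaluate and returns the rest; DataSourceV2Strategy maps those
  // rejects back to catalyst expressions and plans a FilterExec above the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, rejected) = filters.partition(_.isInstanceOf[GreaterThan])
    pushed = supported
    rejected
  }

  // The filters this source promises to apply; these become `pushedFilters`
  // in the strategy and are reported in the scan's metadata.
  override def pushedFilters(): Array[Filter] = pushed
}
```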
```diff
-      val output = DataSourceV2Relation.pushRequiredColumns(relation, reader,
-        projection.asInstanceOf[Seq[AttributeReference]].toStructType)
+  /**
+   * Applies column pruning to the data source, w.r.t. the references of the given expressions.
+   *
+   * @return new output attributes after column pruning.
+   */
+  // TODO: nested column pruning.
+  private def pruneColumns(
+      reader: DataSourceReader,
+      relation: DataSourceV2Relation,
+      exprs: Seq[Expression]): Seq[AttributeReference] = {
+    reader match {
+      case r: SupportsPushDownRequiredColumns =>
+        val requiredColumns = AttributeSet(exprs.flatMap(_.references))
+        val neededOutput = relation.output.filter(requiredColumns.contains)
+        if (neededOutput != relation.output) {
+          r.pruneColumns(neededOutput.toStructType)
+          val nameToAttr = relation.output.map(_.name).zip(relation.output).toMap
+          r.readSchema().toAttributes.map {
+            // We have to keep the attribute id during transformation.
+            a => a.withExprId(nameToAttr(a.name).exprId)
+          }
+        } else {
+          relation.output
+        }
+
+      case _ => relation.output
+    }
+  }
```
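The reader side of column pruning is similarly small: one callback plus `readSchema()`. A minimal sketch with an illustrative class name, again left abstract to skip the scan-planning members:

```scala
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Hypothetical reader over a fixed full schema.
abstract class ExamplePruningReader(fullSchema: StructType)
  extends DataSourceReader with SupportsPushDownRequiredColumns {

  private var requiredSchema: StructType = fullSchema

  // Called by pruneColumns above with only the needed top-level columns;
  // the source should avoid materializing anything else.
  override def pruneColumns(requiredSchema: StructType): Unit = {
    this.requiredSchema = requiredSchema
  }

  // The strategy reads this back and rebuilds the output attributes from it.
  override def readSchema(): StructType = requiredSchema
}
```

The exprId bookkeeping in `pruneColumns` matters because attributes rebuilt from `readSchema()` would otherwise get fresh ids, breaking the references held by the filter and project operators planned on top of the scan.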
```diff
-      val (postScanFilters, pushedFilters) = DataSourceV2Relation.pushFilters(reader, filters)
-
-      logInfo(s"Post-Scan Filters: ${postScanFilters.mkString(",")}")
-      logInfo(s"Pushed Filters: ${pushedFilters.mkString(", ")}")
+  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
+    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
+      val reader = relation.newReader()
+      // `pushedFilters` will be pushed down and evaluated in the underlying data sources.
+      // `postScanFilters` need to be evaluated after the scan.
+      // `postScanFilters` and `pushedFilters` can overlap, e.g. the parquet row group filter.
+      val (pushedFilters, postScanFilters) = pushFilters(reader, filters)
+      val output = pruneColumns(reader, relation, project ++ postScanFilters)
+      logInfo(
+        s"""
+           |Pushing operators to ${relation.source.getClass}
+           |Pushed Filters: ${pushedFilters.mkString(", ")}
+           |Post-Scan Filters: ${postScanFilters.mkString(",")}
+           |Output: ${output.mkString(", ")}
+         """.stripMargin)
 
       val scan = DataSourceV2ScanExec(
         output, relation.source, relation.options, pushedFilters, reader)
 
-      val filter = postScanFilters.reduceLeftOption(And)
-      val withFilter = filter.map(execution.FilterExec(_, scan)).getOrElse(scan)
+      val filterCondition = postScanFilters.reduceLeftOption(And)
+      val withFilter = filterCondition.map(FilterExec(_, scan)).getOrElse(scan)
 
       val withProjection = if (withFilter.output != project) {
-        execution.ProjectExec(project, withFilter)
+        ProjectExec(project, withFilter)
       } else {
         withFilter
       }
```
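To see the rule end to end, here is a hedged spark-shell illustration against a hypothetical v2 source; the format name, column names, and the rendered plan are approximations, not output from a real source.

```scala
// Assumes a spark-shell session and a v2 source "com.example.v2source"
// (hypothetical) with integer columns i and j.
import spark.implicits._

val df = spark.read
  .format("com.example.v2source")
  .load()
  .filter($"i" > 5 && $"j" < 10)
  .select($"j")

df.explain()
// If the reader accepts the filter on i but rejects the one on j, the plan
// would look roughly like:
//
//   Filter (j#1 < 10)          <- post-scan filter planned by this rule
//   +- DataSourceV2Scan [j#1]  <- PushedFilters: [GreaterThan(i,5)]
//
// Column i is pruned from the scan output: pruneColumns only sees
// `project ++ postScanFilters`, and i is referenced solely by the fully
// pushed filter.
```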
Review thread on removing a default value elsewhere in this diff:

> Why is this change necessary?

> I can't find a place that uses this default value.

> That's because there are few places that create v2 relations so far, but when SQL statements and other paths that don't allow you to supply your own schema are added, I think this will be more common. It's okay to remove it, but I don't see much value in the change, and I like to keep non-functional changes to a minimum.

> Yeah, I agree we should keep non-functional changes to a minimum, but removing dead code is also good to do. This is a really small change; if we do need the default value in the future, it's very easy to add it back.

> Either way, it's up to you.