perf: Fall back to Spark if query uses DPP with v1 data sources #897

Merged (6 commits into apache:main on Sep 4, 2024)

Conversation

@andygrove (Member) commented Aug 30, 2024

Which issue does this PR close?

Partial fix for #895 (only addresses the issue for v1 sources)

Rationale for this change

Avoid performance regressions in TPC-DS when sales tables are partitioned by date
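
For context, the query shape that triggers DPP is a partitioned fact table joined to a dimension table with a selective filter, which lets Spark prune the fact table's partitions at runtime. An illustrative sketch only (hypothetical schema; store_sales is assumed to be partitioned by ss_sold_date_sk):

// Illustrative only: the filter on date_dim lets Spark derive a runtime
// (dynamic) partition filter for the partitioned store_sales table.
spark.sql(
  """
    |SELECT ss_item_sk, ss_net_paid
    |FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
    |WHERE d_year = 2000
    |""".stripMargin)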

What changes are included in this PR?

How are these changes tested?

@@ -95,6 +95,10 @@ class CometSparkSessionExtensions
plan
} else {
plan.transform {
case scanExec: FileSourceScanExec if scanExec.partitionFilters.nonEmpty =>
viirya (Member) commented on the diff:
Not all partition filters are DPP filters; some are ordinary pushed-down filters.

In Spark, DataSourceScanExec has a check we can use:

private def isDynamicPruningFilter(e: Expression): Boolean =
    e.exists(_.isInstanceOf[PlanExpression[_]])
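
A minimal sketch of how that check could gate the fallback in this PR's transform (isDynamicPruningFilter is copied from Spark's DataSourceScanExec; the fallback helper shown is hypothetical):

import org.apache.spark.sql.catalyst.expressions.{Expression, PlanExpression}
import org.apache.spark.sql.execution.FileSourceScanExec

private def isDynamicPruningFilter(e: Expression): Boolean =
  e.exists(_.isInstanceOf[PlanExpression[_]])

plan.transform {
  // Fall back only when a partition filter is a dynamic pruning
  // (plan/subquery) expression, not for ordinary pushed-down filters.
  case scanExec: FileSourceScanExec
      if scanExec.partitionFilters.exists(isDynamicPruningFilter) =>
    fallbackToSpark(scanExec) // hypothetical helper: keep this scan on Spark
}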

andygrove (Member Author) replied:
Also, do you know how we can perform this check for v2 data sources (BatchScanExec)?

viirya (Member) replied:
BatchScanExec has runtimeFilters. I think it is similar?

case class BatchScanExec(
  ...
  @transient private lazy val filteredPartitions: Seq[Seq[InputPartition]] = {
    val dataSourceFilters = runtimeFilters.flatMap {
      case DynamicPruningExpression(e) => DataSourceV2Strategy.translateRuntimeFilterV2(e)
      case _ => None
    }
    ...
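
So an analogous v2 check could test the runtime filters for DynamicPruningExpression; a sketch under that assumption:

import org.apache.spark.sql.catalyst.expressions.DynamicPruningExpression
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

// Sketch: a v2 scan is subject to DPP iff any of its runtime filters is a
// DynamicPruningExpression, mirroring the flatMap above.
def usesDpp(scan: BatchScanExec): Boolean =
  scan.runtimeFilters.exists(_.isInstanceOf[DynamicPruningExpression])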

andygrove (Member Author) replied:
The test in this PR does not trigger DPP when using a v2 data source. I have read that DPP is better optimized for v1 sources, but I am not sure whether that is correct.
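
(One way to exercise the v2 path in a test, assuming the built-in Parquet source is in play, is to clear Spark's v1 source list so that the scan plans as a BatchScanExec:)

// Sketch: force DataSource v2 for built-in file formats in a test session.
spark.conf.set("spark.sql.sources.useV1SourceList", "")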

viirya (Member) replied:
Hmm, I took a look at DataSourceV2Strategy. It does push DPP filters (i.e., DynamicPruning) down into BatchScanExec (as runtimeFilters).

 case PhysicalOperation(project, filters, relation: DataSourceV2ScanRelation) =>
      // projection and filters were already pushed down in the optimizer.
      // this uses PhysicalOperation to get the projection and ensure that if the batch scan does
      // not support columnar, a projection is added to convert the rows to UnsafeRow.
      val (runtimeFilters, postScanFilters) = filters.partition {
        case _: DynamicPruning => true
        case _ => false
      }
      val batchExec = BatchScanExec(relation.output, relation.scan, runtimeFilters,
        relation.ordering, relation.relation.table,
        StoragePartitionJoinParams(relation.keyGroupedPartitioning))

@andygrove changed the title from "perf: Fall back to Spark if query uses DPP" to "perf: Fall back to Spark if query uses DPP with v1 data sources" on Aug 30, 2024
@andygrove requested a review from viirya on September 3, 2024
@andygrove (Member Author) commented:

@viirya I made this PR specific to v1 data sources for now. Could you review again?

@viirya (Member) commented Sep 4, 2024:

Looks good to me. Thanks @andygrove

@andygrove merged commit 60cad71 into apache:main on Sep 4, 2024 (74 checks passed)
@andygrove deleted the dpp-fallback branch on December 3, 2024