perf: Fall back to Spark if query uses DPP with v1 data sources #897
Conversation
@@ -95,6 +95,10 @@ class CometSparkSessionExtensions
      plan
    } else {
      plan.transform {
        case scanExec: FileSourceScanExec if scanExec.partitionFilters.nonEmpty =>
Not all partition filters are DPP filters; some are probably pushed-down filters. In Spark, DataSourceScanExec has a check we can use:
private def isDynamicPruningFilter(e: Expression): Boolean =
e.exists(_.isInstanceOf[PlanExpression[_]])
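To illustrate why this check distinguishes DPP filters from ordinary pushed-down filters, here is a self-contained sketch. The expression classes below are simplified stand-ins for Spark's `Expression`/`PlanExpression` hierarchy, not the real API; only the shape of the check matches Spark's `isDynamicPruningFilter`:

```scala
// Minimal stand-in expression tree, mimicking Spark's Expression hierarchy
// for illustration only.
sealed trait Expression {
  def children: Seq[Expression] = Seq.empty
  // Mirrors Expression.exists: true if the predicate holds anywhere in the tree.
  def exists(p: Expression => Boolean): Boolean =
    p(this) || children.exists(_.exists(p))
}
case class AttributeRef(name: String) extends Expression
case class EqualTo(left: Expression, right: Expression) extends Expression {
  override def children: Seq[Expression] = Seq(left, right)
}
// Stand-in for Spark's PlanExpression (an expression embedding a query plan),
// which is what marks a dynamic pruning subquery.
case class PlanExpr(planId: Int) extends Expression

// Same shape as DataSourceScanExec.isDynamicPruningFilter.
def isDynamicPruningFilter(e: Expression): Boolean =
  e.exists(_.isInstanceOf[PlanExpr])

// A plain pushed-down partition filter contains no embedded plan, so it is
// not a DPP filter; one embedding a plan expression is.
val pushedDown = EqualTo(AttributeRef("date"), AttributeRef("lit_2024"))
val dpp = EqualTo(AttributeRef("date"), PlanExpr(42))
println(isDynamicPruningFilter(pushedDown)) // false
println(isDynamicPruningFilter(dpp))        // true
```

So `partitionFilters.nonEmpty` alone would over-trigger the fallback: the traversal for an embedded plan expression is what actually identifies DPP.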
Also, do you know how we can perform this check for v2 data sources (BatchScanExec)?
BatchScanExec has runtimeFilters. I think it is similar?
case class BatchScanExec(
...
@transient private lazy val filteredPartitions: Seq[Seq[InputPartition]] = {
val dataSourceFilters = runtimeFilters.flatMap {
case DynamicPruningExpression(e) => DataSourceV2Strategy.translateRuntimeFilterV2(e)
case _ => None
}
...
The test in this PR does not trigger DPP when using a v2 data source. I read that DPP is more optimized for v1, but I'm not sure if that is correct.
Hmm, I took a look at DataSourceV2Strategy. It has pushed the DPP filters (i.e., DynamicPruning) down into BatchScanExec (i.e., runtimeFilters):
case PhysicalOperation(project, filters, relation: DataSourceV2ScanRelation) =>
// projection and filters were already pushed down in the optimizer.
// this uses PhysicalOperation to get the projection and ensure that if the batch scan does
// not support columnar, a projection is added to convert the rows to UnsafeRow.
val (runtimeFilters, postScanFilters) = filters.partition {
case _: DynamicPruning => true
case _ => false
}
val batchExec = BatchScanExec(relation.output, relation.scan, runtimeFilters,
relation.ordering, relation.relation.table,
StoragePartitionJoinParams(relation.keyGroupedPartitioning))
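The key step above is the `partition` on the filter list: DPP expressions become the scan's runtime filters while everything else stays behind as post-scan filters. A self-contained sketch of that split, using simplified stand-in filter types rather than Spark's real classes:

```scala
// Simplified stand-ins for Spark's filter expression types, for illustration.
sealed trait Filter
case class DynamicPruning(subqueryId: Int) extends Filter
case class PushedDown(column: String, value: String) extends Filter

// Same shape as the split in DataSourceV2Strategy: DPP filters become the
// scan's runtime filters, everything else remains a post-scan filter.
val filters: Seq[Filter] =
  Seq(PushedDown("region", "EU"), DynamicPruning(1), PushedDown("year", "2024"))

val (runtimeFilters, postScanFilters) = filters.partition {
  case _: DynamicPruning => true
  case _                 => false
}

println(runtimeFilters)  // List(DynamicPruning(1))
println(postScanFilters) // List(PushedDown(region,EU), PushedDown(year,2024))
```

This is why checking `runtimeFilters.nonEmpty` on BatchScanExec would be the v2 analogue of the v1 partition-filter check discussed above.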
@viirya I made this PR specific to v1 data sources for now. Could you review again?
Looks good to me. Thanks @andygrove
Which issue does this PR close?
Partial fix for #895 (only addresses the issue for v1 data sources)
Rationale for this change
Avoid performance regressions in TPC-DS when sales tables are partitioned by date
What changes are included in this PR?
How are these changes tested?