[SPARK-52187][SQL] Introduce Join pushdown for DSv2 #50921
Conversation
}

public String[] qualifier;
public String name;
why do we need to separate qualifier and name? I think JoinColumn should be the same as NamedReference with an additional isInLeftSideOfJoin flag.
We could have a similar implementation to FieldReference, with parts and isInLeftSideOfJoin fields, but I find the separation between qualifier and name nicer because it makes the code cleaner in some ways. If you take a look at JDBCScanBuilder.pushJoin, we are passing a condition that contains JoinColumns as leaf expressions, but these are not yet qualified. I am qualifying them later on, in the qualifyCondition method. Without the qualifier/name separation, and with parts: Seq[String], I would need to do array shifting, which is fine, but I just find it nicer my way.
I can however change the implementation of JoinColumn to be something like:
private[sql] final case class JoinColumn(
    parts: Seq[String],
    isInLeftSideOfJoin: Boolean) extends NamedReference {
  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper

  override def fieldNames(): Array[String] = parts.toArray
  override def toString: String = parts.quoted
}
Honestly, I am fine with both approaches.
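To make the trade-off concrete, here is a minimal sketch of the two variants (case-class stand-ins with assumed field names, not the PR's actual classes) and what qualifying a not-yet-qualified column looks like in each:

// Variant 1: qualifier/name split. Qualifying is a plain field replacement.
case class JoinColumnQN(qualifier: Seq[String], name: String, isInLeftSideOfJoin: Boolean)

def qualifyQN(col: JoinColumnQN, alias: Seq[String]): JoinColumnQN =
  col.copy(qualifier = alias)

// Variant 2: a single parts field, like FieldReference. Qualifying has to drop the old
// qualifier and keep only the last part -- the "array shifting" mentioned above.
case class JoinColumnParts(parts: Seq[String], isInLeftSideOfJoin: Boolean)

def qualifyParts(col: JoinColumnParts, alias: Seq[String]): JoinColumnParts =
  col.copy(parts = alias :+ col.parts.last)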
import java.util.Map;

/**
 * The builder to generate SQL for specific Join type.
I'm wondering if this is really needed. The join type string is quite simple, and Spark doesn't need to provide a helper to do it.
It might be redundant. The reason I have it is simply to answer the following question: what if some dialect names a specific join type differently? For example, what if there is a dialect that doesn't support the CROSS JOIN syntax but only JOIN? We can get the same effect with just string comparison in the dialects, so we can get rid of it if you find it to be overkill.
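A minimal sketch of that string-comparison alternative (the function name is illustrative, not part of the actual dialect API):

// Hypothetical dialect hook: remap the join type keyword for engines whose syntax differs,
// e.g. one that has no explicit CROSS JOIN and only accepts plain JOIN.
def joinTypeToSQL(sparkJoinTypeSQL: String): String = sparkJoinTypeSQL match {
  case "CROSS JOIN" => "JOIN"
  case other => other
}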
@@ -174,6 +178,12 @@ protected String visitNamedReference(NamedReference namedRef) {
    return namedRef.toString();
  }

  protected String visitJoinColumn(JoinColumn column) {
shall we fail by default? the implementations must provide the left/right side alias as a context, in order to generate the column name.
Not really. The way I designed this is that the left/right side alias is already set on the JoinColumn before visiting it, so I think this implementation is valid.
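A minimal sketch of that default behaviour as the design stood at this point (Scala for brevity; the real method is the Java visitJoinColumn shown above, and the accessor names are assumptions):

import org.apache.spark.sql.connector.join.JoinColumn

// The left/right subquery alias is already stored as the column's qualifier before the
// visitor runs, so the default visit can simply dot-join qualifier and name.
def renderJoinColumn(column: JoinColumn): String =
  (column.qualifier :+ column.name).mkString(".")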
// SALARY#0, NAME#1, DEPT#1. This is done by adding projection with appropriate aliases.
val projectList = realOutput.zip(holder.output).map { case (a1, a2) =>
  val originalName = holder.exprIdToOriginalName(a2.exprId)
  Alias(a1, originalName)(a2.exprId)
is originalName always a2.name?
No. a2 comes from holder.output, which will have aliased names in the format subquery_x_col_y. The original names are saved into sHolder at the time of its creation in createScanBuilder.
Does that answer your question?
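A minimal sketch of that bookkeeping (names follow the snippet above; the surrounding code, including relation, is assumed):

import org.apache.spark.sql.catalyst.expressions.ExprId

// At createScanBuilder time: remember the original column names keyed by expression id,
// while the holder's own output later carries generated aliases like subquery_0_col_1.
val exprIdToOriginalName: Map[ExprId, String] =
  relation.output.map(a => a.exprId -> a.name).toMap

// Later, the original names are restored by re-aliasing, as in the snippet above:
//   Alias(a1, exprIdToOriginalName(a2.exprId))(a2.exprId)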
@@ -573,6 +701,13 @@ case class ScanBuilderHolder(
  var pushedAggregate: Option[Aggregation] = None

  var pushedAggOutputMap: AttributeMap[Expression] = AttributeMap.empty[Expression]

  var joinedRelations: Seq[DataSourceV2RelationBase] = Seq()
does joinedRelations.isEmpty indicate isJoinPushed as false?
Yes, we can reuse joinedRelations.isEmpty instead of isJoinPushed. I will make that change.
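That simplification would look roughly like this on the holder (a one-line sketch with an assumed method name):

// Derive the flag from the collected relations instead of tracking a separate boolean.
def isJoinPushedDown: Boolean = joinedRelations.nonEmpty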
@cloud-fan Shall we support join pushdown for DSv2?
import org.apache.spark.annotation.Evolving;

/**
 * Base class of the public Join type API.
Please correct this comment.
@@ -0,0 +1,23 @@
/*
I plan to add other types of joins as well.
val newSchema = leftHolder.builder.build().readSchema()
val newOutput = (leftProjections ++ rightProjections).asInstanceOf[Seq[AttributeReference]]
  .zip(newSchema.fields)
We should fail if the number of columns doesn't match between Spark and the third-party data source.
also check the data type.
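A minimal sketch of the requested validation, reusing the names from the snippet above (the exact error type is an assumption): fail fast when the pushed-down scan reports a different number of columns, or different types, than Spark expects.

import org.apache.spark.sql.catalyst.expressions.AttributeReference

// newSchema is the readSchema() of the builder after the join was pushed, as above.
val expected = (leftProjections ++ rightProjections).asInstanceOf[Seq[AttributeReference]]
if (expected.length != newSchema.fields.length) {
  throw new IllegalStateException(
    s"Join pushdown returned ${newSchema.fields.length} columns, expected ${expected.length}")
}
expected.zip(newSchema.fields).foreach { case (attr, field) =>
  if (attr.dataType != field.dataType) { // a nullability-tolerant comparison could be used instead
    throw new IllegalStateException(
      s"Join pushdown type mismatch for ${field.name}: ${field.dataType} vs ${attr.dataType}")
  }
}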
)) {
  leftHolder.joinedRelations = leftHolder.joinedRelations :+ rightHolder.relation

  val newSchema = leftHolder.builder.build().readSchema()
let's not build the scan too early here, or call it an extra time.
I have introduced a new API on the join interface that returns the new schema after the join is pushed down.
val conditionString = condition.toScala match {
  case Some(cond) =>
    qualifyCondition(cond, leftSideQualifier, rightSideQualifier)
    s"ON ${dialect.compileExpression(cond).get}"
I think it's safer to pass the generated subquery aliases to the compileExpression function (or add a new compileJoinCondition function), which should respect the aliases when generating SQL for JoinColumn. It's better than making JoinColumn mutable.
Adding a new method is a bit tricky because we would need to fall back to compileExpression in case the expression is not a JoinColumn, and we can't easily call the overridden compileExpression methods.
I went with expanding JDBCSQLBuilder with optional left and right qualifiers. It seems like overkill just to support JoinColumn, but I think this is the safest way.
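A minimal sketch of that approach (the class name and the JoinColumn accessors are assumptions; the real change extends JDBCSQLBuilder): the builder is constructed with the generated subquery aliases and applies them when it meets a JoinColumn, so JoinColumn itself stays immutable.

import org.apache.spark.sql.connector.join.JoinColumn
import org.apache.spark.sql.jdbc.JdbcDialect

class QualifierAwareSQLBuilder(
    dialect: JdbcDialect,
    leftQualifier: Option[String],
    rightQualifier: Option[String]) {

  // Pick the alias for the side the column belongs to and prepend it to the name.
  def visitJoinColumn(column: JoinColumn): String = {
    val qualifier = if (column.isInLeftSideOfJoin) leftQualifier else rightQualifier
    (qualifier.toSeq :+ column.name).map(dialect.quoteIdentifier).mkString(".")
  }
}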
String[] fullyQualified = new String[qualifier.length + 1];
System.arraycopy(qualifier, 0, fullyQualified, 0, qualifier.length);
fullyQualified[qualifier.length] = name;
return fullyQualified;
It seems JoinColumn is an immutable class; should we cache fullyQualified instead of copying the array here, in case this is called many times?
I have changed the way JoinColumn works. It doesn't have the qualifier anymore, only the name. The third-party connector should handle the qualifiers, similarly to how it's done in JDBCSQLBuilder.
  this.leftSideOfJoin = leftSideOfJoin;
}

private String name;
shall we also use String[] in case we want to support join conditions with nested columns in the future?
What changes were proposed in this pull request?
With this PR I am introducing the join pushdown interface for DSv2 connectors and its implementation for JDBC connectors.
The interface itself, SupportsPushDownJoin, has the following API:
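(A minimal sketch of its shape, written in Scala for brevity; the actual interface is the Java file sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownJoin.java, the method names below come from this PR's discussion, and the parameter types are assumptions.)

import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.connector.join.JoinType
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.StructType

// Sketch only: the actual Java interface in this PR may declare different signatures.
trait SupportsPushDownJoin extends ScanBuilder {
  // Whether the other side's ScanBuilder reads from a compatible source, so that the
  // join can be attempted at the source.
  def isRightSideCompatibleForJoin(other: SupportsPushDownJoin): Boolean

  // Try to push the join down; returning false keeps the join in the Spark plan.
  def pushJoin(other: SupportsPushDownJoin, joinType: JoinType, condition: Predicate): Boolean

  // The schema of this ScanBuilder's output after the join has been pushed down.
  def getOutputSchema(): StructType
}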
If isRightSideCompatibleForJoin returns true, the join will be attempted to be pushed down (it can still fail, though). getOutputSchema returns the new schema of the ScanBuilder after the join has been pushed down.
With this implementation, only inner joins are supported; left and right joins should be added as well. Cross joins won't be supported since they can increase the amount of data that is being read.
Also, none of the dialects currently supports join pushdown; it is only available for the H2 dialect. The join pushdown capability is guarded by the SQL conf spark.sql.optimizer.datasourceV2JoinPushdown, the JDBC option pushDownJoin, and the JDBC dialect method supportsJoin.
For a JDBC query that joins two tables from the same source, the SQL generated on the Spark side contains the whole join, so the remote database executes it as a single query; see the illustration below.
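As a hypothetical illustration (catalog, table and column names are made up; this is not the PR's original example), joining two tables served by the same H2-backed JDBC catalog:

// Enable DSv2 join pushdown (the conf and the H2 dialect support are part of this PR).
spark.conf.set("spark.sql.optimizer.datasourceV2JoinPushdown", "true")

// Assumes a JDBC catalog named "h2" has already been configured to point at the database.
val joined = spark.sql(
  """SELECT e.NAME, d.DEPT_NAME
    |FROM h2.test.EMPLOYEE e
    |JOIN h2.test.DEPARTMENT d ON e.DEPT = d.ID""".stripMargin)

// With the join pushed down, the scan issued to the database is a single query shaped
// roughly like:
//   SELECT ... FROM (SELECT ... FROM "test"."EMPLOYEE") subquery_0
//   JOIN (SELECT ... FROM "test"."DEPARTMENT") subquery_1
//   ON subquery_0."DEPT" = subquery_1."ID"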
Why are the changes needed?
DSv2 connectors can't push down the join operator.
Does this PR introduce any user-facing change?
This PR itself does not, since the behaviour is not enabled for any of the connectors (besides H2, which is a testing JDBC dialect).
How was this patch tested?
New tests and some local testing with TPCDS queries.
Was this patch authored or co-authored using generative AI tooling?