[SPARK-40534][CONNECT] Extend the support for Join with different join types #38157
```diff
@@ -17,10 +17,11 @@
 package org.apache.spark.sql.connect.planner

 import org.apache.spark.connect.proto
+import org.apache.spark.connect.proto.Join.JoinType
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._
 import org.apache.spark.sql.catalyst.expressions.AttributeReference
-import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.{FullOuter, Inner, LeftAnti, LeftOuter, LeftSemi, PlanTest, RightOuter}
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

 /**
```
```diff
@@ -32,8 +33,12 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {

   lazy val connectTestRelation = createLocalRelationProto(Seq($"id".int))

+  lazy val connectTestRelation2 = createLocalRelationProto(Seq($"key".int, $"value".int))
+
   lazy val sparkTestRelation: LocalRelation = LocalRelation($"id".int)

+  lazy val sparkTestRelation2: LocalRelation = LocalRelation($"key".int, $"value".int)
+
   test("Basic select") {
     val connectPlan = {
       // TODO: Scala only allows one implicit per scope so we keep proto implicit imports in
```
```diff
@@ -46,6 +51,36 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
   }

+  test("Basic joins with different join types") {
+    val connectPlan = {
```
Contributor: Can you add a test case for unspecified as well, to see that we catch the error?

Author: I discussed with @cloud-fan and we decided to remove both. Proto is our API. We should make the proto itself less ambiguous. For example, any proto submitted to the server should not contain a […]

Contributor: The reason for unspecified is not the proto contract but the language behavior of the different auto-generated targets. To avoid issues with defaults, the recommendation in the typical proto style guides is to always make the first element of an enum unspecified. Cc @cloud-fan

Contributor: If the proto is an API, I'd say the join type is a required field and clients must set the join type in the join plan. For the Python client, its DataFrame API can omit the join type, and the Python client should use INNER as the default join type.

Contributor: There are two layers to this discussion, one at the proto infra level and one at the API level. I'm fine with the API-level decision. My point referred to the recommendations for using protos: […]

Contributor: Ah, if this is the style guide of protobuf, let's keep it.

Contributor: And the server can simply fail if it sees […]

Contributor: Yes, exactly.
```diff
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.join(connectTestRelation2))
+    }
+    val sparkPlan = sparkTestRelation.join(sparkTestRelation2)
+    comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+    val connectPlan2 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.join(connectTestRelation2, condition = None))
+    }
+    val sparkPlan2 = sparkTestRelation.join(sparkTestRelation2, condition = None)
+    comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    for ((t, y) <- Seq(
+        (JoinType.JOIN_TYPE_LEFT_OUTER, LeftOuter),
+        (JoinType.JOIN_TYPE_RIGHT_OUTER, RightOuter),
+        (JoinType.JOIN_TYPE_FULL_OUTER, FullOuter),
+        (JoinType.JOIN_TYPE_LEFT_ANTI, LeftAnti),
+        (JoinType.JOIN_TYPE_LEFT_SEMI, LeftSemi),
+        (JoinType.JOIN_TYPE_INNER, Inner))) {
+      val connectPlan3 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.join(connectTestRelation2, t))
+      }
+      val sparkPlan3 = sparkTestRelation.join(sparkTestRelation2, y)
+      comparePlans(connectPlan3.analyze, sparkPlan3.analyze, false)
+    }
+  }
+
   private def createLocalRelationProto(attrs: Seq[AttributeReference]): proto.Relation = {
     val localRelationBuilder = proto.LocalRelation.newBuilder()
     // TODO: set data types for each local relation attribute one proto supports data type.
```
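The review discussion about the unspecified enum value can be sketched in isolation. Below is a hypothetical, self-contained Scala sketch, not the actual Spark Connect planner code: the names `JoinTypeMapping`, `ProtoJoinType`, `toLogical`, and the enum numbering are all illustrative. It shows the two points agreed on above: the first enum value (wire default 0) is UNSPECIFIED per the proto style guide, and the server-side mapping fails fast on it instead of silently choosing a default.

```scala
// Hypothetical sketch only -- illustrative names, not Spark Connect's code.
object JoinTypeMapping {
  // Stand-in for the generated proto enum; per the proto style guide the
  // first value (number 0, the wire default) is UNSPECIFIED.
  sealed abstract class ProtoJoinType(val number: Int)
  case object JOIN_TYPE_UNSPECIFIED extends ProtoJoinType(0)
  case object JOIN_TYPE_INNER extends ProtoJoinType(1)
  case object JOIN_TYPE_FULL_OUTER extends ProtoJoinType(2)
  case object JOIN_TYPE_LEFT_OUTER extends ProtoJoinType(3)
  case object JOIN_TYPE_RIGHT_OUTER extends ProtoJoinType(4)
  case object JOIN_TYPE_LEFT_ANTI extends ProtoJoinType(5)
  case object JOIN_TYPE_LEFT_SEMI extends ProtoJoinType(6)

  // Stand-in for Catalyst's logical join types.
  sealed trait LogicalJoinType
  case object Inner extends LogicalJoinType
  case object FullOuter extends LogicalJoinType
  case object LeftOuter extends LogicalJoinType
  case object RightOuter extends LogicalJoinType
  case object LeftAnti extends LogicalJoinType
  case object LeftSemi extends LogicalJoinType

  // Server-side mapping: fail fast on UNSPECIFIED instead of silently
  // picking a default, so an unset field surfaces as a client bug.
  def toLogical(t: ProtoJoinType): LogicalJoinType = t match {
    case JOIN_TYPE_INNER       => Inner
    case JOIN_TYPE_FULL_OUTER  => FullOuter
    case JOIN_TYPE_LEFT_OUTER  => LeftOuter
    case JOIN_TYPE_RIGHT_OUTER => RightOuter
    case JOIN_TYPE_LEFT_ANTI   => LeftAnti
    case JOIN_TYPE_LEFT_SEMI   => LeftSemi
    case JOIN_TYPE_UNSPECIFIED =>
      throw new IllegalArgumentException(
        s"Join type must be set explicitly, got enum number ${t.number}")
  }
}
```

The design choice mirrors the thread: keeping UNSPECIFIED in the enum avoids ambiguous language-specific defaults in generated code, while the exhaustive match keeps the mapping total for every legitimate join type.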
Contributor: Can you please keep the package import? This makes it easier to read where the specific classes come from, in particular when they have similar names. So it's easier to read, seeing `logical.JoinType` and `proto.JoinType` for example.

Author: It is actually still there :). The IDE seems to auto-pack things into `import org.apache.spark.sql.catalyst.plans.{logical, Cross, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter}`.
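The other side of the earlier discussion, that a client's DataFrame-style API may omit the join type but the request it sends must carry a concrete value, can be sketched the same way. This is a hypothetical Scala sketch (the names `ClientJoinDefault`, `buildJoin`, and `JoinRequest` are illustrative, not the actual client code), assuming Inner as the agreed client-side default.

```scala
// Hypothetical client-side sketch -- illustrative names only.
object ClientJoinDefault {
  sealed trait JoinType
  case object Unspecified extends JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType

  final case class JoinRequest(joinType: JoinType)

  // Callers of the DataFrame-style API may omit the join type, but the
  // request actually sent always carries a concrete value (Inner by
  // default) and never Unspecified.
  def buildJoin(joinType: JoinType = Inner): JoinRequest = {
    require(joinType != Unspecified, "client must never send UNSPECIFIED")
    JoinRequest(joinType)
  }
}
```

With this split, the server can treat UNSPECIFIED strictly as an error while users of the client API still get the convenient inner-join default.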