@amaliujia
Contributor

What changes were proposed in this pull request?

  1. Extend Join support to cover different join types. Before this PR, every join was hardcoded as an inner join; this PR adds support for the other join types.
  2. Add join to the Connect DSL.
  3. Update a few Join proto fields to better reflect their semantics.

Why are the changes needed?

Extend Join support in Connect.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@amaliujia
Contributor Author

R: @cloud-fan

import org.apache.spark.sql.catalyst.analysis.{UnresolvedAlias, UnresolvedAttribute, UnresolvedFunction, UnresolvedRelation, UnresolvedStar}
import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical
Contributor

Can you please keep the package import? It makes it easier to see where specific classes come from, particularly when they have similar names.

For example, the code is easier to read when you see logical.JoinType and proto.JoinType.

Contributor Author

It is actually still there :). The IDE seems to have automatically packed things into import org.apache.spark.sql.catalyst.plans.{logical, Cross, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter}

}

test("Basic joins with different join types") {
val connectPlan = {
Contributor

Can you add a test case for unspecified as well, to check that we catch the error?

Contributor Author

@amaliujia amaliujia Oct 10, 2022

I discussed with @cloud-fan and we decided to remove both JoinType.CROSS and JoinType.Unspecified.

Proto is our API. We should make the proto itself less ambiguous. For example, no proto submitted to the server should contain JoinType.Unspecified. Either a join type with explicit semantics (inner, left outer, etc.) is set, or no join type is set, in which case Spark treats the join as an inner join by default.

Contributor

The reason for unspecified is not the proto contract but the language behavior of the different auto-generated targets. To avoid issues with defaults, the typical proto style guides recommend that the first element of an enum always be unspecified.

Cc @cloud-fan
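
A minimal sketch of the default-value issue described above. In proto3, an enum field that is absent from the message reads back as the value numbered 0, so the reader cannot tell "not set" apart from "explicitly set to the first value". The enum below is a hypothetical stand-in for generated code, and its numbering is illustrative rather than copied from the Connect proto.

```scala
object EnumDefaultSketch {
  // Hypothetical stand-in for a generated proto enum; the numbers are illustrative.
  object JoinType extends Enumeration {
    val JOIN_TYPE_UNSPECIFIED = Value(0)
    val JOIN_TYPE_INNER = Value(1)
    val JOIN_TYPE_LEFT_OUTER = Value(2)
  }

  // Models how proto3 reads an enum field: None means the field was absent
  // from the message, in which case the zero-numbered value is returned.
  def readJoinType(wireValue: Option[Int]): JoinType.Value =
    JoinType(wireValue.getOrElse(0))
}
```

With an unspecified element at 0, a missing field surfaces as JOIN_TYPE_UNSPECIFIED instead of silently becoming a real join type.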

Contributor

If the proto is an API, I'd say the join type is a required field and clients must set the join type in the join plan. The Python client's DataFrame API can omit the join type, in which case the Python client should use INNER as the default join type.
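
A rough sketch of that split, written in Scala for consistency with the rest of this PR: the client-facing API lets the caller omit the join type, but the plan it builds always carries an explicit one. All names here (ClientJoinSketch, JoinPlan, join) are hypothetical and do not come from the Connect client.

```scala
object ClientJoinSketch {
  // Hypothetical join types; only a subset is listed for brevity.
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType

  // Hypothetical plan node: the join type is always present once the plan is built.
  final case class JoinPlan(left: String, right: String, joinType: JoinType)

  // The caller may omit joinType; the client fills in Inner before the plan is sent.
  def join(left: String, right: String, joinType: JoinType = Inner): JoinPlan =
    JoinPlan(left, right, joinType)
}
```

The default lives in the client, so the server never has to guess what an absent join type means.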

Contributor

There are two layers to this discussion: one at the proto infra level and one at the API level. I'm fine with the API-level decision.

My point referred to the recommendations when using protos:

Contributor

Ah, if this is the protobuf style guide, let's keep it.

Contributor

And the server can simply fail if it sees an UNSPECIFIED join type?

Contributor

Yes exactly.
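
A minimal sketch of the server-side behavior agreed on here, assuming a translation step that maps each proto join type to a plan join type and rejects UNSPECIFIED. The types and the transformJoinType name are hypothetical stand-ins so the example runs without Spark on the classpath; this is not the planner code from this PR.

```scala
object JoinTypeMappingSketch {
  // Hypothetical stand-ins for the proto enum values.
  sealed trait ProtoJoinType
  case object ProtoUnspecified extends ProtoJoinType
  case object ProtoInner extends ProtoJoinType
  case object ProtoFullOuter extends ProtoJoinType
  case object ProtoLeftOuter extends ProtoJoinType
  case object ProtoRightOuter extends ProtoJoinType
  case object ProtoLeftAnti extends ProtoJoinType
  case object ProtoLeftSemi extends ProtoJoinType

  // Hypothetical stand-ins for the plan-side join types named in this thread.
  sealed trait PlanJoinType
  case object Inner extends PlanJoinType
  case object FullOuter extends PlanJoinType
  case object LeftOuter extends PlanJoinType
  case object RightOuter extends PlanJoinType
  case object LeftAnti extends PlanJoinType
  case object LeftSemi extends PlanJoinType

  // Translate a proto join type; UNSPECIFIED is an error, not an implicit inner join.
  def transformJoinType(t: ProtoJoinType): PlanJoinType = t match {
    case ProtoInner => Inner
    case ProtoFullOuter => FullOuter
    case ProtoLeftOuter => LeftOuter
    case ProtoRightOuter => RightOuter
    case ProtoLeftAnti => LeftAnti
    case ProtoLeftSemi => LeftSemi
    case ProtoUnspecified =>
      throw new IllegalArgumentException("Join type must be set explicitly, got UNSPECIFIED")
  }
}
```

Because the proto-side trait is sealed, the compiler can warn when a newly added join type is missing from the match.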

@amaliujia
Contributor Author

@cloud-fan comment addressed.

Contributor

This will need an unspecified element - https://developers.google.com/protocol-buffers/docs/style#enums

Contributor

@grundprinzip grundprinzip left a comment

Please follow the style guide for the enum values.

@amaliujia
Contributor Author

I added JOIN_TYPE_UNSPECIFIED.

BTW, I found that JOIN_TYPE_UNSPECIFIED is already tested by SparkPlannerConnectSuite.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 67c6408 Oct 11, 2022
@amaliujia amaliujia deleted the SPARK-40534 branch October 11, 2022 02:23
zhengruifeng pushed a commit that referenced this pull request Oct 17, 2022
…ient

### What changes were proposed in this pull request?

Following up on #38157, add support for all join types in Python. Please note that in that PR we decided not to support CROSS join for now.

### Why are the changes needed?

Finish the join type support in Connect.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT

Closes #38243 from amaliujia/python_join_types.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>