[SPARK-40534][CONNECT] Extend the support for Join with different join types #38157
Conversation
R: @cloud-fan
import org.apache.spark.sql.catalyst.analysis.{UnresolvedAlias, UnresolvedAttribute, UnresolvedFunction, UnresolvedRelation, UnresolvedStar}
import org.apache.spark.sql.catalyst.expressions
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical
can you please keep the package import? This makes it easier to read where the specific classes come from, in particular when they have similar names.
That way it's easier to read, for example, logical.JoinType versus proto.JoinType.
It is actually still there :). The IDE seems to auto-pack everything into import org.apache.spark.sql.catalyst.plans.{logical, Cross, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter}
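For illustration, a minimal sketch of the package-qualified style being asked for (the helper functions are hypothetical, and the method names on the proto side are assumed from the generated Connect classes):

```scala
import org.apache.spark.connect.proto
import org.apache.spark.sql.catalyst.plans.logical

// Keeping the package imports makes similarly named classes easy to tell apart
// at the call site: logical.Join is the Catalyst plan node, proto.Join is the
// Connect protobuf message.
def isCatalystJoin(plan: logical.LogicalPlan): Boolean = plan.isInstanceOf[logical.Join]
def isConnectJoin(rel: proto.Relation): Boolean = rel.hasJoin
```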
}

test("Basic joins with different join types") {
  val connectPlan = {
can you add a test-case for unspecified as well to see that we catch the error?
I discussed with @cloud-fan and we decided to remove both JoinType.CROSS and JoinType.Unspecified.
The proto is our API, and we should make the proto itself less ambiguous. For example, a plan submitted to the server should not contain JoinType.Unspecified: the join type is either set explicitly (inner, left outer, etc.) or not set at all, in which case Spark treats it as an inner join by default.
The reason for unspecified is not the proto contract but the behavior of the different auto-generated language targets. To avoid issues with defaults, the usual proto style guides recommend always making the first element of an enum the unspecified value.
Cc @cloud-fan
If the proto is an API, I'd say the join type is a required field and clients must set it in the join plan. For the Python client, its DataFrame API can omit the join type, and the client should then use INNER as the default join type.
There are two layers to this discussion: one on the proto infra level and one on the API level. I'm fine with the API-level decision.
My point referred to the recommendations when using protos (a minimal sketch follows the list):
- All enums must have an unspecified element due to the different ways languages deal with default construction (cf https://developers.google.com/protocol-buffers/docs/style#enums)
- There is no required field, all elements are always optional in proto3.
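As a rough illustration of the default-construction point, a sketch assuming the builder and enum names of the generated Connect proto classes (the two input relations are placeholders):

```scala
import org.apache.spark.connect.proto

// Placeholder inputs standing in for the two relations being joined.
val leftInput = proto.Relation.newBuilder().build()
val rightInput = proto.Relation.newBuilder().build()

// In proto3 an unset enum field reads back as the zero value, i.e. the first
// declared element. Making that element JOIN_TYPE_UNSPECIFIED keeps "not set"
// distinguishable from every real join type.
val joinWithoutType = proto.Join.newBuilder()
  .setLeft(leftInput)
  .setRight(rightInput)
  .build()

assert(joinWithoutType.getJoinType == proto.Join.JoinType.JOIN_TYPE_UNSPECIFIED)
```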
Ah, if this is the protobuf style guide, let's keep it.
And the server can simply fail if it sees UNSPECIFIED join type?
Yes exactly.
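A rough sketch (not the actual planner code) of how the server-side translation could reject the unspecified value, assuming the proto enum names discussed in this thread:

```scala
import org.apache.spark.connect.proto
import org.apache.spark.sql.catalyst.plans.{FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter}

// Map the Connect proto join type onto the Catalyst JoinType, failing fast on
// JOIN_TYPE_UNSPECIFIED (or any value the server does not recognize).
def transformJoinType(t: proto.Join.JoinType): JoinType = t match {
  case proto.Join.JoinType.JOIN_TYPE_INNER => Inner
  case proto.Join.JoinType.JOIN_TYPE_LEFT_OUTER => LeftOuter
  case proto.Join.JoinType.JOIN_TYPE_RIGHT_OUTER => RightOuter
  case proto.Join.JoinType.JOIN_TYPE_FULL_OUTER => FullOuter
  case proto.Join.JoinType.JOIN_TYPE_LEFT_ANTI => LeftAnti
  case proto.Join.JoinType.JOIN_TYPE_LEFT_SEMI => LeftSemi
  case other => throw new IllegalArgumentException(s"Join type $other is not supported")
}
```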
@cloud-fan comment addressed.
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
This will need an unspecified element - https://developers.google.com/protocol-buffers/docs/style#enums
grundprinzip left a comment
Please follow the style guide for the enum values.
Can one of the admins verify this patch?
Force-pushed from 3a131bc to 80a7ede.
I added the JOIN_TYPE_UNSPECIFIED. BTW I found
thanks, merging to master! |
…ient

### What changes were proposed in this pull request?
Following up #38157, add all join types support for python. Please note that in the PR we decided to not support CROSS Join now.

### Why are the changes needed?
Finish the join type support in Connect.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT

Closes #38243 from amaliujia/python_join_types.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
What changes were proposed in this pull request?
Currently, Join only supports the inner type, so this PR adds support for the other join types.

Why are the changes needed?
Extend the support for Join in Connect.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
UT
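For context, a hypothetical sketch of how a client could set an explicit join type on the proto (builder and enum names assumed from the generated Connect classes; the two relations are placeholders):

```scala
import org.apache.spark.connect.proto

// Placeholder inputs standing in for the two relations being joined.
val left = proto.Relation.newBuilder().build()
val right = proto.Relation.newBuilder().build()

// The client sets an explicit join type; per the review discussion, the server
// rejects JOIN_TYPE_UNSPECIFIED, so clients that want the default behavior set
// JOIN_TYPE_INNER themselves.
val leftOuterJoin = proto.Join.newBuilder()
  .setLeft(left)
  .setRight(right)
  .setJoinType(proto.Join.JoinType.JOIN_TYPE_LEFT_OUTER)
  .build()
```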