-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-42468][CONNECT] Implement agg by (String, String)* #40057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as we merge #40050 we should just use the functions in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you don't have to convert to DataFrame here (as soon as we introduce encoders it might actually be a bit faster if we don't). You could also pass in the columns as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is about if we want to keep the same class signature of RelationalGroupedDataset as what it is in SQL. If such class as protected/private class is not needed to match SQL ones, then it is ok to passing in more closer classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor does not need to have the same signature since an end-user is not supposed to instantiate this thing. BTW you are already breaking the signature because we use proto.Expression instead catalyst.Expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I guess the major thing probably is because this is not a public API.
How about let me follow up in future PRs on what is the final class signature for RelationalGroupedDataset? There are a lot more API to add in this class.
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
Outdated
Show resolved
Hide resolved
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
Outdated
Show resolved
Hide resolved
connector/connect/common/src/test/resources/query-tests/explain-results/groupby_agg.explain
Outdated
Show resolved
Hide resolved
hvanhovell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nits, looks pretty good!
hvanhovell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| .setIsDistinct(false) | ||
| // Also special handle count because we need to take care count(*). | ||
| case "count" | "size" => | ||
| // Turn count(*) into count(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to take care of count(*)? we don't need it in python client #39622 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to match existing scala side Dataframe impl. @cloud-fan do you know if we need count(*) to count(1) conversion? If not we can both change here and existing DF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to match existing scala side Dataframe impl.
LGTM, we can update them later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we don't need it. We can address when we replace this stuff by the functions API.
### What changes were proposed in this pull request? Starting to support basic aggregation in Scala client. The first step is to support aggregation by strings. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #40057 from amaliujia/rw-agg. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit cc471a5) Signed-off-by: Herman van Hovell <herman@databricks.com>
### What changes were proposed in this pull request? Starting to support basic aggregation in Scala client. The first step is to support aggregation by strings. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes apache#40057 from amaliujia/rw-agg. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit cc471a5) Signed-off-by: Herman van Hovell <herman@databricks.com>
What changes were proposed in this pull request?
Starting to support basic aggregation in Scala client. The first step is to support aggregation by strings.
Why are the changes needed?
API coverage
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT