@amaliujia
Contributor

What changes were proposed in this pull request?

This PR starts adding support for basic aggregation in the Scala client. The first step is to support aggregation by strings.

Why are the changes needed?

API coverage

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests (UT)

@amaliujia
Contributor Author

@hvanhovell

Contributor

As soon as we merge #40050 we should just use the functions in there.

Contributor

@hvanhovell hvanhovell Feb 16, 2023

Nit: you don't have to convert to DataFrame here (as soon as we introduce encoders it might actually be a bit faster if we don't). You could also pass in the columns as is.

Contributor Author

This is about whether we want to keep the same class signature for RelationalGroupedDataset as it has in SQL. If a protected/private class like this does not need to match the SQL one, then it is OK to pass in classes that are a closer fit.

Contributor

The constructor does not need to have the same signature, since an end user is not supposed to instantiate this thing. BTW, you are already breaking the signature because we use proto.Expression instead of catalyst.Expression.

Contributor Author

Yeah, I guess the main point is that this is not a public API.

How about I follow up in future PRs on the final class signature for RelationalGroupedDataset? There are a lot more APIs to add to this class.

Contributor

@hvanhovell hvanhovell left a comment

A couple of nits, looks pretty good!

Contributor

@hvanhovell hvanhovell left a comment

LGTM

.setIsDistinct(false)
// Also special handle count because we need to take care count(*).
case "count" | "size" =>
// Turn count(*) into count(1)

Contributor

Do we still need to take care of count(*)? We don't need it in the Python client: #39622 (comment)

Contributor Author

@amaliujia amaliujia Feb 17, 2023

This is to match existing scala side Dataframe impl. @cloud-fan, do you know if we need the count(*) to count(1) conversion? If not, we can change both this code and the existing DataFrame implementation.

Contributor

> This is to match existing scala side Dataframe impl.

LGTM, we can update them later

Contributor

Yeah, we don't need it. We can address when we replace this stuff by the functions API.
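
For readers following the count(*) thread, here is a minimal, self-contained sketch of the string-to-expression mapping under discussion. It is not the PR's code: `Expr`, `Star`, `Col`, `Lit`, and `Fn` are hypothetical stand-ins for the client's `proto.Expression` messages, and `strToExpr` is an illustrative name. Only the `"count" | "size"` case and the count(*) to count(1) rewrite come from the diff excerpt above.

```scala
import java.util.Locale

object StrAggSketch {
  // Hypothetical stand-ins for the client's expression protos (assumption, not the real API).
  sealed trait Expr
  case object Star extends Expr
  final case class Col(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Fn(name: String, args: Seq[Expr], isDistinct: Boolean = false) extends Expr

  // Turn an aggregate function name given as a string into a function-call expression.
  def strToExpr(fn: String, input: Expr): Expr =
    fn.toLowerCase(Locale.ROOT) match {
      // count is special-cased so that count(*) can be rewritten.
      case "count" | "size" =>
        val arg = input match {
          case Star  => Lit(1) // turn count(*) into count(1)
          case other => other
        }
        Fn("count", Seq(arg))
      // Any other name is passed through as an unresolved function call.
      case name => Fn(name, Seq(input))
    }
}
```

With these stand-ins, `StrAggSketch.strToExpr("count", StrAggSketch.Star)` equals `Fn("count", Seq(Lit(1)))`. If the rewrite turns out to be unnecessary, as the reviewers suggest, the `"count" | "size"` case would only need to normalize the function name.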

hvanhovell pushed a commit that referenced this pull request Feb 17, 2023
### What changes were proposed in this pull request?

Starting to support basic aggregation in Scala client. The first step is to support aggregation by strings.

### Why are the changes needed?

API coverage

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #40057 from amaliujia/rw-agg.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit cc471a5)
Signed-off-by: Herman van Hovell <herman@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023