Skip to content

Conversation

@amaliujia
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes that Relations (e.g. Aggregate in this PR) should only deal with Expression than str. str could be mapped to different expressions (e.g. sql expression, unresolved_attribute, etc.). Relations are not supposed to understand the difference of str but DataFrame should understand it.

This PR specifically changes for Aggregate.

Why are the changes needed?

Codebase refactoring.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@zhengruifeng
Copy link
Contributor

I think it is time to refactor Column to something like:

class Column:

    def __init__(self, expr: Expression) -> None:
        self.expr = expr
    ...

we'd better try to match the API shape at first, otherwise it will be hard to do so after we make it more complicated.

cc @HyukjinKwon

Copy link
Contributor

@grundprinzip grundprinzip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think directional you're right this is the better approach. Plan object should only deal with other plans and expressions, conversion to expressions should happen before the plan is constructed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _map_cols_to_expression(self, fun: str, col: Union[Column, str]) -> Sequence[Expression]:
def _map_cols_to_expression(self, fun: str, col: Union[Expression, str]) -> Sequence[Expression]:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def min(self, col: Union[Column, str]) -> "DataFrame":
def min(self, col: Union[Expression, str]) -> "DataFrame":

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no longer a way to call this with empty measures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a sequence now and I think it can be len=0 which is empty measures?

@amaliujia
Copy link
Contributor Author

@zhengruifeng sure maybe we can do that refactoring in this PR directly.

the plain str passing through to relations became a blocker for that refactoring (given there is no place for a plain str in expression system, it must be wrapped).

@amaliujia
Copy link
Contributor Author

@zhengruifeng @grundprinzip can you take another look?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR, but shall we rename it GroupedData to be the same with pyspark?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed

Copy link
Contributor

@grundprinzip grundprinzip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In OSS we have an assert here

>>> df.groupBy("state").agg()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/group.py", line 162, in agg
    assert exprs, "exprs should not be empty"
AssertionError: exprs should not be empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Nov 24, 2022

Re: #38768

Yes, please let's refactor Column as we offline discussed. This refactoring has to be done before starting working on functions cc @xinrong-meng too.

@amaliujia
Copy link
Contributor Author

amaliujia commented Nov 24, 2022

This PR is a blocker for the refactoring for column. I will refactor based on the change in this PR, which will further unblock functions.

@zhengruifeng
Copy link
Contributor

merged into master

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
… type

### What changes were proposed in this pull request?

This PR proposes that Relations (e.g. Aggregate in this PR) should only deal with  `Expression` than `str`. `str` could be mapped to different expressions (e.g. sql expression, unresolved_attribute, etc.). Relations are not supposed to understand the difference of `str` but DataFrame should understand it.

This PR specifically changes for `Aggregate`.

### Why are the changes needed?

Codebase refactoring.

### Does this PR introduce _any_ user-facing change?

 No

### How was this patch tested?

UT

Closes apache#38768 from amaliujia/SPARK-41230.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
… type

### What changes were proposed in this pull request?

This PR proposes that Relations (e.g. Aggregate in this PR) should only deal with  `Expression` than `str`. `str` could be mapped to different expressions (e.g. sql expression, unresolved_attribute, etc.). Relations are not supposed to understand the difference of `str` but DataFrame should understand it.

This PR specifically changes for `Aggregate`.

### Why are the changes needed?

Codebase refactoring.

### Does this PR introduce _any_ user-facing change?

 No

### How was this patch tested?

UT

Closes apache#38768 from amaliujia/SPARK-41230.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants