[SPARK-41230][CONNECT][PYTHON] Remove `str` from Aggregate expression type #38768

amaliujia · 2022-11-23T07:02:17Z

What changes were proposed in this pull request?

This PR proposes that Relations (e.g. Aggregate in this PR) should only deal with `Expression` rather than `str`. A `str` could be mapped to different expressions (e.g. a SQL expression, an unresolved_attribute, etc.). Relations are not supposed to understand the different meanings of a `str`; DataFrame should.

This PR makes this change specifically for Aggregate.
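
To illustrate the ambiguity described above, here is a minimal sketch. The classes `Expression`, `UnresolvedAttribute`, and `SQLExpression` are hypothetical stand-ins written for this example, not the actual classes in `pyspark.sql.connect`; the point is only that one string can legitimately become two different expressions.

```python
from dataclasses import dataclass


class Expression:
    """Hypothetical stand-in for an expression base class."""


@dataclass(frozen=True)
class UnresolvedAttribute(Expression):
    # Hypothetical: the string is interpreted as a column name.
    name: str


@dataclass(frozen=True)
class SQLExpression(Expression):
    # Hypothetical: the string is interpreted as a SQL fragment to be parsed.
    sql: str


# The same Python str can carry two different meanings, and only the
# caller (the DataFrame API method being used) knows which one is intended.
text = "price"
as_column = UnresolvedAttribute(text)
as_sql = SQLExpression(text)
```

A relation that received only `"price"` would have to guess between the two; receiving an already-built expression removes the guess.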

Why are the changes needed?

Codebase refactoring.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests (UT).

zhengruifeng · 2022-11-23T08:21:26Z

I think it is time to refactor Column to something like:

class Column:
    def __init__(self, expr: Expression) -> None:
        self.expr = expr

    ...

We'd better match the API shape first; otherwise it will be hard to do once we make things more complicated.

cc @HyukjinKwon

grundprinzip

I think directionally you're right, this is the better approach. Plan objects should only deal with other plans and expressions; conversion to expressions should happen before the plan is constructed.
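
A rough sketch of that boundary, under stated assumptions: `Expression`, `UnresolvedAttribute`, `Aggregate`, `to_expression`, and `build_aggregate` are all hypothetical names invented for this illustration and do not reflect the real Spark Connect classes. The conversion from `str` happens in a DataFrame-level helper, and the plan node refuses anything that is not already an expression.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


class Expression:
    """Hypothetical stand-in for an expression base class."""


@dataclass(frozen=True)
class UnresolvedAttribute(Expression):
    # Hypothetical: a column reference by name.
    name: str


@dataclass(frozen=True)
class Aggregate:
    # Hypothetical plan node: it accepts expressions only, never plain str.
    grouping: Tuple[Expression, ...]
    measures: Tuple[Expression, ...]

    def __post_init__(self) -> None:
        for expr in (*self.grouping, *self.measures):
            if not isinstance(expr, Expression):
                raise TypeError(f"expected Expression, got {type(expr).__name__}")


def to_expression(col: Union[Expression, str]) -> Expression:
    """DataFrame-level conversion, done before any plan node is built."""
    if isinstance(col, Expression):
        return col
    # Assumption for this sketch: a bare str names a column.
    return UnresolvedAttribute(col)


def build_aggregate(
    grouping: Sequence[Union[Expression, str]],
    measures: Sequence[Union[Expression, str]],
) -> Aggregate:
    """What a DataFrame method might do: convert first, then construct the plan."""
    return Aggregate(
        grouping=tuple(to_expression(c) for c in grouping),
        measures=tuple(to_expression(c) for c in measures),
    )


plan = build_aggregate(["state"], [UnresolvedAttribute("amount")])
```

With this split, passing a raw string straight to `Aggregate` fails fast with a `TypeError`, while the DataFrame-facing helper still accepts strings for convenience.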

grundprinzip · 2022-11-23T10:08:18Z

python/pyspark/sql/connect/dataframe.py

Suggested change

- def _map_cols_to_expression(self, fun: str, col: Union[Column, str]) -> Sequence[Expression]:
+ def _map_cols_to_expression(self, fun: str, col: Union[Expression, str]) -> Sequence[Expression]:

grundprinzip · 2022-11-23T10:08:31Z

python/pyspark/sql/connect/dataframe.py

Suggested change

- def min(self, col: Union[Column, str]) -> "DataFrame":
+ def min(self, col: Union[Expression, str]) -> "DataFrame":


grundprinzip · 2022-11-23T10:11:22Z

python/pyspark/sql/connect/plan.py

There is no longer a way to call this with empty measures?

This is a sequence now, so I think it can have len=0, which represents empty measures?
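
The point about an empty sequence can be sketched as follows. The `Aggregate` class and its `has_measures` method here are hypothetical and exist only for this example; they are not the signature in `plan.py`.

```python
from typing import List, Sequence


class Expression:
    """Hypothetical stand-in for an expression base class."""


class Aggregate:
    """Hypothetical plan node whose measures are a (possibly empty) sequence."""

    def __init__(
        self,
        grouping_cols: Sequence[Expression],
        measures: Sequence[Expression],
    ) -> None:
        self.grouping_cols: List[Expression] = list(grouping_cols)
        self.measures: List[Expression] = list(measures)

    def has_measures(self) -> bool:
        # An empty sequence (len == 0) expresses "no measures";
        # no separate None or Optional case is needed.
        return len(self.measures) > 0


no_measures = Aggregate([Expression()], [])
with_measures = Aggregate([Expression()], [Expression()])
```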

amaliujia · 2022-11-23T20:22:49Z

@zhengruifeng Sure, maybe we can do that refactoring in this PR directly.

Passing a plain str through to relations became a blocker for that refactoring (there is no place for a plain str in the expression system, so it must be wrapped).

amaliujia · 2022-11-24T04:15:25Z

@zhengruifeng @grundprinzip can you take another look?


zhengruifeng · 2022-11-24T06:36:45Z

python/pyspark/sql/connect/dataframe.py

Not related to this PR, but shall we rename it to GroupedData, to match PySpark?

grundprinzip

One nit.

grundprinzip · 2022-11-24T07:58:57Z

python/pyspark/sql/connect/dataframe.py

In OSS we have an assert here

>>> df.groupBy("state").agg()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/group.py", line 162, in agg
    assert exprs, "exprs should not be empty"
AssertionError: exprs should not be empty
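
For comparison, a minimal sketch of what mirroring that guard could look like. The `GroupedData` class below is a hypothetical toy written for this example, not the Connect implementation; only the assertion and its message are taken from the traceback above.

```python
from typing import Sequence, Tuple


class Expression:
    """Hypothetical stand-in for an expression base class."""


class GroupedData:
    """Hypothetical toy class; only the empty-exprs guard matters here."""

    def __init__(self, grouping_cols: Sequence[Expression]) -> None:
        self._grouping_cols = tuple(grouping_cols)

    def agg(self, *exprs: Expression) -> Tuple[tuple, tuple]:
        # Same guard and message as the assert shown in the traceback.
        assert exprs, "exprs should not be empty"
        return (self._grouping_cols, exprs)


grouped = GroupedData([Expression()])
try:
    grouped.agg()
    message = None
except AssertionError as err:
    message = str(err)
```

Note that a bare `assert` is skipped when Python runs with `-O`, so an explicit exception may be preferable if the check must always hold.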

HyukjinKwon · 2022-11-24T09:03:57Z

Re: #38768

Yes, please, let's refactor Column as we discussed offline. This refactoring has to be done before we start working on functions. cc @xinrong-meng too.


amaliujia · 2022-11-24T17:57:56Z

This PR is a blocker for the Column refactoring. I will do that refactoring on top of the change in this PR, which will in turn unblock functions.

zhengruifeng · 2022-11-25T01:33:07Z

merged into master

… type

### What changes were proposed in this pull request?

This PR proposes that Relations (e.g. Aggregate in this PR) should only deal with `Expression` rather than `str`. A `str` could be mapped to different expressions (e.g. a SQL expression, an unresolved_attribute, etc.). Relations are not supposed to understand the different meanings of a `str`; DataFrame should.

This PR makes this change specifically for `Aggregate`.

### Why are the changes needed?

Codebase refactoring.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes apache#38768 from amaliujia/SPARK-41230.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

github-actions bot added CONNECT CORE PYTHON SQL labels Nov 23, 2022

grundprinzip reviewed Nov 23, 2022


zhengruifeng reviewed Nov 24, 2022


grundprinzip approved these changes Nov 24, 2022


HyukjinKwon approved these changes Nov 24, 2022


[SPARK-41230][CONNECT][PYTHON] Remove str from Aggregate expression…

56cd19b

… type.

amaliujia force-pushed the SPARK-41230 branch from b200e35 to 56cd19b Compare November 24, 2022 17:57

zhengruifeng closed this in a205e97 Nov 25, 2022

