[SPARK-41230][CONNECT][PYTHON] Remove str from Aggregate expression type
#38768
|
I think it is time to refactor. We'd better match the API shape first; otherwise it will be hard to do so after we make it more complicated. cc @HyukjinKwon |
grundprinzip
left a comment
I think directionally you're right: this is the better approach. Plan objects should only deal with other plans and expressions; conversion to expressions should happen before the plan is constructed.
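To illustrate the layering described here, a minimal sketch follows. The names `Expression`, `Aggregate`, `to_expression`, and `group_by_agg` are hypothetical stand-ins for illustration, not the actual Spark Connect classes: the plan node rejects raw strings, and the conversion happens in a DataFrame-level helper before the plan is built.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class Expression:
    """Hypothetical stand-in for an expression node."""
    name: str


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical plan node: it holds expressions only, never raw strings."""
    grouping: Sequence[Expression]
    measures: Sequence[Expression]

    def __post_init__(self) -> None:
        # Reject anything that was not converted before plan construction.
        for e in (*self.grouping, *self.measures):
            if not isinstance(e, Expression):
                raise TypeError(f"expected Expression, got {type(e).__name__}")


def to_expression(col: Union[Expression, str]) -> Expression:
    # Conversion lives in the DataFrame layer (assumed: a str is a column name).
    return col if isinstance(col, Expression) else Expression(col)


def group_by_agg(
    group_cols: Sequence[Union[Expression, str]],
    measure_cols: Sequence[Union[Expression, str]],
) -> Aggregate:
    # The plan only ever sees Expression objects.
    return Aggregate(
        grouping=[to_expression(c) for c in group_cols],
        measures=[to_expression(c) for c in measure_cols],
    )
```

With this split, passing a raw string straight to the plan node fails early with a `TypeError`, while the DataFrame-level helper accepts both forms.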
| def _map_cols_to_expression(self, fun: str, col: Union[Column, str]) -> Sequence[Expression]: | |
| def _map_cols_to_expression(self, fun: str, col: Union[Expression, str]) -> Sequence[Expression]: |
| def min(self, col: Union[Column, str]) -> "DataFrame": | |
| def min(self, col: Union[Expression, str]) -> "DataFrame": |
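For context, a helper with the suggested `Union[Expression, str]` signature could look roughly like the sketch below. This is an assumption about the shape of such a helper, not the code in the PR; `ColumnRef` and `FunctionCall` are hypothetical node types, and a plain string is assumed to name a column.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


class Expression:
    """Hypothetical base class for expression nodes."""


@dataclass(frozen=True)
class ColumnRef(Expression):
    name: str


@dataclass(frozen=True)
class FunctionCall(Expression):
    fun: str
    args: Tuple[Expression, ...]


def map_cols_to_expression(fun: str, col: Union[Expression, str]) -> Sequence[Expression]:
    # A plain string is treated as a column name here (an assumption).
    arg = col if isinstance(col, Expression) else ColumnRef(col)
    # Wrap the argument in a call to the named aggregate function.
    return [FunctionCall(fun, (arg,))]
```

Both `map_cols_to_expression("min", "price")` and `map_cols_to_expression("min", ColumnRef("price"))` then produce the same single-element sequence of expressions.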
python/pyspark/sql/connect/plan.py
Outdated
There is no longer a way to call this with empty measures?
This is a sequence now, and I think it can have len=0, which means empty measures?
|
@zhengruifeng sure maybe we can do that refactoring in this PR directly. the plain |
|
@zhengruifeng @grundprinzip can you take another look? |
Not related to this PR, but shall we rename it to GroupedData to match PySpark?
Renamed
grundprinzip
left a comment
One nit.
In OSS we have an assert here
>>> df.groupBy("state").agg()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/group.py", line 162, in agg
    assert exprs, "exprs should not be empty"
AssertionError: exprs should not be empty
added.
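For reference, the guard under discussion can be sketched as below. The `agg` function here is a hypothetical standalone helper that takes strings only for brevity; it reuses the assertion message from the traceback quoted from `pyspark/sql/group.py` and is not the Connect implementation itself.

```python
from typing import List


def agg(*exprs: str) -> List[str]:
    # Same check and message as the assertion quoted from group.py.
    assert exprs, "exprs should not be empty"
    return list(exprs)
```

Calling `agg()` with no arguments raises `AssertionError`, while `agg("max(price)")` passes the guard. Note that `assert` statements are skipped when Python runs with `-O`.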
|
Re: #38768 Yes, please let's refactor |
b200e35 to
56cd19b
Compare
|
This PR blocks the column refactoring. I will refactor based on the change in this PR, which will in turn unblock functions. |
|
merged into master |
… type ### What changes were proposed in this pull request? This PR proposes that Relations (e.g. Aggregate in this PR) should only deal with `Expression` than `str`. `str` could be mapped to different expressions (e.g. sql expression, unresolved_attribute, etc.). Relations are not supposed to understand the difference of `str` but DataFrame should understand it. This PR specifically changes for `Aggregate`. ### Why are the changes needed? Codebase refactoring. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes apache#38768 from amaliujia/SPARK-41230. Authored-by: Rui Wang <rui.wang@databricks.com> Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
### What changes were proposed in this pull request?
This PR proposes that Relations (e.g. `Aggregate` in this PR) should only deal with `Expression` rather than `str`. A `str` could be mapped to different expressions (e.g. SQL expression, unresolved_attribute, etc.). Relations are not supposed to understand the different meanings of a `str`, but DataFrame should. This PR specifically makes this change for `Aggregate`.
### Why are the changes needed?
Codebase refactoring.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
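The point in the description that a `str` can map to different expressions can be sketched as follows. `UnresolvedAttribute`, `SQLExpression`, `as_column`, and `as_sql` are hypothetical names used only for illustration; the sketch shows why a relation cannot guess the intent of a bare string and why the caller has to pick the conversion.

```python
from dataclasses import dataclass


class Expression:
    """Hypothetical base class for expression nodes."""


@dataclass(frozen=True)
class UnresolvedAttribute(Expression):
    # The string is taken literally as a column name.
    name: str


@dataclass(frozen=True)
class SQLExpression(Expression):
    # The string is a SQL fragment to be parsed later.
    sql: str


def as_column(value: str) -> UnresolvedAttribute:
    return UnresolvedAttribute(value)


def as_sql(value: str) -> SQLExpression:
    return SQLExpression(value)


# The same text yields two different expressions depending on the caller's intent.
text = "price + 1"
column_version = as_column(text)
sql_version = as_sql(text)
```

Because `column_version` and `sql_version` are distinct nodes built from identical text, only the layer that knows the user's intent (the DataFrame API in the PR's framing) can choose between them.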