-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement #22620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @BryanCutler, @icexelloss, @gatorsmile and @cloud-fan |
|
ok to test |
|
retest this please |
python/pyspark/sql/udf.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the "_ =" thing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hides the output like ...
>>> spark.udf.register("sum_udf", sum_udf)
<function sum_udf at 0x103ff18c0>
in the doctest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha. I see..
|
LGTM |
|
Test build #96896 has finished for PR 22620 at commit
|
python/pyspark/sql/udf.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about SQL_WINDOW_AGG_PANDAS_UDF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need it here:
Users specify GROUPED_AGG only. GROUPED_AGG is turned to WINDOW_AGG eval type in WindowInPandasExec.
Admittedly, there is a bit confusion here we can improve. We just haven't got a user specified udf type that maps to multiple evalType before WINDOW_AGG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need to be clearly defined in Apache Spark 3.0 release; otherwise, it might be confusing to both developers and end users. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened https://issues.apache.org/jira/browse/SPARK-25640 to track this.
To be clear, this is transparent to end users, but I agree it can be confusing to developers.
97d0377 to
f36bc03
Compare
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending tests. Looks like a test expected a specific error msg
AssertionError: "f must be either SQL_BATCHED_UDF or SQL_SCALAR_PANDAS_UDF" does not match "Invalid f: f must be SQL_BATCHED_UDF, SQL_SCALAR_PANDAS_UDF or SQL_GROUPED_AGG_PANDAS_UDF"
|
Test build #96916 has finished for PR 22620 at commit
|
|
Test build #96917 has finished for PR 22620 at commit
|
|
LGTM |
|
Thank you @icexelloss, @gatorsmile, @BryanCutler and @viirya. |
|
Merged to master and branch-2.4. |
…for SQL Statement
## What changes were proposed in this pull request?
This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance:
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
pandas_udf("integer", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
return v.sum()
spark.udf.register("sum_udf", sum_udf)
q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()
```
```
+---+-----------+
| v2|sum_udf(v1)|
+---+-----------+
| 1| 1|
| 0| 5|
+---+-----------+
```
## How was this patch tested?
Manual test and unit test.
Closes #22620 from HyukjinKwon/SPARK-25601.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
…for SQL Statement
## What changes were proposed in this pull request?
This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance:
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
pandas_udf("integer", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
return v.sum()
spark.udf.register("sum_udf", sum_udf)
q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()
```
```
+---+-----------+
| v2|sum_udf(v1)|
+---+-----------+
| 1| 1|
| 0| 5|
+---+-----------+
```
## How was this patch tested?
Manual test and unit test.
Closes #22620 from HyukjinKwon/SPARK-25601.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
…for SQL Statement
## What changes were proposed in this pull request?
This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance:
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
pandas_udf("integer", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
return v.sum()
spark.udf.register("sum_udf", sum_udf)
q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()
```
```
+---+-----------+
| v2|sum_udf(v1)|
+---+-----------+
| 1| 1|
| 0| 5|
+---+-----------+
```
## How was this patch tested?
Manual test and unit test.
Closes apache#22620 from HyukjinKwon/SPARK-25601.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
…for SQL Statement
## What changes were proposed in this pull request?
This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance:
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
pandas_udf("integer", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
return v.sum()
spark.udf.register("sum_udf", sum_udf)
q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()
```
```
+---+-----------+
| v2|sum_udf(v1)|
+---+-----------+
| 1| 1|
| 0| 5|
+---+-----------+
```
## How was this patch tested?
Manual test and unit test.
Closes apache#22620 from HyukjinKwon/SPARK-25601.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance:
How was this patch tested?
Manual test and unit test.