[SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement #22620

HyukjinKwon · 2018-10-03T11:21:05Z

What changes were proposed in this pull request?

This PR proposes allowing grouped aggregate vectorized (Pandas) UDFs to be registered for use in SQL statements. For instance:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("integer", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
    return v.sum()

spark.udf.register("sum_udf", sum_udf)
q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()

+---+-----------+
| v2|sum_udf(v1)|
+---+-----------+
|  1|          1|
|  0|          5|
+---+-----------+

How was this patch tested?

Tested manually and with unit tests.

HyukjinKwon · 2018-10-03T11:22:14Z

cc @BryanCutler, @icexelloss, @gatorsmile and @cloud-fan

HyukjinKwon · 2018-10-03T17:08:43Z

ok to test

HyukjinKwon · 2018-10-03T17:08:49Z

retest this please

icexelloss · 2018-10-03T18:08:09Z

python/pyspark/sql/udf.py

What is the "_ =" thing here?

It hides output such as

>>> spark.udf.register("sum_udf", sum_udf)
<function sum_udf at 0x103ff18c0>

in the doctest.

Ha. I see..
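To illustrate the exchange above, here is a minimal sketch of why assigning to `_` keeps a doctest quiet. It does not use Spark: `register` is a hypothetical stand-in that, like the call shown above, returns the function it was given, and `count_failures` is a helper introduced only for this illustration.

```python
import doctest


def register(name, f):
    """Hypothetical stand-in for spark.udf.register: returns the function it was given."""
    return f


def sum_udf(v):
    return sum(v)


def count_failures(source):
    """Run a doctest snippet and return how many of its examples failed."""
    globs = {"register": register, "sum_udf": sum_udf}
    test = doctest.DocTestParser().get_doctest(source, globs, "snippet", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report; only the counts matter here.
    results = runner.run(test, out=lambda text: None)
    return results.failed


# The bare call echoes the function's repr, which the doctest does not expect.
bare = '>>> register("sum_udf", sum_udf)\n'
# Assigning to "_" produces no output, so there is nothing to compare.
assigned = '>>> _ = register("sum_udf", sum_udf)\n'

print(count_failures(bare), count_failures(assigned))
```

The bare call fails because the repr contains a memory address that no expected output could match reliably, while the assignment form has no output at all.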

icexelloss · 2018-10-03T18:39:21Z

LGTM

SparkQA · 2018-10-03T19:43:47Z

Test build #96896 has finished for PR 22620 at commit 06a7bd0.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-10-03T20:16:44Z

python/pyspark/sql/udf.py

how about SQL_WINDOW_AGG_PANDAS_UDF?

We don't need it here:

Users specify GROUPED_AGG only. GROUPED_AGG is converted to the WINDOW_AGG eval type in WindowInPandasExec.

Admittedly, there is a bit of confusion here that we can improve. Before WINDOW_AGG, we just hadn't had a user-specified UDF type that maps to multiple eval types.

These need to be clearly defined in the Apache Spark 3.0 release; otherwise, it might be confusing to both developers and end users. :-)

I opened https://issues.apache.org/jira/browse/SPARK-25640 to track this.

To be clear, this is transparent to end users, but I agree it can be confusing to developers.
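As a toy model of the mapping discussed in this thread, the sketch below shows one user-specified type resolving to two eval types depending on where the UDF is used. The function `resolve_eval_type` and its `used_over_window` flag are hypothetical names for illustration only; the actual conversion happens inside Spark (in WindowInPandasExec, per the comment above) and is not implemented this way.

```python
def resolve_eval_type(user_type, used_over_window):
    """Toy model: map the user-facing UDF type to an eval type name.

    Only the GROUPED_AGG case from the discussion is modeled. The names are
    plain strings, not the real PythonEvalType constants.
    """
    if user_type != "GROUPED_AGG":
        raise ValueError("this sketch only models GROUPED_AGG")
    if used_over_window:
        # The user never asks for this type; it is chosen on their behalf.
        return "SQL_WINDOW_AGG_PANDAS_UDF"
    return "SQL_GROUPED_AGG_PANDAS_UDF"


print(resolve_eval_type("GROUPED_AGG", used_over_window=False))
print(resolve_eval_type("GROUPED_AGG", used_over_window=True))
```

This is why registration only needs to accept the grouped aggregate eval type: the window variant never appears as something a user specifies.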

BryanCutler

LGTM pending tests. It looks like a test expected a specific error message:
AssertionError: "f must be either SQL_BATCHED_UDF or SQL_SCALAR_PANDAS_UDF" does not match "Invalid f: f must be SQL_BATCHED_UDF, SQL_SCALAR_PANDAS_UDF or SQL_GROUPED_AGG_PANDAS_UDF"
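The failure quoted above can be reproduced without Spark. The sketch below assumes the registration check raises a `ValueError` carrying the new message; the exception type, the helper names, and the use of plain strings for eval types are assumptions made for illustration. Only the two message strings come from the test output.

```python
import re

# Eval types accepted for SQL registration, as listed in the new error message.
REGISTERABLE = (
    "SQL_BATCHED_UDF",
    "SQL_SCALAR_PANDAS_UDF",
    "SQL_GROUPED_AGG_PANDAS_UDF",
)

NEW_MESSAGE = (
    "Invalid f: f must be SQL_BATCHED_UDF, SQL_SCALAR_PANDAS_UDF "
    "or SQL_GROUPED_AGG_PANDAS_UDF"
)
# The pattern the existing unit test was still looking for.
OLD_PATTERN = "f must be either SQL_BATCHED_UDF or SQL_SCALAR_PANDAS_UDF"


def check_registerable(eval_type_name):
    """Hypothetical check: reject eval types that cannot be registered."""
    if eval_type_name not in REGISTERABLE:
        raise ValueError(NEW_MESSAGE)


def raised_message(eval_type_name):
    """Return the error message raised for eval_type_name, or None if accepted."""
    try:
        check_registerable(eval_type_name)
    except ValueError as error:
        return str(error)
    return None


message = raised_message("SQL_WINDOW_AGG_PANDAS_UDF")
# unittest's assertRaisesRegex uses re.search, so the stale pattern finds nothing.
print(re.search(OLD_PATTERN, message) is None)
```

Updating the expected pattern in the test to match the new wording is enough to resolve this kind of mismatch.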

SparkQA · 2018-10-04T00:35:29Z

Test build #96916 has finished for PR 22620 at commit 97d0377.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-04T00:40:48Z

Test build #96917 has finished for PR 22620 at commit f36bc03.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-10-04T00:51:46Z

LGTM

HyukjinKwon · 2018-10-04T01:35:16Z

Thank you @icexelloss, @gatorsmile, @BryanCutler and @viirya.

HyukjinKwon · 2018-10-04T01:35:27Z

Merged to master and branch-2.4.

…for SQL Statement

## What changes were proposed in this pull request?

This PR proposes to register Grouped aggregate UDF Vectorized UDFs for SQL Statement, for instance:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("integer", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
    return v.sum()

spark.udf.register("sum_udf", sum_udf)
q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()
```

```
+---+-----------+
| v2|sum_udf(v1)|
+---+-----------+
|  1|          1|
|  0|          5|
+---+-----------+
```

## How was this patch tested?

Manual test and unit test.

Closes #22620 from HyukjinKwon/SPARK-25601.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

icexelloss reviewed Oct 3, 2018

gatorsmile reviewed Oct 3, 2018

Register Grouped aggregate UDF Vectorized UDFs for SQL Statement

f36bc03

HyukjinKwon force-pushed the SPARK-25601 branch from 97d0377 to f36bc03 Compare October 3, 2018 23:58

BryanCutler reviewed Oct 3, 2018

asfgit closed this in 79dd4c9 Oct 4, 2018

HyukjinKwon deleted the SPARK-25601 branch October 16, 2018 12:43

6 participants
