
Conversation

@brandondahler commented Aug 1, 2022

What changes were proposed in this pull request?

Adds a new array_sort overload to org.apache.spark.sql.functions that matches the comparator-accepting expression defined in SPARK-29020 and added via #25728.

Why are the changes needed?

Adds access to the new overload for users of the DataFrame API so that they don't need to use the expr escape hatch.

Does this PR introduce any user-facing change?

Yes. Users can now optionally provide a comparator function to array_sort, which makes it possible to sort in descending order as well as to sort items that aren't naturally orderable.

Example:

Old:

df.selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))");

Added:

df.select(array_sort(col("a"), (x, y) => size(x) - size(y)));
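For reference, a minimal, self-contained sketch of the new overload in use (this assumes a Spark build that already includes this change; the object name and sample data are illustrative only and not part of the PR):

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object ArraySortComparatorExample {
  def main(args: Array[String]): Unit = {
    // Local session for the example; any existing SparkSession works as well.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array_sort comparator example")
      .getOrCreate()
    import spark.implicits._

    // One row whose column "a" holds an array of arrays.
    val df = Seq(Seq(Seq(1, 2, 3), Seq(1), Seq(1, 2))).toDF("a")

    // Sort the inner arrays by size, mirroring the example above.
    df.select(array_sort($"a", (x: Column, y: Column) => size(x) - size(y)).as("by_size"))
      .show(false)

    // Flip the comparator to sort a numeric array in descending order.
    val nums = Seq(Seq(3, 1, 2)).toDF("n")
    nums.select(array_sort($"n", (x: Column, y: Column) => y - x).as("descending"))
      .show(false)

    spark.stop()
  }
}

The comparator is an ordinary (Column, Column) => Column function whose result should evaluate to a negative, zero, or positive integer, matching the semantics of the SQL lambda form.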

How was this patch tested?

Unit tests were updated to validate that the overload matches the expression's behavior.
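As an illustration of that kind of check, here is a hypothetical parity sketch (not the actual test added in this PR) comparing the new overload against the equivalent SQL expression; it assumes a SparkSession named spark, e.g. in spark-shell on a build containing this change:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Seq(Seq(1, 2, 3), Seq(1), Seq(1, 2))).toDF("a")

// The same sort expressed through the new overload and through the SQL lambda syntax.
val viaFunction = df
  .select(array_sort(col("a"), (x: Column, y: Column) => size(x) - size(y)).as("sorted"))
  .collect()
val viaExpr = df
  .selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y)) AS sorted")
  .collect()

// Both paths should produce identical rows.
assert(viaFunction.toSeq == viaExpr.toSeq)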

@github-actions github-actions bot added the SQL label Aug 1, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

LGTM. Are you also interested in adding this in SparkR and PySpark? We can do that in a separate PR.

@brandondahler
Author

> LGTM. Are you also interested in adding this in SparkR and PySpark? We can do that in a separate PR.

I do think they should be added (I checked that they aren't already there), but I don't personally have availability to do so at this time.

@HyukjinKwon
Member

Oops, it slipped through my fingers. Mind retriggering https://github.com/brandondahler/spark/runs/7585897593?

@HyukjinKwon
Member

cc @zero323, @itholic, @zhengruifeng FYI (since we need to add PySpark and SparkR ones)

@brandondahler
Author

Clicked "Re-run all jobs" on that linked run; let me know if there was something else you meant for me to do.

@zhengruifeng
Contributor

zhengruifeng commented Aug 18, 2022

Since pyspark/sql/tests/test_functions.py checks the parity between PySpark and SQL, I think we may need to add array_sort to expected_missing_in_py.

Otherwise, LGTM.

@zero323
Copy link
Member

zero323 commented Aug 18, 2022

It seems like it has to be re-synced with upstream to address the black formatter failures.

@brandondahler force-pushed the features/ArraySortOverload branch from 49743ea to 72d799b on August 20, 2022 at 23:06
@brandondahler
Author

Rebased on the latest master changes.

@HyukjinKwon
Member

Merged to master.
