[SPARK-29020][SQL] Improving array_sort behaviour #25728

Gschiavon · 2019-09-09T08:19:06Z

What changes were proposed in this pull request?

I've noticed that there are two functions to sort arrays: sort_array and array_sort.

sort_array has been available since 1.5.0 and can sort in both ascending and descending order.

array_sort has been available since 2.4.0 and can only sort in ascending order.

This PR adds the option to sort in either ascending or descending order with array_sort.

I think it would be good to unify the behaviours, so that you don't have to use sort_array when you want descending order.
If you are new to Spark, you would expect to be able to sort an array using the newest Spark functions.

Why are the changes needed?

To be able to sort an array in descending order using array_sort instead of sort_array from 1.5.0.

Does this PR introduce any user-facing change?

Yes, you can now sort the array in descending order. Note that it handles nulls the same way as sort_array.

How was this patch tested?

Tests added.

This is the link to the jira

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

srowen · 2019-09-09T10:33:16Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

It looks like you changed behavior here, and I don't think we can do that.

I agree, I kept the original behaviour of array_sort.
Null will be still placed at the end of the array in ascending order and also in descending.

I completely agree about deprecating sort_array, to simplify.

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Gschiavon

Corrected

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

Gschiavon · 2019-09-09T13:13:28Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

I could try to leave nulls always at the end of the array, that won't change the behaviour.
I was just unifying behaviours between sort_array and array_sort.
Let me know if you have any suggestion

srowen · 2019-09-09T15:20:31Z

Yes, I think we can't change the behavior here along the way. The null ordering would have to stay as it was

gatorsmile · 2019-09-09T21:39:17Z

cc @ueshin

ueshin

Seems like there are two changes in this PR:

1. Adding an ascendingOrder argument to ArraySort.
2. Changing ArraySort.nullOrder from NullOrder.Greatest to NullOrder.Least.
   - This changes the behavior, so we need to discuss it more. If we want it, we need a config to go back to the previous behavior.

I think we should handle them in separate PRs to be clearer. Here, we should only handle the first one.

ueshin · 2019-09-09T22:39:21Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

I don't think we can change the behavior either without adding a legacy config and a migration guide.

Btw, if we unify the behaviors, then we don't need to keep the two implementations?
We might want to deprecate one of the two.

HyukjinKwon · 2019-09-10T00:46:48Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

2.4.0 -> 3.0.0

HyukjinKwon · 2019-09-10T00:47:28Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

asc: Boolean -> asc: Column (see comments on the top of functions.scala)

So asc should be a Column?

Are you referring to this comment?

> This function APIs usually have methods with Column signature only because it can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons.

I just changed it

srowen

@ueshin yeah I agree, the bigger question is, why do we have both array_sort and sort_array?

@kiszk looks like you added array_sort to match Presto in https://issues.apache.org/jira/browse/SPARK-23921 . Presto supports a comparator function rather than an ascending flag.

If sort_array does what we want already, and is the older method here, why not make array_sort an alias for it? We could deprecate one, but which one? sort_array has been around longer, but array_sort is at least the name in one other system.

Overall I'd just like to avoid duplicating the implementation, at least, even if both are supported as the function name.

srowen · 2019-09-10T14:10:10Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

I think these would need to be added to Pyspark as well as R?

Gschiavon · 2019-09-10T15:00:35Z

> @ueshin yeah I agree, the bigger question is, why do we have both array_sort and sort_array?
>
> @kiszk looks like you added array_sort to match Presto in https://issues.apache.org/jira/browse/SPARK-23921 . Presto supports a comparator function rather than an ascending flag.
>
> If sort_array does what we want already, and is the older method here, why not make array_sort an alias for it? We could deprecate one, but which one? sort_array has been around longer, but array_sort is at least the name in one other system.
>
> Overall I'd just like to avoid duplicating the implementation, at least, even if both are supported as the function name.

@srowen we couldn't make an alias between sort_array and array_sort because they don't have the same null policy, it would change the behaviour of array_sort.

srowen · 2019-09-10T15:06:25Z

I see, is it that the ordering of nulls is different?
I think we'd at least want the doc of each to comment on the difference with the other very similar method, to make this clear. I suppose one day we may add control over null ordering and deprecate one in favor of the other.

Gschiavon · 2019-09-10T15:35:19Z

> I see, is it that the ordering of nulls is different?
> I think we'd at least want the doc of each to comment on the difference with the other very similar method, to make this clear. I suppose one day we may add control over null ordering and deprecate one in favor of the other.

Okay @srowen, here is how nulls are ordered today:
- sort_array, ascending order: nulls are at the beginning
- sort_array, descending order: nulls are at the end
- array_sort, ascending order: nulls are at the end

And after this PR:
- sort_array remains the same
- array_sort, ascending order: nulls are at the end
- array_sort, descending order: nulls are at the end

So the current behaviour of array_sort remains the same.
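To make the null placement described above concrete, here is a minimal plain-Python model of the two functions as proposed at this point in the thread. It is only a sketch: the names `sort_array_model` and `array_sort_model` are hypothetical, Python `None` stands in for SQL NULL, and none of this is Spark code.

```python
from typing import List, Optional


def sort_array_model(values: List[Optional[int]], asc: bool = True) -> List[Optional[int]]:
    """Hypothetical model of sort_array: nulls first when ascending, last when descending."""
    non_null = sorted((v for v in values if v is not None), reverse=not asc)
    nulls = [v for v in values if v is None]
    return nulls + non_null if asc else non_null + nulls


def array_sort_model(values: List[Optional[int]], asc: bool = True) -> List[Optional[int]]:
    """Hypothetical model of array_sort with the proposed flag: nulls always last."""
    non_null = sorted((v for v in values if v is not None), reverse=not asc)
    nulls = [v for v in values if v is None]
    return non_null + nulls


data = [3, None, 1, 2]
print(sort_array_model(data, asc=True))   # nulls at the beginning
print(sort_array_model(data, asc=False))  # nulls at the end
print(array_sort_model(data, asc=True))   # nulls at the end
print(array_sort_model(data, asc=False))  # nulls at the end
```

The only case where the two models disagree is ascending order, which is why aliasing one function to the other would change existing behaviour.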

srowen · 2019-09-10T15:41:42Z

Yeah, I guess it's unfortunate that the existing null ordering semantics aren't the same, or else these could be unified. Later, maybe we can expose control over that too and then deprecate one in favor of fully specifying the desired ordering in the other.

But for now I'd just add a little bit of documentation pointing out that the behavior is different between array_sort and sort_array.

Gschiavon · 2019-09-10T17:06:50Z

I think it would be good for array_sort to be able to sort in both directions. All the array functions added in 2.4.0 are named "array_something", so having sort_array is kind of confusing.
I think it would be great to have that unified.

ueshin · 2019-09-10T17:07:25Z

As the original motivation, we might want to support a comparator function rather than an ascending flag to follow Presto. I think the null ordering is also following Presto, btw.

How about adding the comparator function here instead? Then, in following PRs, we can make sort_array Unevaluable and replace it with array_sort with an appropriate comparator in the Analyzer to share the implementation, if still needed. I'm not sure we should deprecate sort_array in that case, though.

WDYT?

kiszk · 2019-09-10T17:28:49Z

I agree with @ueshin's opinion about array_sort. It would be great to add the comparator function instead of a flag. The bigger picture of https://issues.apache.org/jira/browse/SPARK-23921 is to follow the array_sort function as Presto implements it.

HyukjinKwon · 2019-09-11T01:05:43Z

+1 of @ueshin's

Gschiavon · 2019-09-11T03:45:47Z

How does it translate for this PR @ueshin ?

HyukjinKwon · 2019-09-11T04:58:18Z

Add a Lambda expression for comparator instead of another column asc in array_sort. See array_sort(array(T), function(T, T, int)) -> array(T) at https://prestodb.github.io/docs/current/functions/array.html

In order to expose it in functions.scala, you can refer to #24232.
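As a rough illustration of why a comparator is more general than an ascending flag, the following plain-Python sketch sorts a list with a user-supplied comparison function. The helper names `array_sort_with` and `desc_nulls_last` are hypothetical, and `functools.cmp_to_key` stands in for the lambda-based comparator discussed here; it is not how Spark or Presto implement it.

```python
from functools import cmp_to_key
from typing import Callable, List, Optional


def array_sort_with(values: List[Optional[int]],
                    comparator: Callable[[Optional[int], Optional[int]], int]) -> List[Optional[int]]:
    """Hypothetical stand-in for array_sort(array, comparator)."""
    return sorted(values, key=cmp_to_key(comparator))


def desc_nulls_last(left: Optional[int], right: Optional[int]) -> int:
    """Comparator for descending order that keeps None (NULL) at the end."""
    if left is None and right is None:
        return 0
    if left is None:
        return 1   # None sorts after any value
    if right is None:
        return -1  # any value sorts before None
    if left < right:
        return 1   # the larger element comes first
    if left > right:
        return -1
    return 0


print(array_sort_with([3, None, 1, 2], desc_nulls_last))
```

With this shape, descending order and null placement are both decided by the caller, so no separate flag is needed for either.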

Gschiavon · 2019-09-12T10:35:31Z

OK, I understand. Might that be a different PR, though? Then array_sort would have different signatures, for example:

array_sort(column) -> asc by default
array_sort(column, order)
array_sort(column, comparatorFunction)

Or you just want to have array_sort(column, comparatorFunction)?

I think having array_sort(column, order) is easier to use, and most of the time you just want to order asc or desc. At the same time, having the comparatorFunction can open up more possibilities.

kiszk · 2019-09-12T11:35:32Z

I think that we want to have two signatures:

array_sort(column) -> asc by default
array_sort(column, order)

This is for compatibility with Presto.

ueshin · 2019-09-12T20:05:27Z

Actually what I had in mind was we should have the two:

array_sort(column) -> asc by default
array_sort(column, comparatorFunction)

@kiszk I think you mean this as per your previous comment, right?

Gschiavon · 2019-09-13T05:05:12Z

I understood the same @ueshin since he said to match presto

SparkQA · 2019-11-13T11:41:28Z

Test build #113680 has finished for PR 25728 at commit 53b563e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2019-11-13T17:59:25Z

Jenkins, retest this please.

SparkQA · 2019-11-13T21:56:58Z

Test build #113721 has finished for PR 25728 at commit 53b563e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin

LGTM.
I'd leave this to @kiszk or other reviewers since this includes my code, so I might miss something.

@Gschiavon Could you remove [WIP] from the PR title?

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

SparkQA · 2019-11-14T07:50:08Z

Test build #113749 has finished for PR 25728 at commit 7c38b8a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Gschiavon · 2019-11-17T14:06:38Z

ping @kiszk

kiszk · 2019-11-17T18:41:24Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    representing two elements of the array.
+    It returns -1, 0, or 1 as the first element is less than, equal to, or greater
+    than the second element. If the comparator function returns other
+    values (including NULL), the query will fail and raise an error.


nit: query -> function?
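The contract worded in the quoted description (the comparator returns -1, 0, or 1, and anything else, including NULL, is an error) can be modelled in plain Python as below. This is a sketch of the documented wording only, under the assumption that the check is applied on every comparison; the names `checked` and `sort_with_contract` are hypothetical, and this does not claim to mirror what Spark's implementation actually enforces.

```python
from functools import cmp_to_key
from typing import Callable, List


def checked(comparator: Callable[[int, int], object]) -> Callable[[int, int], int]:
    """Wrap a comparator so that results other than -1, 0, or 1 raise an error."""
    def wrapper(left: int, right: int) -> int:
        result = comparator(left, right)
        if result is None or result not in (-1, 0, 1):
            raise ValueError(f"comparator returned {result!r}; expected -1, 0, or 1")
        return result
    return wrapper


def sort_with_contract(values: List[int], comparator: Callable[[int, int], object]) -> List[int]:
    """Hypothetical model: sort with a comparator whose return value is validated."""
    return sorted(values, key=cmp_to_key(checked(comparator)))


# A comparator that respects the contract.
print(sort_with_contract([2, 3, 1], lambda l, r: (l > r) - (l < r)))

# A comparator that violates it (the difference may be any integer).
try:
    sort_with_contract([1, 5], lambda l, r: l - r)
except ValueError as err:
    print("rejected:", err)
```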

kiszk · 2019-11-17T18:43:19Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+@ExpressionDescription(
+  usage = """_FUNC_(expr, func) - Sorts the input array in ascending order. The elements of the
+    input array must be orderable. Null elements will be placed at the end of the returned
+    array. Since 3.0.0 also sorts and returns the array based on the given


The statement "Since 3.0.0 ..." looks weird. For example, "Since 3.0.0, this function sorts and returns ..." or something similar.

kiszk · 2019-11-17T18:48:31Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = """_FUNC_(expr, func) - Sorts the input array in ascending order. The elements of the


Is the phrase "in ascending order" always true?

kiszk · 2019-11-17T19:17:21Z

@Gschiavon Sorry for being late since I made a business trip. I left a few comments.

Gschiavon · 2019-11-17T19:18:13Z

No problem @kiszk ! I just made some changes based on your comments!

SparkQA · 2019-11-17T22:50:16Z

Test build #113964 has finished for PR 25728 at commit ef28d4f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-11-18T07:06:57Z

Merged to master.

### What changes were proposed in this pull request?
This PR is a follow-up of #25728. #25728 introduces additional arguments to determine sort order. Thus, this function does not sort only in ascending order. However, the description was not updated. This PR updates the description to follow the latest feature.

### Why are the changes needed?

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests since this PR just updates description text.

Closes #27404 from kiszk/SPARK-29020-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…aFrame operations

### What changes were proposed in this pull request?
Adding a new `array_sort` overload to `org.apache.spark.sql.functions` that matches the new overload defined in [SPARK-29020](https://issues.apache.org/jira/browse/SPARK-29020) and added via #25728.

### Why are the changes needed?
Adds access to the new overload for users of the DataFrame API so that they don't need to use the `expr` escape hatch.

### Does this PR introduce _any_ user-facing change?
Yes, now allows users to optionally provide a comparator function to the `array_sort`, which opens up the ability to sort descending as well as sort items that aren't naturally orderable.

#### Example:
Old:
```
df.selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))");
```

Added:
```
df.select(array_sort(col("a"), (x, y) => size(x) - size(y)));
```

### How was this patch tested?
Unit tests updated to validate that the overload matches the expression's behavior.

Closes #37361 from brandondahler/features/ArraySortOverload.

Authored-by: Brandon Dahler <bnd@amazon.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### What changes were proposed in this pull request?
This pr cleanup the legacy code of `SortArray` to remove `ArraySortLike` and inline `nullOrder`. The `ArraySort` has been rewritten since #25728, so the `SortArray` is the only implementation of `ArraySortLike`.

### Why are the changes needed?
cleanup the code

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Pass CI

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #47547 from ulysses-you/cleanup.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>

srowen reviewed Sep 9, 2019

Gschiavon commented Sep 9, 2019

dongjoon-hyun changed the title ~~[SPARK-29020] Improving array_sort behaviour~~ [SPARK-29020][SQL] Improving array_sort behaviour Sep 9, 2019

dongjoon-hyun added the SQL label Sep 9, 2019

ueshin reviewed Sep 9, 2019

HyukjinKwon reviewed Sep 10, 2019

srowen reviewed Sep 10, 2019

Gschiavon closed this Sep 10, 2019

Gschiavon reopened this Sep 10, 2019

gschiavon added 4 commits September 15, 2019 21:29

[SPARK-29020] Improving array_sort behaviour

134e9e4

[SPARK-29020] [SQL] Keep array_sort original behaviour

aeee71c

[SPARK-29020] [SQL] ascending parameter is now a Column

ebde544

Array sorting as HOF with Integers Asc and Desc

3f4c328

Fix lambda function names and readabilty of query examples

53b563e

ueshin approved these changes Nov 13, 2019

Gschiavon changed the title ~~[SPARK-29020][WIP][SQL] Improving array_sort behaviour~~ [SPARK-29020][SQL] Improving array_sort behaviour Nov 14, 2019

revert indent

7c38b8a

kiszk reviewed Nov 17, 2019

Changed ArraySort description

ef28d4f

HyukjinKwon approved these changes Nov 18, 2019

HyukjinKwon closed this in 7391237 Nov 18, 2019

kiszk mentioned this pull request Jan 30, 2020

[SPARK-29020][FOLLOWUP][SQL] Update description of array_sort function #27404

Closed

brandondahler mentioned this pull request Aug 1, 2022

[SPARK-39925][SQL] Add array_sort(column, comparator) overload to DataFrame operations #37361

Closed

ulysses-you mentioned this pull request Jul 31, 2024

[SPARK-49071][SQL] Remove ArraySortLike trait #47547

Closed
