-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29020][SQL] Improving array_sort behaviour #25728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you changed behavior here, and I don't think we can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could try to leave nulls always at the end of the array, that won't change the behaviour.
I was just unifying behaviours between sort_array and array_sort.
Let me know if you have any suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can change the behavior either without adding a legacy config and a migration guide.
Btw, if we unify the behaviors, then we don't need to keep the two implementations?
We might want to deprecate one of the two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I kept the original behaviour of array_sort.
Null will be still placed at the end of the array in ascending order and also in descending.
I completely agree about deprecating sort_array, to simplify.
Gschiavon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could try to leave nulls always at the end of the array, that won't change the behaviour.
I was just unifying behaviours between sort_array and array_sort.
Let me know if you have any suggestion
|
Yes, I think we can't change the behavior here along the way. The null ordering would have to stay as it was |
|
cc @ueshin |
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like there are two changes in this PR:
- Adding
ascendingOrderargument toArraySort. - Changing the
ArraySort.nullOrderfromNullOrder.GreatesttoNullOrder.Least.- This causes the behavior changes. We need to discuss more. If we want, we need to have a config to back to the previous behavior.
I think we should handle them in separate PRs to be clearer. Here, we should only handle the first one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can change the behavior either without adding a legacy config and a migration guide.
Btw, if we unify the behaviors, then we don't need to keep the two implementations?
We might want to deprecate one of the two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2.4.0 -> 3.0.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asc: Boolean -> asc: Column (see comments on the top of functions.scala)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So asc should be a Column?
Are you referring to this comment?
* This function APIs usually have methods with Column signature only because it can support not * only Column but also other types such as a native string. The other variants currently exist * for historical reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just changed it
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin yeah I agree, the bigger question is, why do we have both array_sort and sort_array?
@kiszk looks like you added array_sort to match Presto in https://issues.apache.org/jira/browse/SPARK-23921 . Presto supports a comparator function rather than an ascending flag.
If sort_array does what we want already, and is the older method here, why not make array_sort an alias for it? We could deprecate one, but which one? sort_array has been around longer, but array_sort is at least the name in one other system.
Overall I'd just like to avoid duplicating the implementation, at least, even if both are supported as the function name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these would need to be added to Pyspark as well as R?
@srowen we couldn't make an alias between sort_array and array_sort because they don't have the same null policy, it would change the behaviour of array_sort. |
|
I see, is it that the ordering of nulls is different? |
Okay @srowen , I'll explain the ordering of nulls for all as they are now: And after the PR: So this way the actual behaviour of |
|
Yeah, I guess it's unfortunate that the existing null ordering semantics aren't the same, or else these could be unified. Later, maybe we can expose control over that too and then deprecate one in favor of fully specifying the desired ordering in the other. But for now I'd just add a little bit of documentation pointing out that the behavior is different between array_sort and sort_array. |
|
I think it would be good to have array_sort to be able to order in both ways, all the arrays functions from 2.4.0 are "array_something" and it's kind of confusing to have sort_array. |
|
As the original motivation, we might want to support a comparator function rather than an ascending flag to follow Presto. I think the null ordering is also following Presto, btw. How about adding the comparator function here instead? Then we can make WDYT? |
|
I agree with @ueshin 's opinion about |
|
+1 of @ueshin's |
|
How does it translate for this PR @ueshin ? |
|
Add a Lambda expression for comparator instead of another column In order to expose it in |
|
Ok I understand. Also I think that might be a different PR? and then have different signatures for array_sort. For example: array_sort(column) -> asc by default Or you just want to have array_sort(column, comparatorFunction)? I think having array_sort(column, order) is easier to use, and most of the times you just want to order asc or desc. At the same time having the comparatorFunction can open more possibilities. |
|
I think that we want to have two signatures: This is for compatibility with Presto. |
|
Actually what I had in mind was we should have the two: @kiszk I think you mean this as per your previous comment, right? |
|
I understood the same @ueshin since he said to match presto |
|
Test build #113680 has finished for PR 25728 at commit
|
|
Jenkins, retest this please. |
|
Test build #113721 has finished for PR 25728 at commit
|
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I'd leave this to @kiszk or other reviewers since this includes my code, so I might miss something.
@Gschiavon Could you remove [WIP] from the PR title?
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #113749 has finished for PR 25728 at commit
|
|
ping @kiszk |
| representing two elements of the array. | ||
| It returns -1, 0, or 1 as the first element is less than, equal to, or greater | ||
| than the second element. If the comparator function returns other | ||
| values (including NULL), the query will fail and raise an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: query -> function?
| @ExpressionDescription( | ||
| usage = """_FUNC_(expr, func) - Sorts the input array in ascending order. The elements of the | ||
| input array must be orderable. Null elements will be placed at the end of the returned | ||
| array. Since 3.0.0 also sorts and returns the array based on the given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This statement Since 3.0.0 ... looks weird. For example, Since 3.0.0, this fuction sorts and returns ... or others.
| */ | ||
| // scalastyle:off line.size.limit | ||
| @ExpressionDescription( | ||
| usage = """_FUNC_(expr, func) - Sorts the input array in ascending order. The elements of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this phrase in ascending order always true?
|
@Gschiavon Sorry for being late since I made a business trip. I left a few comments. |
|
No problem @kiszk ! I just made some changes based on your comments! |
|
Test build #113964 has finished for PR 25728 at commit
|
|
Merged to master. |
### What changes were proposed in this pull request? This PR is a follow-up of #25728. #25728 introduces additional arguments to determine sort order. Thus, this function does not sort only in ascending order. However, the description was not updated. This PR updates the description to follow the latest feature. ### Why are the changes needed? ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests since this PR just updates description text. Closes #27404 from kiszk/SPARK-29020-followup. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…aFrame operations ### What changes were proposed in this pull request? Adding a new `array_sort` overload to `org.apache.spark.sql.functions` that matches the new overload defined in [SPARK-29020](https://issues.apache.org/jira/browse/SPARK-29020) and added via #25728. ### Why are the changes needed? Adds access to the new overload for users of the DataFrame API so that they don't need to use the `expr` escape hatch. ### Does this PR introduce _any_ user-facing change? Yes, now allows users to optionally provide a comparator function to the `array_sort`, which opens up the ability to sort descending as well as sort items that aren't naturally orderable. #### Example: Old: ``` df.selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))"); ``` Added: ``` df.select(array_sort(col("a"), (x, y) => size(x) - size(y))); ``` ### How was this patch tested? Unit tests updated to validate that the overload matches the expression's behavior. Closes #37361 from brandondahler/features/ArraySortOverload. Authored-by: Brandon Dahler <bnd@amazon.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This pr cleanup the legacy code of `SortArray` to remove `ArraySortLike` and inline `nullOrder`. The `ArraySort` has been rewritten since #25728, so the `SortArray` is the only implementation of `ArraySortLike`. ### Why are the changes needed? cleanup the code ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Pass CI ### Was this patch authored or co-authored using generative AI tooling? no Closes #47547 from ulysses-you/cleanup. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: youxiduo <youxiduo@corp.netease.com>
What changes were proposed in this pull request?
I've noticed that there are two functions to sort arrays sort_array and array_sort.
sort_array is from 1.5.0 and it has the possibility of ordering both ascending and descending
array_sort is from 2.4.0 and it only has the possibility of ordering in ascending.
Basically I just added the possibility of ordering either ascending or descending using array_sort.
I think it would be good to have unified behaviours and not having to user sort_array when you want to order in descending order.
Imagine that you are new to spark, I'd like to be able to sort array using the newest spark functions.
Why are the changes needed?
Basically to be able to sort the array in descending order using array_sort instead of using sort_array from 1.5.0
Does this PR introduce any user-facing change?
Yes, now you are able to sort the array in descending order. Note that it has the same behaviour with nulls than sort_array
How was this patch tested?
Test's added
This is the link to the jira