Skip to content

Conversation

@Gschiavon
Copy link
Contributor

@Gschiavon Gschiavon commented Sep 9, 2019

What changes were proposed in this pull request?

I've noticed that there are two functions to sort arrays sort_array and array_sort.

sort_array is from 1.5.0 and it has the possibility of ordering both ascending and descending

array_sort is from 2.4.0 and it only has the possibility of ordering in ascending.

Basically I just added the possibility of ordering either ascending or descending using array_sort.

I think it would be good to have unified behaviours and not having to user sort_array when you want to order in descending order.
Imagine that you are new to spark, I'd like to be able to sort array using the newest spark functions.

Why are the changes needed?

Basically to be able to sort the array in descending order using array_sort instead of using sort_array from 1.5.0

Does this PR introduce any user-facing change?

Yes, now you are able to sort the array in descending order. Note that it has the same behaviour with nulls than sort_array

How was this patch tested?

Test's added

This is the link to the jira

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you changed behavior here, and I don't think we can do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could try to leave nulls always at the end of the array, that won't change the behaviour.
I was just unifying behaviours between sort_array and array_sort.
Let me know if you have any suggestion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can change the behavior either without adding a legacy config and a migration guide.

Btw, if we unify the behaviors, then we don't need to keep the two implementations?
We might want to deprecate one of the two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I kept the original behaviour of array_sort.
Null will be still placed at the end of the array in ascending order and also in descending.

I completely agree about deprecating sort_array, to simplify.

Copy link
Contributor Author

@Gschiavon Gschiavon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could try to leave nulls always at the end of the array, that won't change the behaviour.
I was just unifying behaviours between sort_array and array_sort.
Let me know if you have any suggestion

@srowen
Copy link
Member

srowen commented Sep 9, 2019

Yes, I think we can't change the behavior here along the way. The null ordering would have to stay as it was

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29020] Improving array_sort behaviour [SPARK-29020][SQL] Improving array_sort behaviour Sep 9, 2019
@gatorsmile
Copy link
Member

cc @ueshin

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there are two changes in this PR:

  • Adding ascendingOrder argument to ArraySort.
  • Changing the ArraySort.nullOrder from NullOrder.Greatest to NullOrder.Least.
    • This causes the behavior changes. We need to discuss more. If we want, we need to have a config to back to the previous behavior.

I think we should handle them in separate PRs to be clearer. Here, we should only handle the first one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can change the behavior either without adding a legacy config and a migration guide.

Btw, if we unify the behaviors, then we don't need to keep the two implementations?
We might want to deprecate one of the two.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.4.0 -> 3.0.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asc: Boolean -> asc: Column (see comments on the top of functions.scala)

Copy link
Contributor Author

@Gschiavon Gschiavon Sep 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So asc should be a Column?

Are you referring to this comment?

* This function APIs usually have methods with Column signature only because it can support not * only Column but also other types such as a native string. The other variants currently exist * for historical reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just changed it

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin yeah I agree, the bigger question is, why do we have both array_sort and sort_array?

@kiszk looks like you added array_sort to match Presto in https://issues.apache.org/jira/browse/SPARK-23921 . Presto supports a comparator function rather than an ascending flag.

If sort_array does what we want already, and is the older method here, why not make array_sort an alias for it? We could deprecate one, but which one? sort_array has been around longer, but array_sort is at least the name in one other system.

Overall I'd just like to avoid duplicating the implementation, at least, even if both are supported as the function name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these would need to be added to Pyspark as well as R?

@Gschiavon
Copy link
Contributor Author

@ueshin yeah I agree, the bigger question is, why do we have both array_sort and sort_array?

@kiszk looks like you added array_sort to match Presto in https://issues.apache.org/jira/browse/SPARK-23921 . Presto supports a comparator function rather than an ascending flag.

If sort_array does what we want already, and is the older method here, why not make array_sort an alias for it? We could deprecate one, but which one? sort_array has been around longer, but array_sort is at least the name in one other system.

Overall I'd just like to avoid duplicating the implementation, at least, even if both are supported as the function name.

@srowen we couldn't make an alias between sort_array and array_sort because they don't have the same null policy, it would change the behaviour of array_sort.

@Gschiavon Gschiavon closed this Sep 10, 2019
@Gschiavon Gschiavon reopened this Sep 10, 2019
@srowen
Copy link
Member

srowen commented Sep 10, 2019

I see, is it that the ordering of nulls is different?
I think we'd at least want the doc of each to comment on the difference with the other very similar method, to make this clear. I suppose one day we may add control over null ordering and deprecate one in favor of the other.

@Gschiavon
Copy link
Contributor Author

I see, is it that the ordering of nulls is different?
I think we'd at least want the doc of each to comment on the difference with the other very similar method, to make this clear. I suppose one day we may add control over null ordering and deprecate one in favor of the other.

Okay @srowen , I'll explain the ordering of nulls for all as they are now:
sort_array in ascending order nulls are at the beginning
sort_array in descending order nulls are at the end
array_sort in ascending order nulls are at the end

And after the PR:
sort_array will remain the same
array_sort in ascending order nulls are at the end
array_sort in descending order nulls are at the end

So this way the actual behaviour of array_sort remains the same.

@srowen
Copy link
Member

srowen commented Sep 10, 2019

Yeah, I guess it's unfortunate that the existing null ordering semantics aren't the same, or else these could be unified. Later, maybe we can expose control over that too and then deprecate one in favor of fully specifying the desired ordering in the other.

But for now I'd just add a little bit of documentation pointing out that the behavior is different between array_sort and sort_array.

@Gschiavon
Copy link
Contributor Author

Gschiavon commented Sep 10, 2019

I think it would be good to have array_sort to be able to order in both ways, all the arrays functions from 2.4.0 are "array_something" and it's kind of confusing to have sort_array.
I think it would be great to have that unified.

@ueshin
Copy link
Member

ueshin commented Sep 10, 2019

As the original motivation, we might want to support a comparator function rather than an ascending flag to follow Presto. I think the null ordering is also following Presto, btw.

How about adding the comparator function here instead? Then we can make sort_array as Unvaluable and replace it to array_sort with appropriate comparator in Analyzer to share the implementation if still needed in the following PRs. I'm not sure we should deprecate sort_array in that case, though.

WDYT?

@kiszk
Copy link
Member

kiszk commented Sep 10, 2019

I agree with @ueshin 's opinion about array_sort. It would be great to add the comparator function instead of a flag. The bigger picture of https://issues.apache.org/jira/browse/SPARK-23921 is to follow array_sort function as Presto implements.

@HyukjinKwon
Copy link
Member

+1 of @ueshin's

@Gschiavon
Copy link
Contributor Author

How does it translate for this PR @ueshin ?

@HyukjinKwon
Copy link
Member

Add a Lambda expression for comparator instead of another column asc in array_sort. See array_sort(array(T), function(T, T, int)) -> array(T) at https://prestodb.github.io/docs/current/functions/array.html

In order to expose it in functions.scala, you can refer #24232.

@Gschiavon
Copy link
Contributor Author

Ok I understand. Also I think that might be a different PR? and then have different signatures for array_sort. For example:

array_sort(column) -> asc by default
array_sort(column, order)
array_sort(column, comparatorFunction)

Or you just want to have array_sort(column, comparatorFunction)?

I think having array_sort(column, order) is easier to use, and most of the times you just want to order asc or desc. At the same time having the comparatorFunction can open more possibilities.

@kiszk
Copy link
Member

kiszk commented Sep 12, 2019

I think that we want to have two signatures:

array_sort(column) -> asc by default
array_sort(column, order)

This is for compatibility with Presto.

@ueshin
Copy link
Member

ueshin commented Sep 12, 2019

Actually what I had in mind was we should have the two:

array_sort(column) -> asc by default
array_sort(column, comparatorFunction)

@kiszk I think you mean this as per your previous comment, right?

@Gschiavon
Copy link
Contributor Author

I understood the same @ueshin since he said to match presto

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113680 has finished for PR 25728 at commit 53b563e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Nov 13, 2019

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113721 has finished for PR 25728 at commit 53b563e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
I'd leave this to @kiszk or other reviewers since this includes my code, so I might miss something.

@Gschiavon Could you remove [WIP] from the PR title?

@Gschiavon Gschiavon changed the title [SPARK-29020][WIP][SQL] Improving array_sort behaviour [SPARK-29020][SQL] Improving array_sort behaviour Nov 14, 2019
@SparkQA
Copy link

SparkQA commented Nov 14, 2019

Test build #113749 has finished for PR 25728 at commit 7c38b8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Gschiavon
Copy link
Contributor Author

ping @kiszk

representing two elements of the array.
It returns -1, 0, or 1 as the first element is less than, equal to, or greater
than the second element. If the comparator function returns other
values (including NULL), the query will fail and raise an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: query -> function?

@ExpressionDescription(
usage = """_FUNC_(expr, func) - Sorts the input array in ascending order. The elements of the
input array must be orderable. Null elements will be placed at the end of the returned
array. Since 3.0.0 also sorts and returns the array based on the given
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement Since 3.0.0 ... looks weird. For example, Since 3.0.0, this fuction sorts and returns ... or others.

*/
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = """_FUNC_(expr, func) - Sorts the input array in ascending order. The elements of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this phrase in ascending order always true?

@kiszk
Copy link
Member

kiszk commented Nov 17, 2019

@Gschiavon Sorry for being late since I made a business trip. I left a few comments.

@Gschiavon
Copy link
Contributor Author

No problem @kiszk ! I just made some changes based on your comments!

@SparkQA
Copy link

SparkQA commented Nov 17, 2019

Test build #113964 has finished for PR 25728 at commit ef28d4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

dongjoon-hyun pushed a commit that referenced this pull request Jan 30, 2020
### What changes were proposed in this pull request?

This PR is a follow-up of #25728. #25728 introduces additional arguments to determine sort order. Thus, this function does not sort only in ascending order. However, the description was not updated.
This PR updates the description to follow the latest feature.

### Why are the changes needed?

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests since this PR just updates description text.

Closes #27404 from kiszk/SPARK-29020-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
HyukjinKwon pushed a commit that referenced this pull request Aug 21, 2022
…aFrame operations

### What changes were proposed in this pull request?
Adding a new `array_sort` overload to `org.apache.spark.sql.functions` that matches the new overload defined in [SPARK-29020](https://issues.apache.org/jira/browse/SPARK-29020) and added via #25728.

### Why are the changes needed?
Adds access to the new overload for users of the DataFrame API so that they don't need to use the `expr` escape hatch.

### Does this PR introduce _any_ user-facing change?
Yes, now allows users to optionally provide a comparator function to the `array_sort`, which opens up the ability to sort descending as well as sort items that aren't naturally orderable.

#### Example:
Old:
```
df.selectExpr("array_sort(a, (x, y) -> cardinality(x) - cardinality(y))");
```

Added:
```
df.select(array_sort(col("a"), (x, y) => size(x) - size(y)));
```

### How was this patch tested?
Unit tests updated to validate that the overload matches the expression's behavior.

Closes #37361 from brandondahler/features/ArraySortOverload.

Authored-by: Brandon Dahler <bnd@amazon.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ulysses-you added a commit that referenced this pull request Aug 2, 2024
### What changes were proposed in this pull request?

This pr cleanup the legacy code of `SortArray` to remove `ArraySortLike` and inline `nullOrder`. The `ArraySort` has been rewritten since #25728, so the `SortArray` is the only implementation of `ArraySortLike`.

### Why are the changes needed?

cleanup the code

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Pass CI

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47547 from ulysses-you/cleanup.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants