-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24042][SQL] Collection function: zip_with_index #21121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24042][SQL] Collection function: zip_with_index #21121
Conversation
|
ok to test |
python/pyspark/sql/functions.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there's one more leading space here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // scalastyle:on line.size.limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid using a default value in APIs. It doesn't work in Java.
|
Which database has this function? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: How about val (valuePosition, indexPosition) = if (indexFirstValue) ("1", "0") else ("0", "1")?
|
@gatorsmile I'm not aware of any. From user experience, I strongly feel that such a function is missing. Escpecially, when transform function is introduced. |
|
Test build #89679 has finished for PR 21121 at commit
|
|
Test build #89683 has finished for PR 21121 at commit
|
|
Test build #89686 has finished for PR 21121 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the index be 0-based or 1-based? Other array functions seems to be 1-based.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's really good question! The newly added functions element_at and array_position are 1-based. But on the other handed, the getItem from the Column class is 0-based. What about adding one extra parameter and let users decide whether the array will indexed from 0 or 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure the input is always unsafe-backed array? If it is GenericArrayData?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. You just use unsafe-backed array as output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, if we use GenericArrayData as output array, can't we avoid this limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your suggestion. So instead of throwing the exception, the function will execute a similar piece of code as in genCodeForNonPrimitiveElements...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we can alleviate this limitation ( up to MAX_ARRAY_LENGTHMAX_ARRAY_LENGTH elements) if we use GenericArrayData. BTW, we have to do the same check in genCodeForNonPrimitiveElements`, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, this is zip that does not involve concat of multiple arrays.
|
Test build #89802 has finished for PR 21121 at commit
|
|
Test build #89803 has finished for PR 21121 at commit
|
|
Test build #89804 has finished for PR 21121 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove null check if containNulls is false even when elementType is not primitive type? For example, ArrayType(StringType, null)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot! Thanks.
|
Test build #89838 has finished for PR 21121 at commit
|
|
Test build #89857 has finished for PR 21121 at commit
|
fd71544 to
51c8199
Compare
… feedback from code review.
51c8199 to
bcd52bd
Compare
|
Test build #89887 has finished for PR 21121 at commit
|
|
Test build #89888 has finished for PR 21121 at commit
|
|
I'm still not sure we really need this function. |
|
@ueshin Currently we use our own implementation of zipWithIndex when we do explode and need to preserve the ordering of the array elements (especially if there is a shuffle involved in the subsequent transformation). Sure, once transform becomes available, it will be much better and more performant to use that, but since we're dealing with production applications, we would like to start rewriting these jobs with those small "drop-in" replacements for functions such as zipWithIndex before going for a major rewrite with HOFs in spark SQL. I've seen many threads in the community, which recommend the same approach when dealing with these difficult array cases - I'm pretty sure it will benefit other users. |
|
@ueshin What about combining |
|
Would this cover https://issues.apache.org/jira/browse/SPARK-23074 as well? Thanks. |
|
@rxin Oh, I see. In that case, I'm happy to close the PR. @hvanhovell Can you confirm that the |
|
Can one of the admins verify this patch? |
|
@mn-mikke I think we can close this since we've added |
|
Sure, closing ... |
What changes were proposed in this pull request?
Adding function zip_with_index(array[, indexFirst, startFromZero]) that transforms the input array by encapsulating elements into pairs with indexes indicating the order.
How was this patch tested?
New tests added into:
Codegen examples
Primitive type
Result:
Non-primitive type
Result: