-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-23932][SQL] Higher order function zip_with #22031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| usage = "_FUNC_(expr, func) - Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.", | ||
| examples = """ | ||
| Examples: | ||
| > SELECT _FUNC_(array(1, 2, 3), x -> x + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The examples are not accurate.
You could something like:
> SELECT _FUNC_(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x));
array(('a', 1), ('b', 3), ('c', 5))
> SELECT _FUNC_(array(1, 2), array(3, 4), (x, y) -> x + y));
array(4, 6)
> SELECT _FUNC_(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y));
array('ad', 'be', 'cf')
| override def dataType: ArrayType = ArrayType(function.dataType, function.nullable) | ||
|
|
||
| override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ArraysZipWith = { | ||
| val (leftElementType, leftContainsNull) = left.dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can utilize HigherOrderFunction.arrayArgumentType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not valid anymore. The method has been removed by #22075.
| val leftArr = left.eval(input).asInstanceOf[ArrayData] | ||
| val rightArr = right.eval(input).asInstanceOf[ArrayData] | ||
|
|
||
| if (leftArr == null || rightArr == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If leftArr is null, right doesn't have to be evaluated.
| (elementType, containsNull) | ||
| } | ||
| copy(function = f(function, | ||
| (leftElementType, leftContainsNull) :: (rightElementType, rightContainsNull) :: Nil)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to support different size of input arrays (The jira ticket says: "Both arrays must be the same length."), what about the scenario when one array is empty and the second has elements? Shouldn't we use true instead of leftContainsNull and rightContainsNull?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we append nulls to the shorter array, both of the arguments might be null, so we should use true for nullabilities of the arguments as @mn-mikke suggested.
|
Test build #94389 has finished for PR 22031 at commit
|
|
Test build #94391 has finished for PR 22031 at commit
|
|
Test build #94392 has finished for PR 22031 at commit
|
| testArrayOfPrimitiveTypeContainsNull() | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for invalid cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also can you add tests to HigherOrderFunctionsSuite to check more explicit patterns?
|
|
||
| override def functions: Seq[Expression] = List(function) | ||
|
|
||
| def expectingFunctionType: AbstractDataType = AnyDataType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to define this?
| val LambdaFunction(_, | ||
| (arr1Var: NamedLambdaVariable):: (arr2Var: NamedLambdaVariable) :: Nil, _) = function | ||
| (arr1Var, arr2Var) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the following should work:
@transient lazy val LambdaFunction(_,
Seq(leftElemVar: NamedLambdaVariable, rightElemVar: NamedLambdaVariable), _) = function|
Hi @techaddict, |
|
Hi @ueshin I will update the PR tommorow |
|
@techaddict Thanks! I look forward to the update. |
|
Test build #94829 has finished for PR 22031 at commit
|
|
Test build #94830 has finished for PR 22031 at commit
|
| case _ => | ||
| val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType | ||
| (elementType, containsNull) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we can do:
val ArrayType(leftElementType, leftContainsNull) = left.dataType
val ArrayType(rightElementType, rightContainsNull) = right.dataType| (elementType, containsNull) | ||
| } | ||
| copy(function = f(function, | ||
| (leftElementType, leftContainsNull) :: (rightElementType, rightContainsNull) :: Nil)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we append nulls to the shorter array, both of the arguments might be null, so we should use true for nullabilities of the arguments as @mn-mikke suggested.
| right: Expression, | ||
| f: (Expression, Expression) => Expression): Expression = { | ||
| val ArrayType(leftT, leftContainsNull) = left.dataType.asInstanceOf[ArrayType] | ||
| val ArrayType(rightT, rightContainsNull) = right.dataType.asInstanceOf[ArrayType] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't need .asInstanceOf[ArrayType]?
| } | ||
|
|
||
| val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false)) | ||
| val ai1 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between ai0 and ai1?
| ArrayType(ArrayType(IntegerType, containsNull = false), containsNull = true)) | ||
| checkEvaluation( | ||
| zip_with(aai1, aai2, (a1, a2) => | ||
| Cast(zip_with(transform(a1, plusOne), transform(a2, plusOne), add), StringType)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent
| checkAnswer(df1.selectExpr("zip_with(val1, val2, (x, y) -> x + y)"), expectedValue1) | ||
|
|
||
| val expectedValue2 = Seq( | ||
| Row(Seq(Row(1.0, 1), Row(2.0, null), Row(null, 3))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 1.0 or 2.0 instead of 1L or 2L?
| select zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y)) as v; | ||
|
|
||
| -- Zip with array coalesce | ||
| select zip_with(array('a'), array('d', null, 'f'), (x, y) -> coalesce(x, y)) as v; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a line break at the end of the file?
|
Test build #94833 has finished for PR 22031 at commit
|
|
Test build #94839 has finished for PR 22031 at commit
|
|
Jenkins, retest this please. |
|
@techaddict Could you fix the conflicts please? Thanks! |
|
LGTM pending Jenkins. |
|
Test build #94845 has finished for PR 22031 at commit
|
|
Test build #94846 has finished for PR 22031 at commit
|
|
Thanks! merging to master. |
## What changes were proposed in this pull request? This is a follow-up pr of apache#22031 which added `zip_with` function to fix an example. ## How was this patch tested? Existing tests. Closes apache#22194 from ueshin/issues/SPARK-23932/fix_examples. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function:
How was this patch tested?
Added tests