@techaddict
Contributor

What changes were proposed in this pull request?

Adds the higher-order function zip_with, which merges the two given arrays, element-wise, into a single array using the given function. If one array is shorter, nulls are appended to it to match the length of the longer array before the function is applied:

    SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- [ROW('a', 1), ROW('b', 3), ROW('c', 5)]
    SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6]
    SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> concat(x, y)); -- ['ad', 'be', 'cf']
    SELECT zip_with(ARRAY['a'], ARRAY['d', null, 'f'], (x, y) -> coalesce(x, y)); -- ['a', null, 'f']
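
Editorial note: the padding behaviour described above can be approximated on plain Scala collections. This is only an illustrative sketch under stated assumptions, not the Spark implementation: `ZipWithSketch` and `zipWith` are hypothetical names, and `Option` stands in for a SQL NULL element.

```scala
object ZipWithSketch {
  // Hypothetical model of zip_with: None plays the role of SQL NULL.
  // The shorter input is treated as if it were padded with None up to
  // the length of the longer input, then `f` is applied position by position.
  def zipWith[A, B, C](left: Seq[Option[A]], right: Seq[Option[B]])(
      f: (Option[A], Option[B]) => C): Seq[C] = {
    val n = math.max(left.length, right.length)
    (0 until n).map { i =>
      // lift(i) is None past the end of a sequence; flatten merges that
      // "missing" case with an explicit NULL element.
      f(left.lift(i).flatten, right.lift(i).flatten)
    }
  }
}
```

In this model, `ZipWithSketch.zipWith(Seq[Option[String]](Some("a")), Seq[Option[String]](Some("d"), None, Some("f")))((x, y) => x.orElse(y))` mirrors the `coalesce` example and yields `Some("a")`, `None`, `Some("f")`.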

How was this patch tested?

Added tests

usage = "_FUNC_(expr, func) - Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.",
examples = """
Examples:
> SELECT _FUNC_(array(1, 2, 3), x -> x + 1);
Contributor

The examples are not accurate.

You could use something like:

 > SELECT _FUNC_(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x));
  array(('a', 1), ('b', 2), ('c', 3))
 > SELECT _FUNC_(array(1, 2), array(3, 4), (x, y) -> x + y);
  array(4, 6)
 > SELECT _FUNC_(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y));
  array('ad', 'be', 'cf')

override def dataType: ArrayType = ArrayType(function.dataType, function.nullable)

override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ArraysZipWith = {
val (leftElementType, leftContainsNull) = left.dataType match {
Contributor

You can utilize HigherOrderFunction.arrayArgumentType.

Contributor

This comment is not valid anymore. The method has been removed by #22075.

val leftArr = left.eval(input).asInstanceOf[ArrayData]
val rightArr = right.eval(input).asInstanceOf[ArrayData]

if (leftArr == null || rightArr == null) {
Contributor

If leftArr is null, right doesn't have to be evaluated.
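
Editorial note: the evaluation-order point can be illustrated without Spark. The sketch below is a hypothetical model, assuming each child expression is a thunk that may return `null`; `NullShortCircuitSketch`, `evalZip`, and its parameters are illustrative names, and the element-wise step is simplified to integer addition with plain `zip`, which truncates rather than pads.

```scala
object NullShortCircuitSketch {
  // Hypothetical stand-in for evaluating two child expressions.
  // The right child is only evaluated once the left result is known
  // to be non-null, so a null left input costs a single evaluation.
  def evalZip(evalLeft: () => Seq[Int], evalRight: () => Seq[Int]): Seq[Int] = {
    val leftArr = evalLeft()
    if (leftArr == null) {
      null // short-circuit: evalRight is never invoked
    } else {
      val rightArr = evalRight()
      if (rightArr == null) null
      else leftArr.zip(rightArr).map { case (x, y) => x + y }
    }
  }
}
```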

(elementType, containsNull)
}
copy(function = f(function,
(leftElementType, leftContainsNull) :: (rightElementType, rightContainsNull) :: Nil))
Contributor

If you want to support input arrays of different sizes (the JIRA ticket says: "Both arrays must be the same length."), what about the scenario where one array is empty and the other has elements? Shouldn't we use true instead of leftContainsNull and rightContainsNull?

Contributor Author

@mn-mikke @ueshin "both arrays must be the same length" was how zip_with in Presto used to work; it has since moved to appending nulls and processing the arrays regardless.

Member

If we append nulls to the shorter array, both of the arguments might be null, so we should use true for nullabilities of the arguments as @mn-mikke suggested.
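
Editorial note: a small sketch of why the lambda arguments have to be treated as nullable once padding is allowed. It assumes plain Scala sequences in place of Spark arrays; `PaddingNullabilitySketch`, `padded`, and `lambdaArgs` are hypothetical names, and `None` models a null argument.

```scala
object PaddingNullabilitySketch {
  // Wraps every element in Some and pads with None up to `length`.
  def padded[A](xs: Seq[A], length: Int): Seq[Option[A]] =
    xs.map(x => Some(x): Option[A]).padTo(length, None)

  // The argument pairs a two-argument lambda would receive after padding.
  def lambdaArgs[A, B](left: Seq[A], right: Seq[B]): Seq[(Option[A], Option[B])] = {
    val n = math.max(left.length, right.length)
    padded(left, n).zip(padded(right, n))
  }
}
```

Even when neither input contains a null element, for example an empty left array and a right array of `1, 2`, every left argument in this model is `None`, so the element types' own `containsNull` flags are not enough to describe what the lambda sees.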

@SparkQA

SparkQA commented Aug 7, 2018

Test build #94389 has finished for PR 22031 at commit 6f91777.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94391 has finished for PR 22031 at commit 14ef371.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArraysZipWith(

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94392 has finished for PR 22031 at commit c7e2dee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

testArrayOfPrimitiveTypeContainsNull()
}


Member

Can you add a test for invalid cases?

Member

Also can you add tests to HigherOrderFunctionsSuite to check more explicit patterns?


override def functions: Seq[Expression] = List(function)

def expectingFunctionType: AbstractDataType = AnyDataType
Member

We don't need to define this?

val LambdaFunction(_,
(arr1Var: NamedLambdaVariable):: (arr2Var: NamedLambdaVariable) :: Nil, _) = function
(arr1Var, arr2Var)
}
Member

nit: the following should work:

@transient lazy val LambdaFunction(_,
  Seq(leftElemVar: NamedLambdaVariable, rightElemVar: NamedLambdaVariable), _) = function
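
Editorial note: the suggestion relies on Scala's pattern definitions, which can bind several names from one `lazy val`. The sketch below uses toy types rather than Spark's classes; `Expr`, `NamedVar`, `Lambda`, and `ZipNode` are hypothetical stand-ins, and the `@transient` annotation from the suggestion is omitted because nothing here is serialized.

```scala
object LambdaPatternSketch {
  sealed trait Expr
  final case class NamedVar(name: String) extends Expr
  // Toy counterpart of a lambda node: body, arguments, and a flag.
  final case class Lambda(body: Expr, arguments: Seq[Expr], hidden: Boolean) extends Expr

  final class ZipNode(function: Expr) {
    // One pattern definition binds both lambda variables lazily.
    // A scala.MatchError is thrown on first access if `function`
    // is not a Lambda with exactly two NamedVar arguments.
    lazy val Lambda(_, Seq(leftElemVar: NamedVar, rightElemVar: NamedVar), _) = function
  }
}
```

Compared with matching into a tuple and destructuring it afterwards, this form names the variables directly, at the cost of a runtime `MatchError` if the shape assumption does not hold.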

@ueshin
Member

ueshin commented Aug 15, 2018

Hi @techaddict,
Do you have time to continue working on this?
If you don't have enough time, I can take this over, so please let me know.
Thanks!

@techaddict
Contributor Author

Hi @ueshin, I will update the PR tomorrow.

@ueshin
Member

ueshin commented Aug 15, 2018

@techaddict Thanks! I look forward to the update.

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94829 has finished for PR 22031 at commit d6c44a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArraysZipWith(left: Expression, right: Expression, function: Expression)

techaddict changed the title from "[TODO][SPARK-23932][SQL] Higher order function zip_with" to "[SPARK-23932][SQL] Higher order function zip_with" on Aug 16, 2018
@SparkQA

SparkQA commented Aug 16, 2018

Test build #94830 has finished for PR 22031 at commit 92cb34a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ZipWith(left: Expression, right: Expression, function: Expression)

case _ =>
val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType
(elementType, containsNull)
}
Member

Now we can do:

val ArrayType(leftElementType, leftContainsNull) = left.dataType
val ArrayType(rightElementType, rightContainsNull) = right.dataType

(elementType, containsNull)
}
copy(function = f(function,
(leftElementType, leftContainsNull) :: (rightElementType, rightContainsNull) :: Nil))
Member

If we append nulls to the shorter array, both of the arguments might be null, so we should use true for nullabilities of the arguments as @mn-mikke suggested.

right: Expression,
f: (Expression, Expression) => Expression): Expression = {
val ArrayType(leftT, leftContainsNull) = left.dataType.asInstanceOf[ArrayType]
val ArrayType(rightT, rightContainsNull) = right.dataType.asInstanceOf[ArrayType]
Member

nit: we don't need .asInstanceOf[ArrayType]?
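
Editorial note: the cast is unnecessary because an extractor pattern already performs the type test. The sketch below shows this with toy types; `DataT`, `IntT`, `ArrayT`, and `elementInfo` are hypothetical names, not Spark's `DataType` hierarchy.

```scala
object ExtractorSketch {
  sealed trait DataT
  case object IntT extends DataT
  final case class ArrayT(elementType: DataT, containsNull: Boolean) extends DataT

  // The pattern on the left checks that `dt` is an ArrayT and destructures it
  // in one step; no asInstanceOf is needed. A non-array input raises
  // scala.MatchError instead of ClassCastException.
  def elementInfo(dt: DataT): (DataT, Boolean) = {
    val ArrayT(elementType, containsNull) = dt
    (elementType, containsNull)
  }
}
```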

}

val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false))
val ai1 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false))
Member

What's the difference between ai0 and ai1?

ArrayType(ArrayType(IntegerType, containsNull = false), containsNull = true))
checkEvaluation(
zip_with(aai1, aai2, (a1, a2) =>
Cast(zip_with(transform(a1, plusOne), transform(a2, plusOne), add), StringType)),
Member

nit: indent

checkAnswer(df1.selectExpr("zip_with(val1, val2, (x, y) -> x + y)"), expectedValue1)

val expectedValue2 = Seq(
Row(Seq(Row(1.0, 1), Row(2.0, null), Row(null, 3))),
Member

Why 1.0 or 2.0 instead of 1L or 2L?

select zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y)) as v;

-- Zip with array coalesce
select zip_with(array('a'), array('d', null, 'f'), (x, y) -> coalesce(x, y)) as v;
Member

Can you add a line break at the end of the file?

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94833 has finished for PR 22031 at commit 0342ed9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94839 has finished for PR 22031 at commit 16516ec.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Aug 16, 2018

Jenkins, retest this please.

@ueshin
Member

ueshin commented Aug 16, 2018

@techaddict Could you fix the conflicts please? Thanks!

@ueshin
Member

ueshin commented Aug 16, 2018

LGTM pending Jenkins.

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94845 has finished for PR 22031 at commit 16516ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2018

Test build #94846 has finished for PR 22031 at commit 2388130.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Aug 16, 2018

Thanks! merging to master.

asfgit closed this in ea63a7a on Aug 16, 2018
HyukjinKwon pushed a commit to HyukjinKwon/spark that referenced this pull request Aug 23, 2018
## What changes were proposed in this pull request?

This is a follow-up PR to apache#22031, which added the `zip_with` function; it fixes an example.

## How was this patch tested?

Existing tests.

Closes apache#22194 from ueshin/issues/SPARK-23932/fix_examples.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>