[SPARK-23932][SQL] Higher order function zip_with #22031

techaddict · 2018-08-07T18:39:59Z

What changes were proposed in this pull request?

Merges the two given arrays, element-wise, into a single array using the given function. If one array is shorter, nulls are appended to it to match the length of the longer array before the function is applied:

    SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- [ROW('a', 1), ROW('b', 3), ROW('c', 5)]
    SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6]
    SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> concat(x, y)); -- ['ad', 'be', 'cf']
    SELECT zip_with(ARRAY['a'], ARRAY['d', null, 'f'], (x, y) -> coalesce(x, y)); -- ['a', null, 'f']
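
The padding behaviour described above can be modelled in plain Scala, without Spark. This is only a minimal sketch: `ZipWithSketch` and `zipWith` are hypothetical names, and `Option` stands in for a nullable SQL value. It is not the implementation in this PR.

```scala
object ZipWithSketch {
  // Pads the shorter input with None (standing in for SQL NULL),
  // then applies f to each pair of elements.
  def zipWith[A, B, C](left: Seq[A], right: Seq[B])(f: (Option[A], Option[B]) => C): Seq[C] = {
    val n = math.max(left.length, right.length)
    // Option(a) turns a null element into None as well.
    val l: Seq[Option[A]] = left.map(a => Option(a)).padTo(n, None)
    val r: Seq[Option[B]] = right.map(b => Option(b)).padTo(n, None)
    l.zip(r).map { case (x, y) => f(x, y) }
  }
}
```

For example, `ZipWithSketch.zipWith(Seq("a"), Seq("d", null, "f"))((x, y) => x.orElse(y))` mirrors the `coalesce` query: the left input is padded to length three before the function runs.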

How was this patch tested?

Added tests

This reverts commit 6f91777.

This reverts commit 03d19ce.

crafty-coder · 2018-08-07T21:12:34Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+  usage = "_FUNC_(expr, func) - Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3), x -> x + 1);


The examples are not accurate.

You could use something like:

    > SELECT _FUNC_(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x));
     array(('a', 1), ('b', 2), ('c', 3))
    > SELECT _FUNC_(array(1, 2), array(3, 4), (x, y) -> x + y);
     array(4, 6)
    > SELECT _FUNC_(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y));
     array('ad', 'be', 'cf')

mn-mikke · 2018-08-07T21:55:43Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+  override def dataType: ArrayType = ArrayType(function.dataType, function.nullable)
+
+  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ArraysZipWith = {
+    val (leftElementType, leftContainsNull) = left.dataType match {


You can utilize HigherOrderFunction.arrayArgumentType.

This comment is not valid anymore. The method has been removed by #22075.

mn-mikke · 2018-08-07T22:01:47Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    val leftArr = left.eval(input).asInstanceOf[ArrayData]
+    val rightArr = right.eval(input).asInstanceOf[ArrayData]
+
+    if (leftArr == null || rightArr == null) {


If leftArr is null, right doesn't have to be evaluated.
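
A rough illustration of that short-circuit, assuming the two children are modelled as thunks (`() => Seq[Int]`) rather than Spark `Expression`s. `LazyEvalSketch` and `evalZip` are hypothetical names, and the plain `zip` here only stands in for the real merge logic:

```scala
object LazyEvalSketch {
  // Evaluates the left child first and only touches the right child
  // when the left result is non-null.
  def evalZip(left: () => Seq[Int], right: () => Seq[Int]): Option[Seq[(Int, Int)]] = {
    val l = left()
    if (l == null) {
      None // right() is never called on this path
    } else {
      val r = right()
      if (r == null) None else Some(l.zip(r))
    }
  }
}
```

If the right child is expensive to evaluate, this ordering skips that work whenever the left side is already null.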

mn-mikke · 2018-08-07T22:18:23Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+        (elementType, containsNull)
+    }
+    copy(function = f(function,
+      (leftElementType, leftContainsNull) :: (rightElementType, rightContainsNull) :: Nil))


If you want to support input arrays of different sizes (the Jira ticket says: "Both arrays must be the same length."), what about the scenario where one array is empty and the other has elements? Shouldn't we use true instead of leftContainsNull and rightContainsNull?

@mn-mikke @ueshin "Both arrays must be the same length" was how zip_with in Presto used to work; they have since moved to appending nulls and processing regardless.

If we append nulls to the shorter array, both of the arguments might be null, so we should use true for nullabilities of the arguments as @mn-mikke suggested.
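
To make the nullability argument concrete, here is a small sketch under stated assumptions: `ArrType` is a hypothetical stand-in for Spark's `ArrayType`, element types are plain strings, and `None` stands in for a padded null. It only shows why the declared `containsNull` flags cannot be trusted once padding is allowed.

```scala
object NullabilitySketch {
  // Hypothetical stand-in for ArrayType(elementType, containsNull).
  final case class ArrType(elementType: String, containsNull: Boolean)

  // Values the lambda would see for the left argument after padding.
  def paddedLeft[A, B](left: Seq[A], right: Seq[B]): Seq[Option[A]] =
    left.map(a => Option(a)).padTo(math.max(left.length, right.length), None)

  // Argument types handed to the lambda: both are marked nullable,
  // regardless of the inputs' containsNull flags.
  def lambdaArgTypes(left: ArrType, right: ArrType): Seq[(String, Boolean)] =
    Seq((left.elementType, true), (right.elementType, true))
}
```

With an empty left array declared as `containsNull = false` and a two-element right array, `paddedLeft` yields only `None` values, so the lambda must be bound with nullable arguments.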

SparkQA · 2018-08-07T23:28:27Z

Test build #94389 has finished for PR 22031 at commit 6f91777.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2018-08-08T00:36:54Z

Test build #94391 has finished for PR 22031 at commit 14ef371.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArraysZipWith(

SparkQA · 2018-08-08T01:33:46Z

Test build #94392 has finished for PR 22031 at commit c7e2dee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-08-08T03:12:00Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

+    testArrayOfPrimitiveTypeContainsNull()
+  }
+
+


Can you add a test for invalid cases?

Also can you add tests to HigherOrderFunctionsSuite to check more explicit patterns?

ueshin · 2018-08-08T03:13:19Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+
+  override def functions: Seq[Expression] = List(function)
+
+  def expectingFunctionType: AbstractDataType = AnyDataType


We don't need to define this?

ueshin · 2018-08-08T03:15:44Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    val LambdaFunction(_,
+      (arr1Var: NamedLambdaVariable):: (arr2Var: NamedLambdaVariable) :: Nil, _) = function
+    (arr1Var, arr2Var)
+  }


nit: the following should work:

@transient lazy val LambdaFunction(_, Seq(leftElemVar: NamedLambdaVariable, rightElemVar: NamedLambdaVariable), _) = function
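
For readers unfamiliar with pattern definitions on a `lazy val`, a self-contained sketch of the same idea follows. `Expr`, `Var`, `Lambda` and `ZipWithNode` are hypothetical stand-ins for the Catalyst classes, and `@transient` is left out because nothing here is serialized.

```scala
object LambdaVarsSketch {
  sealed trait Expr
  final case class Var(name: String) extends Expr
  final case class Lambda(body: Expr, args: Seq[Expr], hidden: Boolean = false) extends Expr

  final class ZipWithNode(function: Expr) {
    // One pattern definition binds both variables; it is evaluated on first
    // access and throws a MatchError if `function` has a different shape.
    lazy val Lambda(_, Seq(leftElemVar: Var, rightElemVar: Var), _) = function
  }
}
```

Compared with the `::`-and-`Nil` pattern plus a tuple, this avoids the intermediate pair and names both variables directly. Newer Scala 3 compilers may warn that the pattern is refutable.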

ueshin · 2018-08-15T04:25:30Z

Hi @techaddict,
Do you have time to continue working on this?
If you don't have enough time, I can take this over, so please let me know.
Thanks!

techaddict · 2018-08-15T04:28:11Z

Hi @ueshin, I will update the PR tomorrow.

ueshin · 2018-08-15T04:31:12Z

@techaddict Thanks! I look forward to the update.

SparkQA · 2018-08-16T00:38:22Z

Test build #94829 has finished for PR 22031 at commit d6c44a6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArraysZipWith(left: Expression, right: Expression, function: Expression)

SparkQA · 2018-08-16T02:55:40Z

Test build #94830 has finished for PR 22031 at commit 92cb34a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ZipWith(left: Expression, right: Expression, function: Expression)

ueshin · 2018-08-16T03:13:01Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+      case _ =>
+        val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType
+        (elementType, containsNull)
+    }


Now we can do:

val ArrayType(leftElementType, leftContainsNull) = left.dataType
val ArrayType(rightElementType, rightContainsNull) = right.dataType
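
As a standalone illustration of why no `match` with a fallback case (and no cast) is needed, the sketch below destructures a value typed as the supertype. `DataType`, `IntType` and `ArrType` are hypothetical stand-ins for the Spark types; the pattern relies on the input already having been checked to be an array.

```scala
object ExtractorSketch {
  sealed trait DataType
  case object IntType extends DataType
  final case class ArrType(elementType: DataType, containsNull: Boolean) extends DataType

  def elementInfo(dt: DataType): (DataType, Boolean) = {
    // Destructures directly from the supertype; throws a MatchError
    // if dt is not an ArrType.
    val ArrType(elementType, containsNull) = dt
    (elementType, containsNull)
  }
}
```

The trade-off is that a non-array input fails at runtime with a `MatchError`, so this form is only appropriate once type checking has ruled that case out.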

ueshin · 2018-08-16T03:17:51Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+        (elementType, containsNull)
+    }
+    copy(function = f(function,
+      (leftElementType, leftContainsNull) :: (rightElementType, rightContainsNull) :: Nil))


If we append nulls to the shorter array, both of the arguments might be null, so we should use true for nullabilities of the arguments as @mn-mikke suggested.

ueshin · 2018-08-16T03:19:45Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala

+        right: Expression,
+        f: (Expression, Expression) => Expression): Expression = {
+      val ArrayType(leftT, leftContainsNull) = left.dataType.asInstanceOf[ArrayType]
+      val ArrayType(rightT, rightContainsNull) = right.dataType.asInstanceOf[ArrayType]


nit: we don't need .asInstanceOf[ArrayType]?

ueshin · 2018-08-16T03:21:33Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala

+    }
+
+    val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false))
+    val ai1 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false))


What's the difference between ai0 and ai1?

ueshin · 2018-08-16T03:24:17Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala

+      ArrayType(ArrayType(IntegerType, containsNull = false), containsNull = true))
+    checkEvaluation(
+      zip_with(aai1, aai2, (a1, a2) =>
+          Cast(zip_with(transform(a1, plusOne), transform(a2, plusOne), add), StringType)),


nit: indent

ueshin · 2018-08-16T03:31:22Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

+    checkAnswer(df1.selectExpr("zip_with(val1, val2, (x, y) -> x + y)"), expectedValue1)
+
+    val expectedValue2 = Seq(
+      Row(Seq(Row(1.0, 1), Row(2.0, null), Row(null, 3))),


Why 1.0 or 2.0 instead of 1L or 2L?

ueshin · 2018-08-16T03:33:38Z

sql/core/src/test/resources/sql-tests/inputs/higher-order-functions.sql

+select zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y)) as v;
+
+-- Zip with array coalesce
+select zip_with(array('a'), array('d', null, 'f'), (x, y) -> coalesce(x, y)) as v;


Can you add a line break at the end of the file?

SparkQA · 2018-08-16T04:51:58Z

Test build #94833 has finished for PR 22031 at commit 0342ed9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-16T07:05:02Z

Test build #94839 has finished for PR 22031 at commit 16516ec.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-08-16T07:55:18Z

Jenkins, retest this please.

ueshin · 2018-08-16T08:11:39Z

@techaddict Could you fix the conflicts please? Thanks!

ueshin · 2018-08-16T09:15:42Z

LGTM pending Jenkins.

SparkQA · 2018-08-16T11:42:02Z

Test build #94845 has finished for PR 22031 at commit 16516ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-16T12:54:32Z

Test build #94846 has finished for PR 22031 at commit 2388130.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-08-16T14:01:34Z

Thanks! merging to master.

## What changes were proposed in this pull request?

This is a follow-up pr of apache#22031 which added `zip_with` function to fix an example.

## How was this patch tested?

Existing tests.

Closes apache#22194 from ueshin/issues/SPARK-23932/fix_examples.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

techaddict added 6 commits August 2, 2018 21:15

.

03d19ce

fix description

6f91777

Revert "fix description"

cc0752a

This reverts commit 6f91777.

Revert "."

f20d646

This reverts commit 03d19ce.

Merge branch 'master' into SPARK-23932

f8c0320

merge master

14ef371

crafty-coder reviewed Aug 7, 2018

address PR comments

c7e2dee

mn-mikke reviewed Aug 7, 2018

ueshin reviewed Aug 8, 2018

techaddict added 3 commits August 15, 2018 14:06

Merge remote-tracking branch 'upstream/master' into SPARK-23932

35d2cbc

Add tests

d6c44a6

test in HigherOrderFunctionsSuite

92cb34a

add more tests

0342ed9

techaddict changed the title ~~[TODO][SPARK-23932][SQL] Higher order function zip_with~~ [SPARK-23932][SQL] Higher order function zip_with Aug 16, 2018

ueshin reviewed Aug 16, 2018

address all comments

16516ec

Merge branch 'master' into SPARK-23932

248bccf

rebase on master

2388130

asfgit closed this in ea63a7a Aug 16, 2018

ueshin mentioned this pull request Aug 23, 2018

[SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with function. #22194

Closed

gatorsmile mentioned this pull request Oct 25, 2018

[SPARK-25832][SQL][BRANCH-2.4] Revert newly added map related functions #22827

Closed

