
Conversation

@mn-mikke (Contributor) commented Apr 21, 2018

What changes were proposed in this pull request?

This PR adds the function zip_with_index(array[, indexFirst, startFromZero]), which transforms the input array by wrapping each element in a pair with an index indicating its position in the array.

zip_with_index(array("d", "a", null, "b")) => [("d",1),("a",2),(null,3),("b",4)]
zip_with_index(array("d", "a", null, "b"), true, false) => [(1,"d"),(2,"a"),(3,null),(4,"b")]
zip_with_index(array("d", "a", null, "b"), true, true) => [(0,"d"),(1,"a"),(2,null),(3,"b")]
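
For intuition, the proposed semantics are essentially Scala's zipWithIndex with a configurable starting offset and pair order. A plain-Scala sketch of the behaviour described above (illustrative only, not the Catalyst expression itself):

def zipWithIndexLike[A](xs: Seq[A], indexFirst: Boolean = false, startFromZero: Boolean = false): Seq[Any] =
  xs.zipWithIndex.map { case (x, i) =>
    val idx = if (startFromZero) i else i + 1           // 1-based by default
    if (indexFirst) (idx, x) else (x, idx)              // value first by default
  }

zipWithIndexLike(Seq("d", "a", null, "b"))               // Seq(("d",1), ("a",2), (null,3), ("b",4))
zipWithIndexLike(Seq("d", "a", null, "b"), true, true)   // Seq((0,"d"), (1,"a"), (2,null), (3,"b"))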

How was this patch tested?

New tests added into:

  • CollectionExpressionSuite
  • DataFrameFunctionsSuite

Codegen examples

Primitive type

val df = Seq(
  Seq(1, 3, 4, 2),
  null
).toDF("i")
df.filter($"i".isNotNull || $"i".isNull).select(zip_with_index($"i")).debugCodegen

Result:

/* 033 */         boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0);
/* 034 */         ArrayData inputadapter_value_0 = inputadapter_isNull_0 ?
/* 035 */         null : (inputadapter_row_0.getArray(0));
/* 036 */
/* 037 */         boolean filter_value_0 = true;
/* 038 */
/* 039 */         if (!(!inputadapter_isNull_0)) {
/* 040 */           filter_value_0 = inputadapter_isNull_0;
/* 041 */         }
/* 042 */         if (!filter_value_0) continue;
/* 043 */
/* 044 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 045 */
/* 046 */         boolean project_isNull_0 = inputadapter_isNull_0;
/* 047 */         ArrayData project_value_0 = null;
/* 048 */
/* 049 */         if (!inputadapter_isNull_0) {
/* 050 */           final int project_numElements_0 = inputadapter_value_0.numElements();
/* 051 */
/* 052 */           final int project_structSize_0 = 24;
/* 053 */           final long project_byteArraySize_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(project_numElements_0, 8 + project_structSize_0);
/* 054 */           final int project_structsOffset_0 = UnsafeArrayData.calculateHeaderPortionInBytes(project_numElements_0) + project_numElements_0 * 8;
/* 055 */           if (project_byteArraySize_0 > 2147483632) {
/* 056 */             final Object[] project_internalRowArray_0 = new Object[project_numElements_0];
/* 057 */             for (int z = 0; z < project_numElements_0; z++) {
/* 058 */               project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{inputadapter_value_0.getInt(z), z + 1});
/* 059 */             }
/* 060 */             project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0);
/* 061 */
/* 062 */           } else {
/* 063 */             final byte[] project_byteArray_0 = new byte[(int)project_byteArraySize_0];
/* 064 */             UnsafeArrayData project_unsafeArrayData_0 = new UnsafeArrayData();
/* 065 */             Platform.putLong(project_byteArray_0, 16, project_numElements_0);
/* 066 */             project_unsafeArrayData_0.pointTo(project_byteArray_0, 16, (int)project_byteArraySize_0);
/* 067 */             UnsafeRow project_unsafeRow_0 = new UnsafeRow(2);
/* 068 */             for (int z = 0; z < project_numElements_0; z++) {
/* 069 */               long offset = project_structsOffset_0 + z * project_structSize_0;
/* 070 */               project_unsafeArrayData_0.setLong(z, (offset << 32) + project_structSize_0);
/* 071 */               project_unsafeRow_0.pointTo(project_byteArray_0, 16 + offset, project_structSize_0);
/* 072 */               if (false && inputadapter_value_0.isNullAt(z)) {
/* 073 */                 project_unsafeRow_0.setNullAt(0);
/* 074 */               } else {
/* 075 */                 project_unsafeRow_0.setInt(
/* 076 */                   0,
/* 077 */                   inputadapter_value_0.getInt(z)
/* 078 */                 );
/* 079 */               }
/* 080 */               project_unsafeRow_0.setInt(1, z + 1);
/* 081 */             }
/* 082 */             project_value_0 = project_unsafeArrayData_0;
/* 083 */           }
/* 084 */
/* 085 */         }

Non-primitive type

val df = Seq(
  Seq("d", "a", "f", "g"),
  null
).toDF("s")
df.filter($"s".isNotNull || $"s".isNull).select(zip_with_index($"s")).debugCodegen

Result:

/* 033 */         boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0);
/* 034 */         ArrayData inputadapter_value_0 = inputadapter_isNull_0 ?
/* 035 */         null : (inputadapter_row_0.getArray(0));
/* 036 */
/* 037 */         boolean filter_value_0 = true;
/* 038 */
/* 039 */         if (!(!inputadapter_isNull_0)) {
/* 040 */           filter_value_0 = inputadapter_isNull_0;
/* 041 */         }
/* 042 */         if (!filter_value_0) continue;
/* 043 */
/* 044 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 045 */
/* 046 */         boolean project_isNull_0 = inputadapter_isNull_0;
/* 047 */         ArrayData project_value_0 = null;
/* 048 */
/* 049 */         if (!inputadapter_isNull_0) {
/* 050 */           final int project_numElements_0 = inputadapter_value_0.numElements();
/* 051 */
/* 052 */           final Object[] project_internalRowArray_0 = new Object[project_numElements_0];
/* 053 */           for (int z = 0; z < project_numElements_0; z++) {
/* 054 */             project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{inputadapter_value_0.getUTF8String(z), z + 1});
/* 055 */           }
/* 056 */           project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0);
/* 057 */
/* 058 */         }

@mn-mikke (Contributor Author)

cc @gatorsmile @ueshin @kiszk

@HyukjinKwon (Member)

ok to test

Member

nit: there's one more leading space here.

Member

nit: // scalastyle:on line.size.limit

Contributor Author

Done.

Member

Let's avoid using a default value in APIs. It doesn't work in Java.

@gatorsmile (Member)

Which database has this function?

Member

nit: How about val (valuePosition, indexPosition) = if (indexFirstValue) ("1", "0") else ("0", "1")?

@mn-mikke (Contributor Author)

@gatorsmile I'm not aware of any. From user experience, I strongly feel that such a function is missing, especially once the transform function is introduced.

@SparkQA commented Apr 21, 2018

Test build #89679 has finished for PR 21121 at commit 551d04d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 21, 2018

Test build #89683 has finished for PR 21121 at commit a599544.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 22, 2018

Test build #89686 has finished for PR 21121 at commit 06348c3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Wrong doc?

Contributor Author

Good spot. Thanks!

Member

Should the index be 0-based or 1-based? Other array functions seem to be 1-based.

Contributor Author

That's a really good question! The newly added functions element_at and array_position are 1-based, but on the other hand, getItem from the Column class is 0-based. What about adding one extra parameter and letting users decide whether the array will be indexed from 0 or 1?
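
For reference, a quick spark-shell sketch of the existing behaviour being contrasted (current Spark APIs):

val df = Seq(Seq("a", "b", "c")).toDF("arr")
df.selectExpr("element_at(arr, 1)").show()   // "a" -- element_at is 1-based
df.select($"arr".getItem(0)).show()          // "a" -- Column.getItem is 0-based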

@viirya (Member) commented Apr 22, 2018

Are we sure the input is always an unsafe-backed array? What if it is GenericArrayData?

Member

Ah, I see. You just use an unsafe-backed array as the output.

@viirya (Member) commented Apr 22, 2018

Btw, if we use GenericArrayData as the output array, can't we avoid this limit?

Contributor Author

I like your suggestion. So instead of throwing the exception, the function will execute a similar piece of code as in genCodeForNonPrimitiveElements...
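
Roughly, the fallback discussed here would mirror the non-primitive path: wrap each (element, index) pair in a GenericInternalRow and return a GenericArrayData instead of an UnsafeArrayData. An interpreted-style sketch (illustrative only, not the generated code):

import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

// getElement abstracts over the typed accessor (getInt, getUTF8String, ...).
def zipWithIndexGeneric(input: ArrayData, getElement: Int => Any): ArrayData = {
  val n = input.numElements()
  val rows = new Array[Any](n)
  var i = 0
  while (i < n) {
    rows(i) = new GenericInternalRow(Array[Any](getElement(i), i + 1))
    i += 1
  }
  new GenericArrayData(rows)
}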

Member

Ah, we can alleviate this limitation (up to MAX_ARRAY_LENGTH elements) if we use GenericArrayData. BTW, we have to do the same check in genCodeForNonPrimitiveElements, too.

Member

nvm, this is zip that does not involve concat of multiple arrays.

@SparkQA commented Apr 24, 2018

Test build #89802 has finished for PR 21121 at commit fd4e473.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ZipWithIndex(child: Expression, indexFirst: Expression, startFromZero: Expression)

@SparkQA commented Apr 24, 2018

Test build #89803 has finished for PR 21121 at commit 67915f2.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 25, 2018

Test build #89804 has finished for PR 21121 at commit 4e9b140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Apr 25, 2018

Can we remove the null check if containsNull is false even when elementType is not a primitive type? For example, ArrayType(StringType, false).
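
For illustration, a schema where that optimization would apply (current Spark API; whether the generated code actually skips the check is what this comment asks about):

import org.apache.spark.sql.types.{ArrayType, StringType}

// containsNull = false declares that the array has no null elements,
// so the per-element isNullAt() check could be elided.
val noNullStrings = ArrayType(StringType, containsNull = false)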

Contributor Author

Good spot! Thanks.

@SparkQA commented Apr 25, 2018

Test build #89838 has finished for PR 21121 at commit ffcebe3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 25, 2018

Test build #89857 has finished for PR 21121 at commit fd71544.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(

@mn-mikke force-pushed the feature/array-api-zip_with_index-to-master branch from fd71544 to 51c8199 on April 26, 2018 13:49
@mn-mikke force-pushed the feature/array-api-zip_with_index-to-master branch from 51c8199 to bcd52bd on April 26, 2018 14:03
@SparkQA commented Apr 26, 2018

Test build #89887 has finished for PR 21121 at commit 51c8199.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayJoin(
  • case class Flatten(child: Expression) extends UnaryExpression
  • case class MonthsBetween(
  • trait QueryPlanConstraints extends ConstraintHelper
  • trait ConstraintHelper
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(
  • case class WriteToContinuousDataSource(
  • case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)

@SparkQA commented Apr 26, 2018

Test build #89888 has finished for PR 21121 at commit bcd52bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayJoin(
  • case class Flatten(child: Expression) extends UnaryExpression
  • case class MonthsBetween(
  • trait QueryPlanConstraints extends ConstraintHelper
  • trait ConstraintHelper
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(
  • case class WriteToContinuousDataSource(
  • case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)

@ueshin (Member) commented Apr 27, 2018

I'm still not sure we really need this function.
If the purpose is only the transform use case you mentioned at #21121 (comment), how about adding a second parameter to transform to pass the index? That would seem to perform better because we don't need to materialize the struct.
It's also up to @hvanhovell who I believe is working on transform, so I'd like to wait for his opinion as well.

@lokm01 commented Apr 27, 2018

@ueshin Currently we use our own implementation of zipWithIndex when we do explode and need to preserve the ordering of the array elements (especially if there is a shuffle involved in the subsequent transformation).

Sure, once transform becomes available, it will be much better and more performant to use that, but since we're dealing with production applications, we would like to start rewriting these jobs with those small "drop-in" replacements for functions such as zipWithIndex before going for a major rewrite with HOFs in Spark SQL.

I've seen many threads in the community that recommend the same approach when dealing with these difficult array cases - I'm pretty sure it will benefit other users.
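
For context, one common shape of such a drop-in helper (a hypothetical UDF-based sketch, not the commenter's actual implementation):

import org.apache.spark.sql.functions.{explode, udf}

// Hypothetical zipWithIndex helper: pairs each element with its 1-based position.
val zipWithIndexUdf = udf { xs: Seq[String] =>
  if (xs == null) null else xs.zipWithIndex.map { case (x, i) => (x, i + 1) }
}

// Explode while keeping the original position, so ordering survives later shuffles.
val df = Seq(Seq("d", "a", "f", "g")).toDF("s")
df.select(explode(zipWithIndexUdf($"s")).as("e"))
  .select($"e._1".as("value"), $"e._2".as("idx"))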

@mn-mikke (Contributor Author)

@ueshin What about combining zip_with_index with map_from_entries?

@Tagar commented May 1, 2018

Would this cover https://issues.apache.org/jira/browse/SPARK-23074 as well? Thanks.

@rxin (Contributor) commented May 1, 2018

@lokm01 wouldn't @ueshin's suggestion on adding a second parameter to transform work for you? You can just do something similar to transform(x, (entry, index) -> struct(entry, index)). Perhaps zip_with_index is just an alias for that.
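
For reference, a minimal sketch of that approach using the index-aware transform lambda (the two-argument form later available in Spark SQL; the struct field names are whatever struct assigns):

// Roughly equivalent to zip_with_index, with a 0-based index.
val df = Seq(Seq("d", "a", null, "b")).toDF("arr")
df.selectExpr("transform(arr, (x, i) -> struct(x, i)) AS zipped")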

@mn-mikke (Contributor Author) commented May 2, 2018

@rxin Oh, I see. In that case, I'm happy to close the PR. @hvanhovell Can you confirm that the transform function will pass the index into lambda functions?

@AmplabJenkins

Can one of the admins verify this patch?

@ueshin (Member) commented Aug 9, 2018

@mn-mikke I think we can close this since we've added transform which can take the index argument as suggested.

@mn-mikke (Contributor Author) commented Aug 9, 2018

Sure, closing ...
