[SPARK-26879][SQL] Standardize one-based column indexing for stack and json_tuple function #24051

chakravarthiT · 2019-03-11T10:10:49Z

What changes were proposed in this pull request?

This PR standardizes the default column names of the stack and json_tuple functions to follow one-based indexing (as discussed in PR #23748).

There is an inconsistency in the default column names of these functions: stack and json_tuple use col0, col1, etc. (i.e. 0-indexed columns), while other functions such as inline use col1, col2, etc. (i.e. 1-indexed columns).
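To make the naming difference concrete, here is a minimal sketch in plain Scala. It does not use Spark; `DefaultColumnNames` and `generate` are hypothetical names chosen for illustration, and the `col` prefix is taken from the description above.

```scala
// Hypothetical helper, not part of Spark: builds default column names
// from a prefix, a column count, and a starting index.
object DefaultColumnNames {
  def generate(prefix: String, count: Int, start: Int): Seq[String] =
    (0 until count).map(i => s"$prefix${i + start}")

  def main(args: Array[String]): Unit = {
    // Zero-based naming, as described for stack before this change.
    println(generate("col", 3, start = 0).mkString(", ")) // col0, col1, col2
    // One-based naming, as described for inline and proposed here.
    println(generate("col", 3, start = 1).mkString(", ")) // col1, col2, col3
  }
}
```

The only difference between the two schemes is the offset added to the positional index, which is why the change in this PR is small but still visible to users.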

How was this patch tested?

Added unit test.

chakravarthiT · 2019-03-11T10:13:07Z

@maropu Please review.

maropu · 2019-03-11T13:19:18Z

ISTM the title and description don't match your fix. CreateStruct is used in many places, but this fix only affects inline and stack?

maropu · 2019-03-11T13:20:08Z

ok to test

SparkQA · 2019-03-11T15:21:57Z

Test build #103326 has finished for PR 24051 at commit 1958726.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-03-12T09:33:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

What functions are affected by this change? Only Inline function?

viirya · 2019-03-12T09:36:15Z

Which one is more widely used by other DBs? 1-indexed or 0-indexed?

chakravarthiT · 2019-03-12T09:38:56Z

Hi @maropu, I updated the title and description. This fix is basically for the struct function (as inline internally uses struct). But from the UT failure it seems that Hive uses one-based column indexing for the struct function. Is it OK to make this change?

viirya · 2019-03-12T09:49:38Z

How about standardizing the column indexing on 1-indexed columns?

maropu · 2019-03-12T11:13:25Z

Can you check if we can use the 1-based index in all places?

chakravarthiT · 2019-03-13T10:16:31Z

@maropu @viirya Thanks, that is a good idea. Only stack and json_tuple use zero-based column indexing, so I have changed both functions to follow one-based indexing.

SparkQA · 2019-03-13T11:52:20Z

Test build #103431 has finished for PR 24051 at commit 33b973d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-13T11:52:31Z

Test build #103432 has finished for PR 24051 at commit 23f2c13.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-13T16:18:57Z

Test build #103434 has finished for PR 24051 at commit bf26840.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.


SparkQA · 2019-03-14T19:01:29Z

Test build #103499 has finished for PR 24051 at commit 5ce5414.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

chakravarthiT · 2019-03-15T04:41:48Z

@maropu @viirya please review

HyukjinKwon · 2019-04-26T07:41:32Z

@chakravarthiT are there more functions to fix? If this is the last one, I guess it's fine.

HyukjinKwon · 2019-04-26T07:42:37Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+    assert(dfstack.columns(0) == "col1" && dfstack.columns(1) == "col2")
+    val dfjson_tuple = sql("SELECT json_tuple('{\"a\":1, \"b\":2}', 'a', 'b')")
+    assert(dfjson_tuple.columns(0) == "col1" && dfjson_tuple.columns(1) == "col2")
+  }


I think we don't need this test.


holdenk

I think that, regardless of which way we standardize, this is going to need a note in the release docs because it could cause run-time failures.
(Also, while the JIRA has an affects version of 2.4.0, I think this should not be backported; it should go only into 3.)
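A rough illustration of the run-time risk, sketched in plain Scala without Spark. The `RenameImpact` object, its `lookup` helper, and the sample rows are all hypothetical; they only model code that refers to a default column name by string.

```scala
// Hypothetical model, not Spark code: a result row keyed by default column names.
object RenameImpact {
  // Resolve a column by name, returning an error message if it is missing.
  def lookup(row: Map[String, String], name: String): Either[String, String] =
    row.get(name).toRight(s"cannot resolve column '$name'")

  def main(args: Array[String]): Unit = {
    val before = Map("col0" -> "1", "col1" -> "2") // zero-based names
    val after  = Map("col1" -> "1", "col2" -> "2") // one-based names

    // User code written against the old names resolves "col0" before the change...
    println(lookup(before, "col0"))
    // ...but the same reference no longer resolves after the rename.
    println(lookup(after, "col0"))
    // Worse, "col1" still resolves but now points at a different value.
    println(lookup(after, "col1"))
  }
}
```

The last case is the subtle one: a reference that keeps resolving but silently returns a different column would not surface as an error at all, which is another reason to document the change.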

holdenk · 2019-05-10T19:10:18Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+    assert(dfstack.columns(0) == "col1" && dfstack.columns(1) == "col2")
+    val dfjson_tuple = sql("SELECT json_tuple('{\"a\":1, \"b\":2}', 'a', 'b')")
+    assert(dfjson_tuple.columns(0) == "col1" && dfjson_tuple.columns(1) == "col2")
+  }


I think the test is not bad to have, this way if we change the behaviour in the future unintentionally something will catch it.

chakravarthiT · 2019-06-03T17:32:29Z

> @chakravarthiT are there more functions to fix? If this is the only last one, I guess it's fine.

@HyukjinKwon Sorry for my late response. Yes, this is the last one; all the other functions follow one-based indexing.

HyukjinKwon · 2019-06-13T10:39:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala


  override def elementSchema: StructType = StructType(fieldExpressions.zipWithIndex.map {
-    case (_, idx) => StructField(s"c$idx", StringType, nullable = true)
+    case (_, idx) => StructField(s"col${idx + 1}", StringType, nullable = true)


Last question, @chakravarthiT: what does Hive's json_tuple return for the column names? I think we matched the column names with Hive when we added this a long time ago.

maropu · 2019-07-01T01:22:47Z

@chakravarthiT can you update the migration guide, too? cc: @gatorsmile

HyukjinKwon · 2019-09-17T00:09:35Z

ping @chakravarthiT to update or close

AmplabJenkins · 2019-11-06T20:34:57Z

Can one of the admins verify this patch?

maropu · 2019-11-06T22:45:18Z

I'll close this because of the author's inactivity.

chakravarthiT changed the title ~~[SPARK-26879][SQL] Zero based column indexing for inline functions~~ [SPARK-26879][SQL] Zero based column indexing for struct function Mar 12, 2019

viirya reviewed Mar 12, 2019


chakravarthiT force-pushed the ColumnDef branch 2 times, most recently from 33b973d to 23f2c13 Compare March 13, 2019 09:55

chakravarthiT changed the title ~~[SPARK-26879][SQL] Zero based column indexing for struct function~~ [SPARK-26879][SQL] Standardize one-based column indexing for stack and json_tuple function Mar 13, 2019

chakravarthiT force-pushed the ColumnDef branch from 23f2c13 to bf26840 Compare March 13, 2019 12:16

[SPARK-26879][SQL] standardize one-based column indexing for stack an…

5ce5414

…d json_tuple functions

chakravarthiT force-pushed the ColumnDef branch from bf26840 to 5ce5414 Compare March 14, 2019 12:33

HyukjinKwon reviewed Apr 26, 2019


holdenk reviewed May 10, 2019


HyukjinKwon reviewed Jun 13, 2019


dongjoon-hyun added the SQL label Jun 14, 2019

maropu closed this Nov 6, 2019

maropu mentioned this pull request Jan 23, 2020

[SPARK-28962][SQL] Provide index argument to filter lambda functions #25666

Closed

8 participants
