@chakravarthiT
Contributor

@chakravarthiT chakravarthiT commented Mar 11, 2019

What changes were proposed in this pull request?

This PR standardises the default column names of the stack and json_tuple functions to follow one-based indexing (as discussed on PR #23748).

There is an inconsistency in the default column names for functions like stack and json_tuple. stack and json_tuple use col0, col1, etc. (i.e. 0-indexed columns), while other functions like inline use col1, col2, col3, etc. (i.e. 1-indexed columns).
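
To make the inconsistency concrete, here is a minimal sketch of the two naming schemes in plain Scala. The object and helper names are hypothetical and exist only for illustration; this is not the Spark implementation.

```scala
// Illustrative only: these helpers are hypothetical and are not Spark APIs.
object DefaultColumnNames {
  // Zero-based default names, as described above for stack: col0, col1, ...
  def zeroBased(n: Int): Seq[String] = (0 until n).map(i => s"col$i")

  // One-based default names, as described above for inline: col1, col2, ...
  def oneBased(n: Int): Seq[String] = (1 to n).map(i => s"col$i")

  def main(args: Array[String]): Unit = {
    println(zeroBased(3).mkString(", ")) // prints "col0, col1, col2"
    println(oneBased(3).mkString(", "))  // prints "col1, col2, col3"
  }
}
```

With the same number of output columns, the two schemes share every name except the first and the last, which is why a change of scheme is easy to miss.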

How was this patch tested?

Added unit test.

@chakravarthiT
Contributor Author

@maropu Please review.

@maropu
Member

maropu commented Mar 11, 2019

ISTM the title and description don't match your fix. CreateStruct is used in many places, but this fix only affects inline and stack?

@maropu
Member

maropu commented Mar 11, 2019

ok to test

@SparkQA

SparkQA commented Mar 11, 2019

Test build #103326 has finished for PR 24051 at commit 1958726.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chakravarthiT chakravarthiT changed the title [SPARK-26879][SQL] Zero based column indexing for inline functions [SPARK-26879][SQL] Zero based column indexing for struct function Mar 12, 2019
Member

What functions are affected by this change? Only Inline function?

@viirya
Member

viirya commented Mar 12, 2019

Which one is more widely used by other DBs? 1-indexed or 0-indexed?

@chakravarthiT
Contributor Author

Hi @maropu, I have updated the title and description. This fix is basically for the struct function (as inline internally uses struct). But from the UT failure, it seems that Hive uses one-based column indexing for the struct function. Is it OK to make this change?

@viirya
Member

viirya commented Mar 12, 2019

How about standardizing the column indexing on 1-indexed columns?

@maropu
Member

maropu commented Mar 12, 2019

Can you check if we can use the 1-based index in all places?

@chakravarthiT chakravarthiT force-pushed the ColumnDef branch 2 times, most recently from 33b973d to 23f2c13 Compare March 13, 2019 09:55
@chakravarthiT
Contributor Author

@maropu @viirya Thanks, that is a good idea. Only stack and json_tuple use zero-based column indexing, so I have changed both functions to follow one-based indexing.

@chakravarthiT chakravarthiT changed the title [SPARK-26879][SQL] Zero based column indexing for struct function [SPARK-26879][SQL] Standardize one-based column indexing for stack and json_tuple function Mar 13, 2019
@SparkQA

SparkQA commented Mar 13, 2019

Test build #103431 has finished for PR 24051 at commit 33b973d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 13, 2019

Test build #103432 has finished for PR 24051 at commit 23f2c13.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 13, 2019

Test build #103434 has finished for PR 24051 at commit bf26840.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2019

Test build #103499 has finished for PR 24051 at commit 5ce5414.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chakravarthiT
Contributor Author

@maropu @viirya please review

@HyukjinKwon
Member

@chakravarthiT are there more functions to fix? If this is the last one, I guess it's fine.

assert(dfstack.columns(0) == "col1" && dfstack.columns(1) == "col2")
val dfjson_tuple = sql("SELECT json_tuple('{\"a\":1, \"b\":2}', 'a', 'b')")
assert(dfjson_tuple.columns(0) == "col1" && dfjson_tuple.columns(1) == "col2")
}
Member

I think we don't need this test.

Contributor

I think the test is not bad to have; this way, if we unintentionally change the behaviour in the future, something will catch it.

Contributor

@holdenk holdenk left a comment

I think that, regardless of which way we standardize, this is going to need a note in the release docs because it could cause run-time failures.
(Also, while the JIRA has an affects version of 2.4.0, I think this should not be backported; it should go only in 3.)
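
A minimal sketch of the kind of breakage this could cause, using a hypothetical plain-Scala model in which a schema is just an ordered list of column names. This is not Spark code, and the `resolve` helper and its error message are assumptions made for illustration.

```scala
// Hypothetical model, not Spark code: a schema is an ordered list of column names.
object ColumnLookup {
  // Resolve a column name to its position, or explain why it cannot be found.
  def resolve(columns: Seq[String], name: String): Either[String, Int] =
    columns.indexOf(name) match {
      case -1  => Left(s"cannot resolve '$name' given columns: ${columns.mkString(", ")}")
      case idx => Right(idx)
    }

  def main(args: Array[String]): Unit = {
    val before = Seq("col0", "col1") // zero-based default names
    val after  = Seq("col1", "col2") // one-based default names

    println(resolve(before, "col0")) // Right(0)
    println(resolve(after, "col0"))  // Left(...): the name no longer exists
    println(resolve(before, "col1")) // Right(1)
    println(resolve(after, "col1"))  // Right(0): same name, different column
  }
}
```

In this model, a lookup of `col0` fails outright after the change, while a lookup of `col1` keeps succeeding but points at a different column. Code that assigns explicit names (for example with `toDF` or a SQL alias) does not depend on the generated defaults at all.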

@chakravarthiT
Contributor Author

@chakravarthiT are there more functions to fix? If this is the last one, I guess it's fine.

@HyukjinKwon sorry for my late response. Yes, this is the last one; all the other functions follow one-based indexing.


  override def elementSchema: StructType = StructType(fieldExpressions.zipWithIndex.map {
-   case (_, idx) => StructField(s"c$idx", StringType, nullable = true)
+   case (_, idx) => StructField(s"col${idx + 1}", StringType, nullable = true)
Member

Last question, @chakravarthiT: what does Hive's json_tuple return for the column names? I think we matched the column names with Hive when we added this a long time ago.

@maropu
Member

maropu commented Jul 1, 2019

@chakravarthiT can you update the migration guide, too? cc: @gatorsmile

@HyukjinKwon
Member

ping @chakravarthiT to update or close

@AmplabJenkins

Can one of the admins verify this patch?

@maropu
Member

maropu commented Nov 6, 2019

I'll close this because of the author's inactivity.
