[SPARK-26879][SQL] Standardize one-based column indexing for stack and json_tuple function #24051
|
@maropu Please review. |
|
ISTM the title and description don't match your fix. |
|
ok to test |
|
Test build #103326 has finished for PR 24051 at commit
|
Which functions are affected by this change? Only the inline function?
|
Which one is more widely used by other DBs? 1-indexed or 0-indexed? |
|
Hi @maropu, I updated the title and description. This fix is basically for the struct function (as inline internally uses struct). But from the UT failure, it seems that Hive uses one-based column indexing for the struct function. Is it OK to make this change? |
|
How about standardizing the column indexing on 1-indexed columns? |
|
Can you check whether we can use the 1-based index in all places?
33b973d to
23f2c13
Compare
|
Test build #103431 has finished for PR 24051 at commit
|
|
Test build #103432 has finished for PR 24051 at commit
|
23f2c13 to
bf26840
Compare
|
Test build #103434 has finished for PR 24051 at commit
|
…d json_tuple functions
bf26840 to
5ce5414
Compare
|
Test build #103499 has finished for PR 24051 at commit
|
|
@chakravarthiT are there more functions to fix? If this is the last one, I guess it's fine. |
| assert(dfstack.columns(0) == "col1" && dfstack.columns(1) == "col2") | ||
| val dfjson_tuple = sql("SELECT json_tuple('{\"a\":1, \"b\":2}', 'a', 'b')") | ||
| assert(dfjson_tuple.columns(0) == "col1" && dfjson_tuple.columns(1) == "col2") | ||
| } |
I think we don't need this test.
I think the test is not bad to have; this way, if we unintentionally change the behaviour in the future, something will catch it.
holdenk
left a comment
I think that, regardless of which way we standardize, this will need a note in the release docs because it could cause run-time failures.
(Also, while the JIRA has an affects version of 2.4.0, I think this should not be backported; it should go only in 3.)
@HyukjinKwon sorry for my late response. Yes, this is the last one; all the other functions follow one-based indexing. |
|
|
||
| override def elementSchema: StructType = StructType(fieldExpressions.zipWithIndex.map { | ||
| - case (_, idx) => StructField(s"c$idx", StringType, nullable = true) | ||
| + case (_, idx) => StructField(s"col${idx + 1}", StringType, nullable = true) |
Last question, @chakravarthiT: what does Hive's json_tuple return for the column names? I think we matched the column names with Hive when we added this a long time ago.
|
@chakravarthiT can you update the migration guide, too? cc: @gatorsmile |
|
ping @chakravarthiT to update or close |
|
Can one of the admins verify this patch? |
|
I'll close this because of the author's inactivity. |
What changes were proposed in this pull request?
This PR standardises the column indexing of the stack and json_tuple functions to follow one-based indexing (as discussed in PR #23748).
The default column names are inconsistent across functions: stack and json_tuple use col0, col1, etc. (i.e. 0-indexed columns), while other functions like inline use col1, col2, col3, etc. (i.e. 1-indexed columns).
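As a rough illustration of the naming difference described above, here is a plain-Scala sketch that builds default column names under both conventions. `DefaultColumnNames`, `zeroBased`, and `oneBased` are hypothetical names made up for this example and are not Spark APIs; the real change lives in the generators' schema code, as in the `elementSchema` diff for json_tuple in the review.

```scala
// Hypothetical helper for illustration only; not part of Spark.
object DefaultColumnNames {
  // Zero-based names: prefix0, prefix1, ... (the convention being replaced).
  def zeroBased(prefix: String, n: Int): Seq[String] =
    (0 until n).map(idx => s"$prefix$idx")

  // One-based names: prefix1, prefix2, ... (the convention proposed here).
  def oneBased(prefix: String, n: Int): Seq[String] =
    (0 until n).map(idx => s"$prefix${idx + 1}")
}
```

A query that selects a generator's output by a hard-coded default name would stop resolving once the names shift by one, which is why an explicit alias on the generator output is the safer choice for callers.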
How was this patch tested?
Added unit test.