[SPARK-19887][SQL] dynamic partition keys can be null or empty string #17277
Test build #74452 has finished for PR 17277 at commit
uh... Like Hive, should we treat
    checkAnswer(spark.table("test"),
      Row(1, "p", 1) :: Row(2, null, 2) :: Row(3, null, 3) :: Nil)
  }
}
This test case only covers creating partitions. How about the partition-related DDL statements?
No, they don't work. For ALTER TABLE xxx ADD PARTITION(A=null), we interpret null as a string instead of a null value. We should fix it in a follow-up.
Another interesting issue documented in a Hive JIRA (https://issues.apache.org/jira/browse/HIVE-1309):
@gatorsmile I think the behavior of spark SQL should be
This PR doesn't fix all of them but I think it's a good start to deal with special partition values. I'll create more JIRA tickets after this.
Found a JIRA https://issues.apache.org/jira/browse/IMPALA-252 to explain how IMPALA handles it. Personally, I think what Impala proposed is reasonable. What do you think? @cloud-fan
Static partition keys may not be NULL or the empty string
Test build #74480 has finished for PR 17277 at commit
Test build #74482 has finished for PR 17277 at commit
  }
}

test(s"SPARK-19887 partition value is null - partition management $enabled") {
Nit: is null -> is null or empty.
LGTM I am fine to merge it now. Definitely, we have multiple holes to fully support it. For example, we need to revisit all the usage of
retest this please
Test build #74553 has finished for PR 17277 at commit
thanks for the review, merging to master/2.1!
When a dynamic partition value is null or an empty string, we should write the data to a directory like `a=__HIVE_DEFAULT_PARTITION__`. When we read the data back, we should respect this special directory name and treat it as null.
This is the same behavior as Impala; see https://issues.apache.org/jira/browse/IMPALA-252
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17277 from cloud-fan/partition.
(cherry picked from commit dacc382)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
// make sure partition pruning also works.
checkAnswer(spark.table("test").filter($"b".isNotNull), Row(1, "p", 1))

// empty string is an invalid partition value and we treat it as null when read back.
It looks weird that you read back something different from what you wrote; strictly speaking, "" and null are not the same. I would leave it to users to decide whether "" is read back as null.
I can't remember all the details as this PR is pretty old. This is probably the behavior of Hive so we just followed it.
Looking at it now, I agree it's not ideal to treat invalid partition values as null. We'd better fail earlier. Can we leave it as a known bug of v1 tables and fix it in v2?
Yeah, Hive cannot differentiate null and empty string in this case and we basically followed that for compatibility.
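The stricter alternative raised in this thread, failing early instead of reading an empty string back as null, could look roughly like the sketch below. `StrictPartitionValueSketch` and `validate` are hypothetical names used only for illustration; they are not Spark APIs, and this is not what the PR implements.

```scala
object StrictPartitionValueSketch {
  // Hypothetical fail-fast check for one partition column value.
  // None models SQL NULL, which stays valid; Some("") is rejected up front
  // so that a write can never be read back as a different value.
  def validate(column: String, value: Option[String]): Either[String, Option[String]] =
    value match {
      case Some("") => Left(s"Partition column '$column' has an empty string value")
      case other    => Right(other)
    }
}
```

Under this assumption, a null partition value would still be accepted and only the empty string would be reported as an error before any data is written.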
What changes were proposed in this pull request?
When a dynamic partition value is null or an empty string, we should write the data to a directory like a=__HIVE_DEFAULT_PARTITION__. When we read the data back, we should respect this special directory name and treat it as null.
This is the same behavior as Impala; see https://issues.apache.org/jira/browse/IMPALA-252
How was this patch tested?
new regression test
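A minimal sketch of the mapping this description proposes, written in plain Scala without any Spark dependency. `PartitionPathSketch`, `toDirName`, and `fromDirName` are hypothetical helpers for illustration only; the sketch assumes string-typed partition values and ignores escaping of special characters in paths.

```scala
object PartitionPathSketch {
  // The special directory value named in the PR description.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper: directory name for one dynamic partition column.
  // Both null and the empty string map to the default partition name.
  def toDirName(column: String, value: String): String = {
    val rendered =
      if (value == null || value.isEmpty) DefaultPartitionName else value
    s"$column=$rendered"
  }

  // Hypothetical helper: recover (column, value) from a directory name.
  // None models SQL NULL, so the default partition name reads back as null.
  def fromDirName(dir: String): (String, Option[String]) = {
    val idx = dir.indexOf('=')
    require(idx > 0, s"not a partition directory: $dir")
    val column = dir.substring(0, idx)
    val raw = dir.substring(idx + 1)
    (column, if (raw == DefaultPartitionName) None else Some(raw))
  }
}
```

Because null and the empty string share one directory name, the round trip is lossy: an empty string written through `toDirName` comes back from `fromDirName` as `None`, which is the behavior questioned later in the review thread.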