Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Mar 13, 2017

What changes were proposed in this pull request?

When dynamic partition value is null or empty string, we should write the data to a directory like a=__HIVE_DEFAULT_PARTITION__, when we read the data back, we should respect this special directory name and treat it as null.

This is the same behavior of impala, see https://issues.apache.org/jira/browse/IMPALA-252

How was this patch tested?

new regression test

@cloud-fan
Copy link
Contributor Author

cc @eric @mallman @liancheng

@SparkQA
Copy link

SparkQA commented Mar 13, 2017

Test build #74452 has finished for PR 17277 at commit 7b85c51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

uh... Like Hive, should we treat __HIVE_DEFAULT_PARTITION__ as a valid partition value? See the JIRA: https://issues.apache.org/jira/browse/HIVE-11208

checkAnswer(spark.table("test"),
Row(1, "p", 1) :: Row(2, null, 2) :: Row(3, null, 3) :: Nil)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case only covers creating partitions. How about the partition-related DDL statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no they don't work. ALTER TABLE xxx ADD PARTITION(A=null), we will interpret null as a string instead of a null value. We should fix it in the follow-up.

@gatorsmile
Copy link
Member

Another interesting issue documented in a Hive JIRA (https://issues.apache.org/jira/browse/HIVE-1309):

Currently if the dynamic partition column value is "bad" – null, empty string, etc., the row will be put into the HIVE_DEFAULT_PARTITION where the bad column value will be lost (replaced by HIVE_DEFAULT_PARTITION) if user select from that partition. It would be useful to put the bad record into an file specified by the user at DML/DDL time and the user can check the rows afterward.

@cloud-fan
Copy link
Contributor Author

@gatorsmile I think the behavior of spark SQL should be

  1. throw exception for invalid partition values(e.g. empty string), for both table write path(DataFrameWriter.saveAsTable) and data source write path DataFrameWriter.save
  2. null is a valid partition value.
  3. HIVE_DEFAULT_PARTITION is an invalid partition value

This PR doesn't fix all of them but I think it's a good start to deal with special partition values. I'll create more JIRA tickets after this.

@gatorsmile
Copy link
Member

gatorsmile commented Mar 14, 2017

Found a JIRA https://issues.apache.org/jira/browse/IMPALA-252 to explain how IMPALA handles it. Personally, I think what Impala proposed is reasonable. What do you think? @cloud-fan

Static partition keys may not be NULL or the empty string
So INSERT INTO TABLE tbl PARTITION(part="") SELECT ... will raise an error.
Dynamic partition keys may be empty or NULL
So INSERT INTO TABLE tbl PARTITION(part) SELECT ..., NULL will work.
Partitions with NULL or empty string keys are mapped to __HIVE_DEFAULT_PARTITION__
Whether the keys are NULL or "", both will be written to the same __HIVE_DEFAULT_PARTITION__ partition.
Values read from the partitioned column in partition HIVE_DEFAULT_PARTITION are mapped back to NULL
Here we deviate from Hive; Hive returns __HIVE_DEFAULT_PARTITION__ - even if the partition column is of integer type. This finally crosses the line of what we are willing to do to be compatible.
ALTER TABLE [ADD|DROP] will reject partitions with NULL or empty partition keys
You cannot create or delete default partitions manually.

@cloud-fan cloud-fan changed the title [SPARK-19887][SQL] null is a valid partition value [SPARK-19887][SQL] dynamic partition keys can be null or empty string Mar 14, 2017
@SparkQA
Copy link

SparkQA commented Mar 14, 2017

Test build #74480 has finished for PR 17277 at commit a04e7e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 14, 2017

Test build #74482 has finished for PR 17277 at commit 8896507.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test(s"SPARK-19887 partition value is null - partition management $enabled") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: is null -> is null or empty.

@gatorsmile
Copy link
Member

LGTM I am fine to merge it now. Definitely, we have multiple holes to fully support it. For example, we need to revisit all the usage of ExternalCatalogUtils.escapePathName.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Mar 14, 2017

Test build #74553 has finished for PR 17277 at commit 8896507.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Mar 15, 2017

thanks for the review, merging to master/2.1!

@asfgit asfgit closed this in dacc382 Mar 15, 2017
asfgit pushed a commit that referenced this pull request Mar 15, 2017
When dynamic partition value is null or empty string, we should write the data to a directory like `a=__HIVE_DEFAULT_PARTITION__`, when we read the data back, we should respect this special directory name and treat it as null.

This is the same behavior of impala, see https://issues.apache.org/jira/browse/IMPALA-252

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #17277 from cloud-fan/partition.

(cherry picked from commit dacc382)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
// make sure partition pruning also works.
checkAnswer(spark.table("test").filter($"b".isNotNull), Row(1, "p", 1))

// empty string is an invalid partition value and we treat it as null when read back.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks weird that you read back something different from what you wrote, "" and null are not the same strictly speaking. I would leave users to decide that "" is read back as null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember all the details as this PR is pretty old. This is probably the behavior of Hive so we just followed it.

Looking at it now, I agree it's not ideal to treat invalid partition values as null. We'd better fail earlier. Can we leave it as a known bug of v1 table and fix it in v2?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Hive cannot differentiate null and empty string in this case and we basically followed that for compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants