[SPARK-19887][SQL] dynamic partition keys can be null or empty string #17277

cloud-fan · 2017-03-13T16:15:40Z

What changes were proposed in this pull request?

When a dynamic partition value is null or an empty string, we should write the data to a directory like `a=__HIVE_DEFAULT_PARTITION__`. When we read the data back, we should respect this special directory name and treat it as null.

This matches Impala's behavior; see https://issues.apache.org/jira/browse/IMPALA-252
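
For readers who want the round trip spelled out, here is a minimal sketch of the mapping described above. It is not the code in this patch: the object and method names are hypothetical, `Option[String]` stands in for a nullable partition value, and path escaping is left out entirely.

```scala
// Hypothetical sketch of the directory-name mapping; not Spark's implementation.
object DefaultPartitionSketch {
  // Sentinel directory value named in the PR description.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  /** Write path: null (None) and "" both map to the sentinel directory. */
  def toDirName(column: String, value: Option[String]): String = {
    val encoded = value.filter(_.nonEmpty).getOrElse(DefaultPartitionName)
    s"$column=$encoded"
  }

  /** Read path: the sentinel directory is read back as null (None). */
  def fromDirName(dir: String): (String, Option[String]) = {
    val idx = dir.indexOf('=')
    require(idx > 0, s"not a partition directory: $dir")
    val column = dir.substring(0, idx)
    val raw = dir.substring(idx + 1)
    (column, if (raw == DefaultPartitionName) None else Some(raw))
  }
}
```

Note that under this mapping an empty string written on the way in comes back as null, since both share one directory.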

How was this patch tested?

Added a new regression test.

cloud-fan · 2017-03-13T16:16:15Z

cc @eric @mallman @liancheng

SparkQA · 2017-03-13T19:30:06Z

Test build #74452 has finished for PR 17277 at commit 7b85c51.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-13T21:03:22Z

uh... Like Hive, should we treat __HIVE_DEFAULT_PARTITION__ as a valid partition value? See the JIRA: https://issues.apache.org/jira/browse/HIVE-11208

gatorsmile · 2017-03-13T21:05:04Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala

+        checkAnswer(spark.table("test"),
+          Row(1, "p", 1) :: Row(2, null, 2) :: Row(3, null, 3) :: Nil)
+      }
+    }


This test case only covers creating partitions. How about the partition-related DDL statements?

cloud-fan · 2017-03-14

No, they don't work. For `ALTER TABLE xxx ADD PARTITION(A=null)`, we interpret `null` as a string instead of a null value. We should fix it in a follow-up.

gatorsmile · 2017-03-13T21:15:59Z

Another interesting issue documented in a Hive JIRA (https://issues.apache.org/jira/browse/HIVE-1309):

Currently if the dynamic partition column value is "bad" – null, empty string, etc., the row will be put into the `__HIVE_DEFAULT_PARTITION__` where the bad column value will be lost (replaced by `__HIVE_DEFAULT_PARTITION__`) if the user selects from that partition. It would be useful to put the bad record into a file specified by the user at DML/DDL time and the user can check the rows afterward.

cloud-fan · 2017-03-14T00:47:45Z

@gatorsmile I think the behavior of Spark SQL should be:

- Throw an exception for invalid partition values (e.g. empty string), for both the table write path (`DataFrameWriter.saveAsTable`) and the data source write path (`DataFrameWriter.save`).
- `null` is a valid partition value.
- `__HIVE_DEFAULT_PARTITION__` is an invalid partition value.

This PR doesn't fix all of them but I think it's a good start to deal with special partition values. I'll create more JIRA tickets after this.
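
To make the three proposed rules concrete, a rough sketch of such a check might look like the following. This is only an illustration of the proposal in this comment, not what the PR implements; `PartitionValueCheckSketch` and `validate` are hypothetical names, and `None` models SQL NULL.

```scala
// Hypothetical validator for the three proposed rules; not part of this PR.
object PartitionValueCheckSketch {
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  /** Returns the value unchanged if it is valid, otherwise throws. */
  def validate(column: String, value: Option[String]): Option[String] = value match {
    // null is a valid partition value
    case None => None
    // the empty string is an invalid partition value
    case Some("") =>
      throw new IllegalArgumentException(
        s"Partition column '$column' has an empty string value")
    // the sentinel name itself is an invalid partition value
    case Some(DefaultPartitionName) =>
      throw new IllegalArgumentException(
        s"Partition column '$column' uses the reserved value $DefaultPartitionName")
    case ok => ok
  }
}
```

The same check would need to be called from both write paths for the behavior to be consistent.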

gatorsmile · 2017-03-14T02:17:40Z

Found a JIRA, https://issues.apache.org/jira/browse/IMPALA-252, that explains how Impala handles it. Personally, I think what Impala proposed is reasonable. What do you think? @cloud-fan

- Static partition keys may not be NULL or the empty string
  - So `INSERT INTO TABLE tbl PARTITION(part="") SELECT ...` will raise an error.
- Dynamic partition keys may be empty or NULL
  - So `INSERT INTO TABLE tbl PARTITION(part) SELECT ..., NULL` will work.
- Partitions with NULL or empty string keys are mapped to `__HIVE_DEFAULT_PARTITION__`
  - Whether the keys are NULL or "", both will be written to the same `__HIVE_DEFAULT_PARTITION__` partition.
- Values read from the partitioned column in partition `__HIVE_DEFAULT_PARTITION__` are mapped back to NULL
  - Here we deviate from Hive; Hive returns `__HIVE_DEFAULT_PARTITION__` - even if the partition column is of integer type. This finally crosses the line of what we are willing to do to be compatible.
- `ALTER TABLE [ADD|DROP]` will reject partitions with NULL or empty partition keys
  - You cannot create or delete default partitions manually.
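
The static versus dynamic distinction in the rules quoted above can be sketched roughly as follows. This is an illustrative model only, assuming `Option[String]` for a nullable key; `PartitionKeyRulesSketch`, `KeyKind`, `resolve`, and `readBack` are hypothetical names, not Impala or Spark APIs.

```scala
// Hypothetical model of the quoted rules; not Impala's or Spark's code.
object PartitionKeyRulesSketch {
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  sealed trait KeyKind
  case object Static extends KeyKind  // e.g. PARTITION(part="x") or ALTER TABLE ADD/DROP
  case object Dynamic extends KeyKind // e.g. PARTITION(part) SELECT ...

  /** Returns the partition name to write to, or an error message. */
  def resolve(kind: KeyKind, value: Option[String]): Either[String, String] =
    value.filter(_.nonEmpty) match {
      case Some(v) => Right(v)
      case None => kind match {
        // NULL and "" share the default partition for dynamic keys
        case Dynamic => Right(DefaultPartitionName)
        // static keys (and manual ADD/DROP) reject NULL and ""
        case Static => Left("static partition keys may not be NULL or the empty string")
      }
    }

  /** Read path: the default partition is mapped back to NULL (None). */
  def readBack(partitionName: String): Option[String] =
    if (partitionName == DefaultPartitionName) None else Some(partitionName)
}
```

Returning `Either` rather than throwing keeps the sketch easy to test; a real analyzer would more likely raise an analysis error.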

SparkQA · 2017-03-14T05:33:42Z

Test build #74480 has finished for PR 17277 at commit a04e7e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-14T05:40:02Z

Test build #74482 has finished for PR 17277 at commit 8896507.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-14T19:52:52Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala

      }
    }
+
+    test(s"SPARK-19887 partition value is null - partition management $enabled") {


Nit: is null -> is null or empty.

gatorsmile · 2017-03-14T19:55:10Z

LGTM. I am fine with merging it now. We definitely still have multiple gaps to close before this is fully supported. For example, we need to revisit all usages of `ExternalCatalogUtils.escapePathName`.

gatorsmile · 2017-03-14T21:01:37Z

retest this please

SparkQA · 2017-03-14T23:18:39Z

Test build #74553 has finished for PR 17277 at commit 8896507.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-03-15T00:24:57Z

thanks for the review, merging to master/2.1!

When dynamic partition value is null or empty string, we should write the data to a directory like `a=__HIVE_DEFAULT_PARTITION__`, when we read the data back, we should respect this special directory name and treat it as null.

This is the same behavior of impala, see https://issues.apache.org/jira/browse/IMPALA-252

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #17277 from cloud-fan/partition.

(cherry picked from commit dacc382)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

MaxGekk · 2020-11-29T19:03:42Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala

+        // make sure partition pruning also works.
+        checkAnswer(spark.table("test").filter($"b".isNotNull), Row(1, "p", 1))
+
+        // empty string is an invalid partition value and we treat it as null when read back.


It looks odd that you read back something different from what you wrote; strictly speaking, "" and null are not the same. I would leave it to users to decide whether "" is read back as null.

cloud-fan · 2020-11-30

I can't remember all the details, as this PR is pretty old. This is probably Hive's behavior, so we just followed it.

Looking at it now, I agree it's not ideal to treat invalid partition values as null. We'd better fail earlier. Can we leave it as a known bug of v1 tables and fix it in v2?

liancheng · 2021-01-13

Yeah, Hive cannot differentiate between null and the empty string in this case, and we basically followed that for compatibility.

null is a valid partition value

7b85c51

gatorsmile reviewed Mar 13, 2017


cloud-fan changed the title ~~[SPARK-19887][SQL] null is a valid partition value~~ [SPARK-19887][SQL] dynamic partition keys can be null or empty string Mar 14, 2017

handle empty string

8896507

cloud-fan force-pushed the partition branch from a04e7e5 to 8896507 on March 14, 2017 03:24

gatorsmile reviewed Mar 14, 2017


asfgit closed this in dacc382 Mar 15, 2017

MaxGekk reviewed Nov 29, 2020


peasee mentioned this pull request Oct 11, 2025

fix: Ensure ListingTable partitions are pruned when filters are not used apache/datafusion#17958

Merged

alamb mentioned this pull request Oct 15, 2025

ListingTable handling of missing partition values apache/datafusion#18083

Open
