[SPARK-15743][SQL] Prevent saving with all-column partitioning #13486
Conversation
Test build #59902 has finished for PR 13486 at commit

Test build #59977 has finished for PR 13486 at commit
One little concern: if it is added here, should the method name be changed? After all, it will do more than validate data types after the change.
Thank you for the attention, @wangyang1992. Good point!
Maybe `validatePartitionColumnDataTypes` -> `validatePartitionColumnDataTypesAndCount`?
Yeah, I think it's better.
Then, let's change it. :)
Since `PartitioningUtils` is `private[sql]`, it's safe to change.
I'll update this PR. Thank you for your review and idea!
Test build #59986 has finished for PR 13486 at commit

Test build #59987 has finished for PR 13486 at commit

Test build #60024 has finished for PR 13486 at commit
Hi, @marmbrus.
```diff
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)

- def validatePartitionColumnDataTypes(
+ def validatePartitionColumnDataTypesAndCount(
```
How about just `validatePartitionColumns`?
Thank you for the review, @marmbrus.
That sounds better. I'll update it.
Test build #60193 has finished for PR 13486 at commit

Test build #60203 has finished for PR 13486 at commit
Hi, @marmbrus.

Hi, @marmbrus.

Merging to master and 2.0
## What changes were proposed in this pull request?
When saving datasets to storage, `partitionBy` provides an easy way to construct the directory structure. However, if a user chooses all columns as partition columns, exceptions occur.
- **ORC with all-column partitioning**: `AnalysisException` on **future read** due to schema inference failure.
```scala
scala> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data")
scala> spark.read.format("orc").load("/tmp/data").collect()
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /tmp/data. It must be specified manually;
```
- **Parquet with all-column partitioning**: `InvalidSchemaException` on **write execution** due to Parquet limitation.
```scala
scala> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data")
[Stage 0:> (0 + 8) / 8]16/06/02 16:51:17 ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema
... (lots of error messages)
```
Both failures share the same root cause: once every column is moved into the directory path, the data files themselves have no fields left. Although some formats such as JSON support all-column partitioning without any problem, the result is just a tree of directories holding empty-schema files, which is hardly what a user intends.
This PR prevents saving with all-column partitioning by consistently raising an `AnalysisException` before executing the save operation.
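For illustration, a minimal sketch of such a guard (simplified and hypothetical; the real method lives in `PartitioningUtils` and also validates data types — only the method name and error message come from the discussion and test in this PR):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.StructType

// Sketch: fail fast when the partition columns cover the entire schema,
// so the error surfaces before any write job is launched.
def validatePartitionColumns(schema: StructType, partitionColumns: Seq[String]): Unit = {
  if (partitionColumns.nonEmpty && partitionColumns.size == schema.fields.length) {
    throw new AnalysisException("Cannot use all columns for partition columns")
  }
  // ... the pre-existing per-column data type validation would follow here
}
```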
## How was this patch tested?
Newly added `PartitioningUtilsSuite`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13486 from dongjoon-hyun/SPARK-15743.
(cherry picked from commit 2413fce)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Thank you, @marmbrus!
@dongjoon-hyun There is a corner case that this check does not handle: when there are zero columns, the error is very non-intuitive. See https://issues.apache.org/jira/browse/SPARK-16006. Can you handle this case in another PR? Refer me in the PR if you make one.

Oh, I see. I will fix it tonight.
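A hypothetical repro of that zero-column corner case (the construction and path are illustrative; see SPARK-16006 for the actual report):

```scala
// A DataFrame with zero columns: select() with no arguments keeps no fields.
val empty = spark.range(10).select()

// Writing it fails, but with an error that does not point at the real
// problem (there are no fields left to write).
empty.write.format("parquet").mode("overwrite").save("/tmp/zero-cols")
```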
```scala
  spark.range(10).write.format("parquet").mode("overwrite").partitionBy("id").save(path)
}
intercept[AnalysisException] {
  spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
```
This test case is wrong, right? The exception message will be something like "The ORC data source must be used with Hive support enabled". To test this, we need to move this test to another suite.
Will fix it in my ongoing PR. My suggestion is to always verify the error message, if possible.
Yep. That will be fixed in #13730.
Oh, I mean there is already an ongoing PR.
And thank you for the advice!
I am fine if you want to fix it. You can put it in `OrcSuite`, then run the following command to verify whether it passes:

```
build/sbt -Phive "hive/test-only org.apache.spark.sql.hive.orc.OrcSourceSuite"
```
test("prevent all column partitioning") {
withTempDir { dir =>
val path = dir.getCanonicalPath
val e = intercept[AnalysisException] {
spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save(path)
}.getMessage
assert(e.contains("Cannot use all columns for partition columns"))
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what you mean. I'm going to exclude the ORC cases in #13730.
So, for `OrcSuite`, you can do that; I'd really appreciate it.
I am trying to add more edge cases in the ongoing PR to improve the test coverage of `DataFrameReader` and `DataFrameWriter`. I will include this too. Thanks!