[SPARK-19016][SQL][DOC] Document scalable partition handling #16424
Conversation
@ericl @mallman and @cloud-fan

Test build #70681 has finished for PR 16424 at commit
docs/sql-programming-guide.md
Outdated
however, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
and location of the external table as its value (String) when saving the table with `saveAsTable`. When an External table
Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`,
however. This functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
were you going to remove however ?
No...
Should it say: However, this functionality?
docs/sql-programming-guide.md
Outdated
`DataFrames` can also be saved as persistent tables into Hive metastore using the `saveAsTable`
command. Notice existing Hive deployment is not necessary to use this feature. Spark will create a
command. Notice that existing Hive deployment is not necessary to use this feature. Spark will create a
nit: an existing Hive deployment
docs/sql-programming-guide.md
Outdated
Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:

- Since full information of all partitions can be retrieved from metastore, excessive partition discovery is no longer needed. This greatly saves query planning time for partitioned tables with a large number of partitions.
nit: Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
docs/sql-programming-guide.md
Outdated
- Since full information of all partitions can be retrieved from metastore, excessive partition discovery is no longer needed. This greatly saves query planning time for partitioned tables with a large number of partitions.
- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.

Note that partition information is not gathered by default when creating an external datasource tables (those with a `path` option). You may want to invoke `MSCK REPAIR TABLE` to trigger partition discovery and persist per-partition information into metastore before querying a created external table.
s/an external/external
To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
@ericl @CodingCat Thanks for the review! Fixed per your comments.
499213e to 99337cd
Test build #70728 has finished for PR 16424 at commit

Test build #70729 has finished for PR 16424 at commit

LGTM, just one comment

Test build #70732 has finished for PR 16424 at commit

OK, I'm merging this to master and branch-2.1. Thanks for the review!
What changes were proposed in this pull request?

This PR documents the scalable partition handling feature in the body of the programming guide. Before this PR, we only mention it in the migration guide. It's not super clear that external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted since 2.1.

How was this patch tested?

N/A.

Author: Cheng Lian <lian@databricks.com>
Closes #16424 from liancheng/scalable-partition-handling-doc.
(cherry picked from commit 871f611)
Signed-off-by: Cheng Lian <lian@databricks.com>