
Conversation

@liancheng
Contributor

What changes were proposed in this pull request?

This PR documents the scalable partition handling feature in the body of the programming guide.

Before this PR, we only mention it in the migration guide. It's not super clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.

How was this patch tested?

N/A.

@yhuai
Contributor

yhuai commented Dec 28, 2016

@ericl @mallman and @cloud-fan
want to take a look?

@SparkQA

SparkQA commented Dec 28, 2016

Test build #70681 has finished for PR 16424 at commit 08c9d20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`,
- however, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
+ however. This functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
and location of the external table as its value (String) when saving the table with `saveAsTable`. When an External table
Contributor


were you going to remove however ?

Contributor Author


No...

Contributor


Should it say: However, this functionality?
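For reference, the workaround described in the passage under review can be sketched as follows. This is a minimal illustration, not code from the PR; the DataFrame source, table name, and paths are all hypothetical:

```scala
// Save a DataFrame as an "external" table by supplying a `path` option;
// the data is written to the given location instead of the default
// warehouse directory. All names and paths here are hypothetical.
val df = spark.read.json("/data/raw/events")
df.write
  .option("path", "/warehouse/external/events")  // external table location
  .saveAsTable("events")                         // registers the table in the metastore
```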


`DataFrames` can also be saved as persistent tables into Hive metastore using the `saveAsTable`
- command. Notice existing Hive deployment is not necessary to use this feature. Spark will create a
+ command. Notice that existing Hive deployment is not necessary to use this feature. Spark will create a
Contributor


nit: an existing Hive deployment


Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:

- Since full information of all partitions can be retrieved from metastore, excessive partition discovery is no longer needed. This greatly saves query planning time for partitioned tables with a large number of partitions.
Contributor


nit: Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.

- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.
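As a sketch of the second benefit, such a DDL can now be issued against a datasource table. The table name, partition column, and location below are hypothetical, assuming a partitioned datasource table `events` with a partition column `dt` already exists in the metastore:

```scala
// Hypothetical example: relocate one partition of a datasource table.
// With per-partition metadata in the metastore, Hive DDLs like this now work.
spark.sql("""
  ALTER TABLE events PARTITION (dt = '2016-12-28')
  SET LOCATION '/warehouse/external/events/dt=2016-12-28'
""")
```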

Note that partition information is not gathered by default when creating an external datasource tables (those with a `path` option). You may want to invoke `MSCK REPAIR TABLE` to trigger partition discovery and persist per-partition information into metastore before querying a created external table.
Contributor


s/an external/external

To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
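To illustrate the note above end to end, a minimal sketch (table name and path are hypothetical, and the directory is assumed to hold Parquet data in partition subdirectories such as `dt=2016-12-28/`):

```scala
// Hypothetical example: create an external datasource table over existing data.
spark.sql("CREATE TABLE events USING parquet OPTIONS (path '/warehouse/external/events')")

// Partition information is not gathered at creation time; repair the table
// to discover partitions and persist them into the metastore.
spark.sql("MSCK REPAIR TABLE events")

// Queries can now prune partitions using the persisted metadata.
spark.sql("SELECT count(*) FROM events WHERE dt = '2016-12-28'")
```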

@liancheng
Contributor Author

@ericl @CodingCat Thanks for the review! Fixed per your comments.

@liancheng liancheng force-pushed the scalable-partition-handling-doc branch from 499213e to 99337cd Compare December 29, 2016 21:22
@SparkQA

SparkQA commented Dec 29, 2016

Test build #70728 has finished for PR 16424 at commit 499213e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70729 has finished for PR 16424 at commit 99337cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl
Contributor

ericl commented Dec 29, 2016

LGTM, just one comment

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70732 has finished for PR 16424 at commit dce40b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

OK, I'm merging this to master and branch-2.1. Thanks for the review!

@asfgit asfgit closed this in 871f611 Dec 30, 2016
asfgit pushed a commit that referenced this pull request Dec 30, 2016

Author: Cheng Lian <lian@databricks.com>

Closes #16424 from liancheng/scalable-partition-handling-doc.

(cherry picked from commit 871f611)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng liancheng deleted the scalable-partition-handling-doc branch December 30, 2016 22:52
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 1, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017