
Conversation

@liancheng
Contributor

What changes were proposed in this pull request?

This PR documents the scalable partition handling feature in the body of the programming guide.

Before this PR, we only mention it in the migration guide. It's not super clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.

How was this patch tested?

N/A.

@yhuai
Contributor

yhuai commented Dec 28, 2016

@ericl @mallman and @cloud-fan
want to take a look?

@SparkQA

SparkQA commented Dec 28, 2016

Test build #70681 has finished for PR 16424 at commit 08c9d20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`,
- however, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
+ however. This functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
and location of the external table as its value (String) when saving the table with `saveAsTable`. When an External table
Contributor


were you going to remove however ?

Contributor Author


No...

Contributor


Should it say: However, this functionality?
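For reference, the workaround described in the passage under review can be sketched as follows. This is a minimal illustration, not code from the PR; the DataFrame source, table name, and paths are all hypothetical:

```scala
// Save a DataFrame as an "external" table by supplying a `path` option;
// the data is written to the given location instead of the default
// warehouse directory. All names and paths here are hypothetical.
val df = spark.read.json("/data/raw/events")
df.write
  .option("path", "/warehouse/external/events")  // external table location
  .saveAsTable("events")                         // registers the table in the metastore
```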


`DataFrames` can also be saved as persistent tables into Hive metastore using the `saveAsTable`
- command. Notice existing Hive deployment is not necessary to use this feature. Spark will create a
+ command. Notice that existing Hive deployment is not necessary to use this feature. Spark will create a
Contributor


nit: an existing Hive deployment


Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:

- Since full information of all partitions can be retrieved from metastore, excessive partition discovery is no longer needed. This greatly saves query planning time for partitioned tables with a large number of partitions.
Contributor


nit: Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.

- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.
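As a sketch of the second benefit, such a DDL can now be issued against a datasource table. The table name, partition column, and location below are hypothetical, assuming a partitioned datasource table `events` with a partition column `dt` already exists in the metastore:

```scala
// Hypothetical example: relocate one partition of a datasource table.
// With per-partition metadata in the metastore, Hive DDLs like this now work.
spark.sql("""
  ALTER TABLE events PARTITION (dt = '2016-12-28')
  SET LOCATION '/warehouse/external/events/dt=2016-12-28'
""")
```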

Note that partition information is not gathered by default when creating an external datasource tables (those with a `path` option). You may want to invoke `MSCK REPAIR TABLE` to trigger partition discovery and persist per-partition information into metastore before querying a created external table.
Contributor


s/an external/external

To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
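To illustrate the note above end to end, a minimal sketch (table name and path are hypothetical, and the directory is assumed to hold Parquet data in partition subdirectories such as `dt=2016-12-28/`):

```scala
// Hypothetical example: create an external datasource table over existing data.
spark.sql("CREATE TABLE events USING parquet OPTIONS (path '/warehouse/external/events')")

// Partition information is not gathered at creation time; repair the table
// to discover partitions and persist them into the metastore.
spark.sql("MSCK REPAIR TABLE events")

// Queries can now prune partitions using the persisted metadata.
spark.sql("SELECT count(*) FROM events WHERE dt = '2016-12-28'")
```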

@liancheng
Contributor Author

@ericl @CodingCat Thanks for the review! Fixed per your comments.

@liancheng liancheng force-pushed the scalable-partition-handling-doc branch from 499213e to 99337cd Compare December 29, 2016 21:22
@SparkQA

SparkQA commented Dec 29, 2016

Test build #70728 has finished for PR 16424 at commit 499213e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70729 has finished for PR 16424 at commit 99337cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl
Contributor

ericl commented Dec 29, 2016

LGTM, just one comment

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70732 has finished for PR 16424 at commit dce40b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

OK, I'm merging this to master and branch-2.1. Thanks for the review!

@asfgit asfgit closed this in 871f611 Dec 30, 2016
asfgit pushed a commit that referenced this pull request Dec 30, 2016

Author: Cheng Lian <lian@databricks.com>

Closes #16424 from liancheng/scalable-partition-handling-doc.

(cherry picked from commit 871f611)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng liancheng deleted the scalable-partition-handling-doc branch December 30, 2016 22:52
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 1, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017