[SPARK-16311][SQL] Improve metadata refresh #13989
Conversation
cc @rxin

cc @cloud-fan / @liancheng
Before, I tried to merge invalidateTable and refreshTable, but I think maybe we can keep them separate?
```scala
 * @group action
 * @since 2.0.0
 */
def refresh(): Unit = {
```
It will remove the cached data. This is different from what JIRA describes. CC @rxin
Other refresh methods also remove cached data, so I thought this would be better.
This new API behaves differently from the refreshTable API and the REFRESH TABLE SQL statement. See the following code:
spark/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala
Lines 349 to 374 in 02a029d
```scala
/**
 * Refresh the cache entry for a table, if any. For Hive metastore table, the metadata
 * is refreshed.
 *
 * @group cachemgmt
 * @since 2.0.0
 */
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  sessionCatalog.refreshTable(tableIdent)
  // If this table is cached as an InMemoryRelation, drop the original
  // cached version and make the new version cached lazily.
  val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
  // Use lookupCachedData directly since RefreshTable also takes databaseName.
  val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
  if (isCached) {
    // Create a data frame to represent the table.
    // TODO: Use uncacheTable once it supports database name.
    val df = Dataset.ofRows(sparkSession, logicalPlan)
    // Uncache the logicalPlan.
    sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
    // Cache it again.
    sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
  }
}
```
IMO, if we use the word refresh, we have to make them consistent.
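For reference, these are the two existing refresh entry points whose semantics the new Dataset-level method would have to match (the table name below is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Existing entry points; "my_db.my_table" is a placeholder name.
spark.catalog.refreshTable("my_db.my_table")   // Catalog API
spark.sql("REFRESH TABLE my_db.my_table")      // SQL statement
```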
ah ic - we can't unpersist.
We can unpersist, but should persist it again immediately.
Actually we can and should call unpersist, but we should also call persist()/cache() again so that the Dataset will be cached lazily again with correct data when it gets executed next time. I guess that's also what @gatorsmile meant.
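In Dataset terms, the caching behavior being suggested looks roughly like the sketch below (illustrative only, not code from this patch; `recache` is a made-up helper name):

```scala
import org.apache.spark.sql.Dataset

// Sketch of the suggested pattern: after the metadata has been refreshed,
// eagerly drop the stale cached data, then mark the Dataset for caching
// again so it is re-materialized lazily with correct data on the next action.
def recache[T](ds: Dataset[T]): Unit = {
  ds.unpersist(blocking = true)  // drop the old cached blocks now
  ds.cache()                     // cache lazily again; rebuilt on the next action
}
```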
The test cases are not enough to cover metadata refreshing. The current metadata cache is only used for data source tables, but we can still convert Hive tables to data source tables, for example Parquet and ORC, so we also need to check the behavior of those cases. Try to design more test cases for metadata refreshing, including both positive and negative cases.
What do you mean by both positive and negative cases?
For example, I try to refresh the metadata of a DataFrame that has multiple leaf nodes of ... Update: just corrected the contents.
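On the Hive-to-data-source conversion mentioned above: for Parquet-backed Hive tables the conversion is controlled by a session conf, so a test for that path would toggle it explicitly. A sketch with a placeholder table name (ORC has an analogous conf; only the Parquet one is shown here):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Read Parquet-backed Hive tables through the data source path, which is
// where the metadata (file listing) cache under discussion is used.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

// "some_hive_parquet_table" is a placeholder: refresh it after its files
// change, then verify that a query sees the new contents.
spark.sql("REFRESH TABLE some_hive_parquet_table")
```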
Test build #61524 has finished for PR 13989 at commit
```scala
}

/**
 * Invalidates any metadata cached in the plan recursively.
```
"Refreshes" instead of "Invalidates"?
Would this work? Traverse the logical plan to find whether it references any catalog relation, and if it does, call catalog.refreshTable("...")? For example ...
One concern of mine is that the analyzed plan, optimized plan, and executed (physical) plan stored in QueryExecution are lazy vals. Say we constructed a DataFrame over a table and ran it; next, we add a bunch of files into the directory where that table lives. The plans already computed for that DataFrame will not pick up the new files.
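A small illustration of this concern (the path is made up; the point is that the plans hang off the Dataset's QueryExecution and are computed only once):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.read.parquet("/tmp/events")   // hypothetical path

df.count()  // forces analysis, optimization and physical planning

// These plans are memoized on df.queryExecution once computed, so a later
// metadata refresh does not rebuild them for this particular DataFrame.
val analyzed  = df.queryExecution.analyzed
val optimized = df.queryExecution.optimizedPlan
val physical  = df.queryExecution.executedPlan
```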
In general, I think reconstructing a DataFrame/Dataset or using ...
I think @liancheng has a good point. Why don't we take out Dataset.refresh() for now?
Alright, I will do that and submit a new pull request. Note that I think the data frame refresh is already possible via table refresh, if a data frame references a table, or if some view references a data frame.
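Concretely, that workaround looks like this with public APIs (table and view names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A DataFrame that references a catalog table...
val df = spark.table("sales").where("amount > 0")

// ...can effectively be refreshed by refreshing the table it reads from.
spark.catalog.refreshTable("sales")

// And a temporary view defined over a DataFrame can be refreshed by name,
// which is what this change makes work.
df.createOrReplaceTempView("positive_sales")
spark.sql("REFRESH TABLE positive_sales")
```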
[SPARK-16311][SQL] Improve metadata refresh
Test build #3159 has finished for PR 13989 at commit
## What changes were proposed in this pull request?

This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on #13989, but removes the public Dataset.refresh() API and improves test coverage. Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution).

## How was this patch tested?

Re-enabled a previously ignored test, and added a new test suite for Hive that tests the behavior of temporary views against MetastoreRelation.

Author: Reynold Xin <rxin@databricks.com>
Author: petermaxlee <petermaxlee@gmail.com>

Closes #14009 from rxin/SPARK-16311.
(cherry picked from commit 16a2a7d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
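The "just create a new QueryExecution" alternative mentioned in the commit message above could look roughly like the sketch below. Dataset.ofRows is package-private, so this would have to live inside the org.apache.spark.sql package, just like the refreshTable implementation quoted earlier; it is illustrative, not code from either PR.

```scala
package org.apache.spark.sql

// Sketch: rebuild a Dataset from its logical plan so that analysis,
// optimization and physical planning are redone against fresh metadata.
object RefreshSketch {
  def rebuilt(spark: SparkSession, df: DataFrame): DataFrame =
    Dataset.ofRows(spark, df.queryExecution.logical)
}
```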
What changes were proposed in this pull request?
This patch implements the 3 things specified in SPARK-16311:
(1) Append a message to the FileNotFoundException saying that a workaround is to explicitly refresh the metadata (see the sketch after this list).
(2) Make metadata refresh work on temporary tables/views.
(3) Make metadata refresh work on Datasets/DataFrames, by introducing a Dataset.refresh() method.
And one additional small change:
(4) Merge invalidateTable and refreshTable.
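A sketch of what item (1) amounts to in code; the wrapper and the message wording are illustrative, not the exact code in this patch:

```scala
import java.io.FileNotFoundException

// Wrap a scan so a missing-file error carries a hint about the explicit
// metadata-refresh workaround. Illustrative only.
def withRefreshHint[T](body: => T): T = {
  try body catch {
    case e: FileNotFoundException =>
      throw new FileNotFoundException(
        e.getMessage +
          "\nIt is possible the underlying files have been updated. " +
          "You can explicitly invalidate the cache by running " +
          "'REFRESH TABLE tableName' in SQL or by recreating the Dataset/DataFrame involved.")
  }
}
```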
How was this patch tested?
Created a new test suite that creates a temporary directory and then deletes a file from it to verify Spark can read the directory once refresh is called.
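A minimal sketch of that scenario (not the actual test suite from this patch; the directory, view name, and use of REFRESH TABLE on a temporary view assume the behavior this patch enables):

```scala
import java.io.File
import org.apache.spark.sql.SparkSession

object RefreshAfterDeleteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val dir = new File(java.nio.file.Files.createTempDirectory("refresh-test").toFile, "data")

    // Write with several partitions so the directory contains multiple files.
    spark.range(0, 100, 1, 4).write.parquet(dir.getPath)

    spark.read.parquet(dir.getPath).createOrReplaceTempView("tmp_view")
    spark.table("tmp_view").count()   // materializes the cached file listing

    // Delete one data file behind Spark's back.
    dir.listFiles().filter(_.getName.startsWith("part-")).head.delete()

    // Without a refresh, the stale listing can lead to FileNotFoundException;
    // after REFRESH TABLE the view is re-resolved against the current files.
    spark.sql("REFRESH TABLE tmp_view")
    println(spark.table("tmp_view").count())
  }
}
```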