[SPARK-28301][SQL] fix the behavior of table name resolution with multi-catalog #25077
Conversation
This is not needed anymore because in #24741 we delay the error reporting of unresolved relation to CheckAnalysis
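The effect of deferring that error can be sketched as follows. This is a hypothetical, simplified model in Python — none of these class or function names are Spark's actual API; it only illustrates the idea that the resolution rules leave unknown relations unresolved, and a later CheckAnalysis-style pass turns them into user-facing errors.

```python
# Hypothetical sketch of deferred error reporting (not Spark code).

class UnresolvedRelation:
    def __init__(self, name):
        self.name = name
        self.resolved = False

class ResolvedRelation:
    def __init__(self, name):
        self.name = name
        self.resolved = True

def resolve_relations(plan, known_tables):
    # Best-effort resolution: unknown tables are left as-is, not rejected here.
    return [ResolvedRelation(r.name) if r.name in known_tables else r
            for r in plan]

def check_analysis(plan):
    # Final gate: only now do unresolved relations become user-facing errors.
    for r in plan:
        if not r.resolved:
            raise Exception(f"Table or view not found: {r.name}")

plan = [UnresolvedRelation("t1"), UnresolvedRelation("t2")]
plan = resolve_relations(plan, known_tables={"t1"})
try:
    check_analysis(plan)
except Exception as e:
    print(e)  # Table or view not found: t2
```

The point is that intermediate rules no longer need their own "table not found" branches; they can simply decline to resolve a name and let the final check report it.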
This test demonstrates the expected behavior after this fix well.
Test build #107354 has finished for PR 25077 at commit
retest this please
Test build #107355 has finished for PR 25077 at commit
Thank you for pinging me, @cloud-fan. The failure seems to be the same one.
Retest this please.
/cc @brkyvz @jose-torres
Test build #107363 has finished for PR 25077 at commit
-1 I agree with the idea behind this PR, which is that the v2 catalog default should be applied the same way in all cases... that's why I fixed this in #24768. From that PR's description:
And from discussion:
While this PR includes some additional fixes for handling temporary tables, I don't understand why this was necessary when another PR is proposing the same behavior change and could be updated to fix those issues as well.

There are also significant problems with this alternative implementation. It reverses choices that were made to minimize the risk to existing users of adding multi-catalog support to Spark 3.0. The plan is to change as few parts of v1 as possible to ensure its behavior does not change, but this PR makes the v1 … Similarly, …

To sum it up, this appears to be proposing a different implementation, not different behavior. But it is quite invasive and different from the design that we have been building up to now. Some of these changes may be good ideas, but let's separate them into their own PRs and discuss the merits of each. If you want to change the design to add a proxy catalog, then please bring it up at the sync or write up a proposal.
The reasons that I opened this PR:
- The …

Thanks for catching this bug! I feel it's cleaner to call …

Note that the main point of this PR is to trigger discussion and reach a consensus about the behavior of table name resolution, including all the details. Once we reach a consensus, we can merge your PR first as long as it matches the expected behavior (you can take some tests from this PR), and rebase my PR as a refactor.
Test build #107387 has finished for PR 25077 at commit
To avoid future confusion, I've reverted the changes in …
Test build #107397 has finished for PR 25077 at commit
retest this please
Test build #107399 has finished for PR 25077 at commit
Test build #107404 has finished for PR 25077 at commit
Usually, I would expect you to point out problems with a PR in a review instead of opening a separate PR that doesn't mention the original. In the future, just coordinate about how to get the work done.
That PR solves these two at the same time because the default catalog was introduced so that v2 providers could be used. If the default catalog behavior changes, then the v2 session catalog needs to be introduced to handle those cases. Similarly, if we add the v2 session catalog, then we need to update the table resolution rules. It is reasonable to fix those two at the same time. Fixing just table resolution in this PR breaks v2 providers, which is unnecessary.
I'm not sure what you mean. The …
This doesn't make sense to me because we agree about what the behavior should be. This doesn't propose substantial changes to the behavior in the other PR, just minor fixes to temporary table handling that we all agree on. In practice, this PR only introduces an alternative implementation that breaks existing conventions and approaches.
Have we ever supported v2 providers in the Hive catalog before? I think one major argument here is: can we separate "fix the table name resolution" and "support v2 providers with the Hive catalog"? I don't see a strong reason to do them together, and I opened this PR to show that it's possible to fix only table name resolution.
        throw new AnalysisException(s"Not allowed to create a permanent view $name by " +
          s"referencing a temporary function `${e.name}`")
      })
    case _ =>
I think it'd be better to add a comment like "do nothing" here.
  /**
   * Permanent views are not allowed to reference temp objects, including temp function and views
   */
  private def verifyTemporaryObjectsNotExists(sparkSession: SparkSession): Unit = {
For the sake of consistency, I think it'd be better to use either sparkSession or session, like the methods below this one.
What changes were proposed in this pull request?
Now users can register multiple catalogs in Spark, and the table name resolution should be compatible with multi-catalog. The expected behavior is simple:
For a table name `a.b.c`:
- if `a` is a registered catalog, then it's a table `c` under namespace `b` in catalog `a`.
- if `a` is not a registered catalog, then it's a table `c` under namespace `a.b` in the default catalog.

However, we need to change the expected behavior a little bit because the builtin hive catalog hasn't migrated to the new catalog API yet:
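The two rules above can be sketched as a small resolution function. This is an illustrative Python sketch of the described behavior, not Spark's actual implementation; the function and parameter names are made up.

```python
# Illustrative sketch of the multi-catalog name resolution rule (not Spark code).
def resolve_table_name(parts, registered_catalogs, default_catalog):
    """Split a multi-part name into (catalog, namespace, table)."""
    if len(parts) > 1 and parts[0] in registered_catalogs:
        # First part names a registered catalog:
        # a.b.c -> catalog a, namespace (b,), table c
        return parts[0], tuple(parts[1:-1]), parts[-1]
    # Otherwise the whole name belongs to the default catalog:
    # a.b.c -> default catalog, namespace (a, b), table c
    return default_catalog, tuple(parts[:-1]), parts[-1]

print(resolve_table_name(["a", "b", "c"], {"a"}, "default"))
# ('a', ('b',), 'c')
print(resolve_table_name(["a", "b", "c"], set(), "default"))
# ('default', ('a', 'b'), 'c')
```

Note how the same name `a.b.c` resolves to completely different tables depending on whether a catalog named `a` is registered — which is exactly why the resolution order needs to be pinned down precisely.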
The current behavior of table name resolution is a little confusing:
This PR fixes the behavior of the table name resolution:
- `Analyzer` instead of `SparkSession` tracks all the registered catalogs. This is because `SparkSession` is in sql/core, not sql/catalyst.
- `DataSourceResolution` only resolves a table name to the hive catalog when the catalog is not specified in the table name and the default catalog is not set.

How was this patch tested?
new test cases