
Conversation

@cloud-fan (Contributor) commented Jul 8, 2019

What changes were proposed in this pull request?

Now users can register multiple catalogs in Spark, and table name resolution should be compatible with multiple catalogs. The expected behavior is simple (sketched in code after this list):

  • For DDL commands that can only deal with tables:
    • If the table name has only one name part, it's a table in the default catalog.
    • If the table name has more than one name part, like a.b.c:
      • If a is a registered catalog, it's the table c under namespace b in catalog a.
      • If a is not a registered catalog, it's the table c under namespace a.b in the default catalog.
  • For SELECT/INSERT, which can handle both tables and temp views: first check whether the table name is a temp view or global temp view; otherwise the rules are the same as for DDL commands.
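
To make these rules concrete, here is a minimal sketch of the resolution order in Scala. The helper name, parameters, and return shape are illustrative assumptions, not the actual Spark API:

    // Returns (catalog, namespace, table) for a multi-part table name.
    def resolveName(
        parts: Seq[String],
        registeredCatalogs: Set[String],
        defaultCatalog: String): (String, Seq[String], String) = parts match {
      // A single name part is a table in the default catalog.
      case Seq(table) =>
        (defaultCatalog, Nil, table)
      // a.b.c where a is a registered catalog: table c under namespace b in catalog a.
      case first +: rest if registeredCatalogs.contains(first) =>
        (first, rest.init, rest.last)
      // a.b.c where a is not a registered catalog: table c under namespace a.b
      // in the default catalog.
      case _ =>
        (defaultCatalog, parts.init, parts.last)
    }

For SELECT/INSERT, the temp view and global temp view lookup would run before a helper like this is consulted.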

However, we need to change the expected behavior a little bit, because the built-in Hive catalog hasn't migrated to the new catalog API yet (a sketch of this fallback follows the list below):

  1. If the default catalog config is set, pick it as the default catalog. Otherwise, pick the Hive catalog as the default catalog.
  2. If the default catalog config is not set and the table name has more than 2 name parts, we should fail with "no catalog specified for table".
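
A rough sketch of this fallback, assuming a RuntimeConfig-style conf handle and a hypothetical key name for the default catalog config (neither is spelled out in this thread):

    // "spark.sql.default.catalog" is an assumed key, for illustration only.
    val defaultCatalog: Option[String] = conf.getOption("spark.sql.default.catalog")

    // Reached only when the first name part is not a registered catalog.
    (defaultCatalog, nameParts) match {
      case (Some(catalogName), _) =>
        // The configured default catalog resolves the name.
      case (None, parts) if parts.length > 2 =>
        throw new AnalysisException(
          s"No catalog specified for table ${parts.mkString(".")}")
      case (None, _) =>
        // Fall back to the built-in Hive catalog.
    }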

The current behavior of table name resolution is a little confusing:

  • For DDL commands that can only deal with tables:
    • If the first part of the table name matches a registered catalog, then it's a table in that catalog. (expected)
    • Otherwise, if the table name has fewer than 3 parts and the provider name is v1, go with the built-in Hive catalog. (This is not expected. By design, different catalogs can interpret the table provider name differently. We should go with the default catalog if the config is set, no matter what the table provider name is.)
  • For SELECT/INSERT, which can handle both tables and temp views:
    • If the first part of the table name matches a registered catalog, then it's a table in that catalog. (expected)
    • Otherwise, if the table name has fewer than 3 parts, go with the built-in Hive catalog. (This is not expected, as we need to respect the default catalog config.)
    • Otherwise, the query is unresolved. (This is not expected, as we need to respect the default catalog config.)

This PR fixes the behavior of table name resolution (a sketch of the new catalog tracking follows the list below):

  1. The Analyzer, instead of SparkSession, now tracks all the registered catalogs. This is because SparkSession lives in sql/core, not sql/catalyst.
  2. The rule DataSourceResolution only resolves a table name to the Hive catalog when no catalog is specified in the table name and the default catalog is not set.
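
The CI output later in this thread shows that the patch adds a CatalogManager class. Its actual interface isn't visible here, so the following is only a guess at what such a catalog registry might look like; the method names, the defaultV2Catalog accessor, and the Catalogs.load mechanism are all assumptions:

    import scala.collection.mutable

    class CatalogManager(conf: SQLConf) {
      private val catalogs = mutable.HashMap.empty[String, CatalogPlugin]

      // Look up a catalog by name, loading and caching it on first use.
      def catalog(name: String): CatalogPlugin = synchronized {
        catalogs.getOrElseUpdate(name, Catalogs.load(name, conf))
      }

      // The default catalog, if the config is set.
      // defaultV2Catalog is an assumed accessor for the default catalog config.
      def defaultCatalog: Option[CatalogPlugin] =
        conf.defaultV2Catalog.map(catalog)
    }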

How was this patch tested?

New test cases.

@cloud-fan (Contributor Author):

This is not needed anymore, because in #24741 we delay the error reporting of unresolved relations to CheckAnalysis.

@cloud-fan (Contributor Author):

This test is a good demonstration of the expected behavior after this fix.

@SparkQA commented Jul 8, 2019

Test build #107354 has finished for PR 25077 at commit 1231ebb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val externalCatalog: ExternalCatalog)

@cloud-fan (Contributor Author):

retest this please

@cloud-fan (Contributor Author):

cc @rxin @marmbrus @rdblue @dongjoon-hyun @gengliangwang

@SparkQA commented Jul 8, 2019

Test build #107355 has finished for PR 25077 at commit 1231ebb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val externalCatalog: ExternalCatalog)

@dongjoon-hyun (Member):

Thank you for pinging me, @cloud-fan. The failure seems to be the same one.

@dongjoon-hyun (Member):

Retest this please.

@marmbrus (Contributor) commented Jul 8, 2019

/cc @brkyvz @jose-torres

@SparkQA commented Jul 8, 2019

Test build #107363 has finished for PR 25077 at commit 1231ebb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val externalCatalog: ExternalCatalog)

@rdblue (Contributor) commented Jul 9, 2019

-1

I agree with the idea behind this PR, which is that the v2 catalog default should be applied the same way in all cases... that's why I fixed this in #24768. From that PR's description:

Updates catalog lookup to always provide the default if it is set for consistent behavior.

And from discussion:

After this PR, the rules to determine the catalog responsible for an identifier are:

  1. If the identifier starts with a known catalog, use it
  2. If there is a configured default v2 catalog, use that catalog
  3. Otherwise, the session catalog is responsible for the identifier

While this PR includes some additional fixes for handling temporary tables, I don't understand why this was necessary when another PR is proposing the same behavior change and could be updated to fix those issues as well.

There are also significant problems with this alternative implementation. It reverses choices that were made to minimize the risk to existing users of adding multi-catalog support to Spark 3.0. The plan is to change as few parts of v1 as possible to ensure its behavior does not change, but this PR makes the v1 SessionCatalog responsible for proxying calls to load v2 tables by modifying the signature and behavior of lookupRelation. These changes are not needed to update the default catalog behavior. Maybe this is the right approach, but it should be proposed and considered independently.

Similarly, AsTableIdentifier is used in rules that previously matched TableIdentifier, and it guarantees that the rule will not accidentally match an identifier with a catalog part. But this PR rewrites AsTableIdentifier so that it will match any 2-part identifier without the catalog check, so it now must be applied after CatalogObjectIdentifier. Previously correct uses of AsTableIdentifier are now unsafe.

To sum it up, this appears to be proposing a different implementation, not different behavior. But it is quite invasive and different from the design that we have been building up to now. Some of these changes may be good ideas, but let's separate them into their own PRs and discuss the merits of each. If you want to change the design to add a proxy catalog, then please bring it up at the sync or write up a proposal.

@cloud-fan (Contributor Author) commented Jul 9, 2019

The reasons I opened this PR:

  1. The fix in [SPARK-27919][SQL] Add v2 session catalog #24768 leads to behaviors different from what I expect: a) temp views should have higher priority; b) the table provider should not affect table name resolution. The implementation also needs to be different for the proposed behaviors, which is one reason I opened this PR.
  2. [SPARK-27919][SQL] Add v2 session catalog #24768 does 2 things together: 1. fix table name resolution; 2. support v2 providers with the Hive catalog. Since discussion is needed for 1, I think it's more efficient to have an individual PR for it. Since the implementation I have in mind is different, I opened a new PR instead of asking you to separate the changes into a new PR. It also shows that it's possible to fix the table name resolution behavior without introducing a v2 session catalog.

but this PR makes the v1 SessionCatalog responsible for proxying calls to load v2 tables by modifying the signature and behavior of lookupRelation.

The lookupRelation should belong to the analyzer instead of the catalog, since the catalog doesn't know the relation plans Spark uses. We put it in SessionCatalog for historical reasons; I can move it to the analyzer in a new PR if that's better. To clarify, I agree with you that the v1 session catalog should not proxy calls to v2 catalogs.

But this PR rewrites AsTableIdentifier so that it will match any 2-part identifier without the catalog check so it now must be applied after CatalogObjectIdentifier.

Thanks for catching this bug! I feel it's cleaner to call AsTableIdentifier after CatalogObjectIdentifier, but we must guarantee that when AsTableIdentifier is called, no v2 catalog can be found. I've pushed a commit to fix it.
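
For readers following along, a minimal illustration of the ordering constraint being discussed; the extractor names come from the thread, but the surrounding match and the two helpers are hypothetical:

    nameParts match {
      // Tried first: the name starts with a registered v2 catalog.
      case CatalogObjectIdentifier(Some(catalog), ident) =>
        resolveWithV2Catalog(catalog, ident)      // hypothetical helper
      // Safe only because the case above has already claimed every name
      // that a v2 catalog can handle.
      case AsTableIdentifier(tableIdent) =>
        resolveWithSessionCatalog(tableIdent)     // hypothetical helper
    }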

Note that the main point of this PR is to trigger discussion and reach a consensus about the behavior of table name resolution, including all the details. Once we reach a consensus, we can merge your PR first, as long as it matches the expected behavior (you can take some tests from this PR), and rebase my PR as a refactor.

@SparkQA commented Jul 9, 2019

Test build #107387 has finished for PR 25077 at commit d3f23b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author):

To avoid future confusion, I've reverted the changes in SessionCatalog and moved the related changes to Analyzer.

@SparkQA commented Jul 9, 2019

Test build #107397 has finished for PR 25077 at commit 6a27efe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf)

@cloud-fan (Contributor Author):

retest this please

@SparkQA commented Jul 9, 2019

Test build #107399 has finished for PR 25077 at commit 6a27efe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf)

@SparkQA commented Jul 9, 2019

Test build #107404 has finished for PR 25077 at commit 4ed4399.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor) commented Jul 9, 2019

The fix in #24768 leads to behaviors different from what I expect: a) temp views should have higher priority; b) the table provider should not affect table name resolution. The implementation also needs to be different for the proposed behaviors, which is one reason I opened this PR.

Usually, I would expect you to point out problems with a PR in a review instead of opening a separate PR that doesn't mention the original. In the future, just coordinate about how to get the work done.

  1. fix table name resolution; 2. support v2 providers with the Hive catalog

That PR solves these two at the same time because the default catalog was introduced so that v2 providers could be used. If the default catalog behavior changes, then the v2 session catalog needs to be introduced to handle those cases. Similarly, if we add the v2 session catalog, then we need to update the table resolution rules.

It is reasonable to fix those two at the same time. Fixing just table resolution in this PR breaks v2 providers, which is unnecessary.

The lookupRelation should belong to the analyzer instead of the catalog, since the catalog doesn't know the relation plans Spark uses

I'm not sure what you mean. The ResolveTable rule already does this in the analyzer. Moving this to SessionCatalog is a different approach entirely. Like I said, I'm happy to discuss the merits of this, but please don't mix these changes together.

the main point of this PR is to trigger discussion and reach a consensus about the behavior of table name resolution

This doesn't make sense to me because we agree about what the behavior should be. This doesn't propose substantial changes to the behavior in the other PR, just minor fixes to temporary table handling that we all agree on. In practice, this PR only introduces an alternative implementation that breaks existing conventions and approaches.

@cloud-fan (Contributor Author):

It is reasonable to fix those two at the same time. Fixing just table resolution in this PR breaks v2 providers

Have we ever supported v2 providers in the Hive catalog before?

I think one major argument here is: can we separate "fix the table name resolution" and "support v2 providers with the Hive catalog"? I don't see a strong reason to do them together, and I opened this PR to show that it's possible to fix only table name resolution.

      throw new AnalysisException(s"Not allowed to create a permanent view $name by " +
        s"referencing a temporary function `${e.name}`")
    })
  case _ =>

Review comment:

I think it'd be better to add a comment here, like // do nothing.

/**
 * Permanent views are not allowed to reference temp objects, including temp function and views
 */
private def verifyTemporaryObjectsNotExists(sparkSession: SparkSession): Unit = {

Review comment:

For the sake of consistency, I think it'd be better to use either sparkSession or session, matching the methods that follow this one.

@cloud-fan closed this Sep 16, 2019