
Conversation

@cloud-fan (Contributor) commented Jul 8, 2019

What changes were proposed in this pull request?

Now users can register multiple catalogs in Spark, and table name resolution should be compatible with multiple catalogs. The expected behavior is simple (sketched in code after this list):

  • For DDL commands that can only deal with tables:
    • If the table name has only one name part, it's a table in the default catalog.
    • If the table name has more than one name part, like a.b.c:
      • If a is a registered catalog, it's the table c under namespace b in catalog a.
      • If a is not a registered catalog, it's the table c under namespace a.b in the default catalog.
  • For SELECT/INSERT, which can handle both tables and temp views: first check whether the table name is a temp view or global temp view; otherwise the rules are the same as for DDL commands.
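
To make these rules concrete, here is a minimal sketch of the resolution order in Scala. The helper name, parameters, and return shape are illustrative assumptions, not the actual Spark API:

    // Returns (catalog, namespace, table) for a multi-part table name.
    def resolveName(
        parts: Seq[String],
        registeredCatalogs: Set[String],
        defaultCatalog: String): (String, Seq[String], String) = parts match {
      // A single name part is a table in the default catalog.
      case Seq(table) =>
        (defaultCatalog, Nil, table)
      // a.b.c where a is a registered catalog: table c under namespace b in catalog a.
      case first +: rest if registeredCatalogs.contains(first) =>
        (first, rest.init, rest.last)
      // a.b.c where a is not a registered catalog: table c under namespace a.b
      // in the default catalog.
      case _ =>
        (defaultCatalog, parts.init, parts.last)
    }

For SELECT/INSERT, the temp view and global temp view lookup would run before a helper like this is consulted.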

However, we need to change the expected behavior a little bit, because the built-in Hive catalog hasn't migrated to the new catalog API yet (a sketch of this fallback follows the list below):

  1. If the default catalog config is set, pick it as the default catalog. Otherwise, pick the Hive catalog as the default catalog.
  2. If the default catalog config is not set and the table name has more than 2 name parts, we should fail with "no catalog specified for table".
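
A rough sketch of this fallback, assuming a RuntimeConfig-style conf handle and a hypothetical key name for the default catalog config (neither is spelled out in this thread):

    // "spark.sql.default.catalog" is an assumed key, for illustration only.
    val defaultCatalog: Option[String] = conf.getOption("spark.sql.default.catalog")

    // Reached only when the first name part is not a registered catalog.
    (defaultCatalog, nameParts) match {
      case (Some(catalogName), _) =>
        // The configured default catalog resolves the name.
      case (None, parts) if parts.length > 2 =>
        throw new AnalysisException(
          s"No catalog specified for table ${parts.mkString(".")}")
      case (None, _) =>
        // Fall back to the built-in Hive catalog.
    }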

The current behavior of table name resolution is a little confusing:

  • For DDL commands that can only deal with tables:
    • If the first part of the table name matches a registered catalog, then it's a table in that catalog. (expected)
    • Otherwise, if the table name has fewer than 3 parts and the provider name is v1, go with the built-in Hive catalog. (This is not expected. By design, different catalogs can interpret the table provider name differently. We should go with the default catalog if the config is set, no matter what the table provider name is.)
  • For SELECT/INSERT, which can handle both tables and temp views:
    • If the first part of the table name matches a registered catalog, then it's a table in that catalog. (expected)
    • Otherwise, if the table name has fewer than 3 parts, go with the built-in Hive catalog. (This is not expected, as we need to respect the default catalog config.)
    • Otherwise, the query is unresolved. (This is not expected, as we need to respect the default catalog config.)

This PR fixes the behavior of table name resolution (a sketch of the new catalog tracking follows the list below):

  1. The Analyzer, instead of SparkSession, now tracks all the registered catalogs. This is because SparkSession lives in sql/core, not sql/catalyst.
  2. The rule DataSourceResolution only resolves a table name to the Hive catalog when no catalog is specified in the table name and the default catalog is not set.
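
The CI output later in this thread shows that the patch adds a CatalogManager class. Its actual interface isn't visible here, so the following is only a guess at what such a catalog registry might look like; the method names, the defaultV2Catalog accessor, and the Catalogs.load mechanism are all assumptions:

    import scala.collection.mutable

    class CatalogManager(conf: SQLConf) {
      private val catalogs = mutable.HashMap.empty[String, CatalogPlugin]

      // Look up a catalog by name, loading and caching it on first use.
      def catalog(name: String): CatalogPlugin = synchronized {
        catalogs.getOrElseUpdate(name, Catalogs.load(name, conf))
      }

      // The default catalog, if the config is set.
      // defaultV2Catalog is an assumed accessor for the default catalog config.
      def defaultCatalog: Option[CatalogPlugin] =
        conf.defaultV2Catalog.map(catalog)
    }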

How was this patch tested?

New test cases.

@cloud-fan (Contributor Author):

This is not needed anymore, because in #24741 we delay the error reporting of unresolved relations to CheckAnalysis.

@cloud-fan (Contributor Author):

This test is a good demonstration of the expected behavior after this fix.

@SparkQA commented Jul 8, 2019

Test build #107354 has finished for PR 25077 at commit 1231ebb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val externalCatalog: ExternalCatalog)

@cloud-fan (Contributor Author):

retest this please

@cloud-fan (Contributor Author):

cc @rxin @marmbrus @rdblue @dongjoon-hyun @gengliangwang

@SparkQA commented Jul 8, 2019

Test build #107355 has finished for PR 25077 at commit 1231ebb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val externalCatalog: ExternalCatalog)

@dongjoon-hyun (Member):

Thank you for pinging me, @cloud-fan. The failure seems to be the same one.

@dongjoon-hyun (Member):

Retest this please.

@marmbrus (Contributor) commented Jul 8, 2019

/cc @brkyvz @jose-torres

@SparkQA commented Jul 8, 2019

Test build #107363 has finished for PR 25077 at commit 1231ebb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val externalCatalog: ExternalCatalog)

@rdblue (Contributor) commented Jul 9, 2019

-1

I agree with the idea behind this PR, which is that the v2 catalog default should be applied the same way in all cases... that's why I fixed this in #24768. From that PR's description:

Updates catalog lookup to always provide the default if it is set for consistent behavior.

And from discussion:

After this PR, the rules to determine the catalog responsible for an identifier are:

  1. If the identifier starts with a known catalog, use it
  2. If there is a configured default v2 catalog, use that catalog
  3. Otherwise, the session catalog is responsible for the identifier

While this PR includes some additional fixes for handling temporary tables, I don't understand why this was necessary when another PR is proposing the same behavior change and could be updated to fix those issues as well.

There are also significant problems with this alternative implementation. It reverses choices that were made to minimize the risk to existing users of adding multi-catalog support to Spark 3.0. The plan is to change as few parts of v1 as possible to ensure its behavior does not change, but this PR makes the v1 SessionCatalog responsible for proxying calls to load v2 tables by modifying the signature and behavior of lookupRelation. These changes are not needed to update the default catalog behavior. Maybe this is the right approach, but it should be proposed and considered independently.

Similarly, AsTableIdentifier is used in rules that previously matched TableIdentifier, and it guarantees that the rule will not accidentally match an identifier with a catalog part. But this PR rewrites AsTableIdentifier so that it will match any 2-part identifier without the catalog check, so it now must be applied after CatalogObjectIdentifier. Previously correct uses of AsTableIdentifier are now unsafe.

To sum it up, this appears to be proposing a different implementation, not different behavior. But it is quite invasive and different from the design that we have been building up to now. Some of these changes may be good ideas, but let's separate them into their own PRs and discuss the merits of each. If you want to change the design to add a proxy catalog, then please bring it up at the sync or write up a proposal.

@cloud-fan (Contributor Author) commented Jul 9, 2019

The reasons I opened this PR:

  1. The fix in [SPARK-27919][SQL] Add v2 session catalog #24768 leads to behaviors different from what I expect: a) temp views should have higher priority; b) the table provider should not affect table name resolution. The implementation also needs to be different for the proposed behaviors, which is one reason I opened this PR.
  2. [SPARK-27919][SQL] Add v2 session catalog #24768 does 2 things together: 1. fix table name resolution; 2. support v2 providers with the Hive catalog. Since discussion is needed for 1, I think it's more efficient to have an individual PR for it. Since the implementation I have in mind is different, I opened a new PR instead of asking you to separate the changes into a new PR. It also shows that it's possible to fix the table name resolution behavior without introducing a v2 session catalog.

but this PR makes the v1 SessionCatalog responsible for proxying calls to load v2 tables by modifying the signature and behavior of lookupRelation.

The lookupRelation should belong to the analyzer instead of the catalog, since the catalog doesn't know the relation plans Spark uses. We put it in SessionCatalog for historical reasons; I can move it to the analyzer in a new PR if that's better. To clarify, I agree with you that the v1 session catalog should not proxy calls to v2 catalogs.

But this PR rewrites AsTableIdentifier so that it will match any 2-part identifier without the catalog check so it now must be applied after CatalogObjectIdentifier.

Thanks for catching this bug! I feel it's cleaner to call AsTableIdentifier after CatalogObjectIdentifier, but we must guarantee that when AsTableIdentifier is called, no v2 catalog can be found. I've pushed a commit to fix it.
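
For readers following along, a minimal illustration of the ordering constraint being discussed; the extractor names come from the thread, but the surrounding match and the two helpers are hypothetical:

    nameParts match {
      // Tried first: the name starts with a registered v2 catalog.
      case CatalogObjectIdentifier(Some(catalog), ident) =>
        resolveWithV2Catalog(catalog, ident)      // hypothetical helper
      // Safe only because the case above has already claimed every name
      // that a v2 catalog can handle.
      case AsTableIdentifier(tableIdent) =>
        resolveWithSessionCatalog(tableIdent)     // hypothetical helper
    }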

Note that the main point of this PR is to trigger discussion and reach a consensus about the behavior of table name resolution, including all the details. Once we reach a consensus, we can merge your PR first, as long as it matches the expected behavior (you can take some tests from this PR), and rebase my PR as a refactor.

@SparkQA commented Jul 9, 2019

Test build #107387 has finished for PR 25077 at commit d3f23b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author):

To avoid future confusion, I've reverted the changes in SessionCatalog and moved the related changes to Analyzer.

@SparkQA commented Jul 9, 2019

Test build #107397 has finished for PR 25077 at commit 6a27efe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf)

@cloud-fan (Contributor Author):

retest this please

@SparkQA commented Jul 9, 2019

Test build #107399 has finished for PR 25077 at commit 6a27efe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf)

@SparkQA commented Jul 9, 2019

Test build #107404 has finished for PR 25077 at commit 4ed4399.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor) commented Jul 9, 2019

The fix in #24768 leads to behaviors different from what I expect: a) temp views should have higher priority; b) the table provider should not affect table name resolution. The implementation also needs to be different for the proposed behaviors, which is one reason I opened this PR.

Usually, I would expect you to point out problems with a PR in a review instead of opening a separate PR that doesn't mention the original. In the future, just coordinate about how to get the work done.

  1. fix table name resolution; 2. support v2 providers with the Hive catalog

That PR solves these two at the same time because the default catalog was introduced so that v2 providers could be used. If the default catalog behavior changes, then the v2 session catalog needs to be introduced to handle those cases. Similarly, if we add the v2 session catalog, then we need to update the table resolution rules.

It is reasonable to fix those two at the same time. Fixing just table resolution in this PR breaks v2 providers, which is unnecessary.

The lookupRelation should belong to the analyzer instead of the catalog, since the catalog doesn't know the relation plans Spark uses

I'm not sure what you mean. The ResolveTable rule already does this in the analyzer. Moving this to SessionCatalog is a different approach entirely. Like I said, I'm happy to discuss the merits of this, but please don't mix these changes together.

the main point of this PR is to trigger discussion and reach a consensus about the behavior of table name resolution

This doesn't make sense to me because we agree about what the behavior should be. This doesn't propose substantial changes to the behavior in the other PR, just minor fixes to temporary table handling that we all agree on. In practice, this PR only introduces an alternative implementation that breaks existing conventions and approaches.

@cloud-fan (Contributor Author):

It is reasonable to fix those two at the same time. Fixing just table resolution in this PR breaks v2 providers

Have we ever supported v2 providers in the Hive catalog before?

I think one major argument here is: can we separate "fix the table name resolution" and "support v2 providers with the Hive catalog"? I don't see a strong reason to do them together, and I opened this PR to show that it's possible to fix only table name resolution.

      throw new AnalysisException(s"Not allowed to create a permanent view $name by " +
        s"referencing a temporary function `${e.name}`")
    })
  case _ =>

Review comment:

I think it'd be better to add a comment here, like // do nothing.

/**
 * Permanent views are not allowed to reference temp objects, including temp function and views
 */
private def verifyTemporaryObjectsNotExists(sparkSession: SparkSession): Unit = {

Review comment:

For the sake of consistency, I think it'd be better to use either sparkSession or session, matching the methods that follow this one.

@cloud-fan closed this Sep 16, 2019