Conversation

@imback82
Contributor

What changes were proposed in this pull request?

Implement the SHOW DATABASES logical and physical plans for data source v2 tables.

Why are the changes needed?

To support SHOW DATABASES SQL commands for v2 tables.

Does this PR introduce any user-facing change?

spark.sql("SHOW DATABASES") will return namespaces if the default catalog is set:

+---------------+
|      namespace|
+---------------+
|            ns1|
|      ns1.ns1_1|
|ns1.ns1_1.ns1_2|
+---------------+

How was this patch tested?

Added unit tests to DataSourceV2SQLSuite.

@imback82
Contributor Author

cc: @rdblue @cloud-fan

@SparkQA

SparkQA commented Aug 28, 2019

Test build #109836 has finished for PR 25601 at commit 87fe6ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

shall we call it SHOW NAMESPACES?

@imback82
Contributor Author

Do we need to support both SHOW DATABASES and SHOW NAMESPACES or just SHOW NAMESPACES?

@cloud-fan
Contributor

I think we have to keep SHOW DATABASES for backward compatibility. We can just treat SHOW DATABASES as an alias of SHOW NAMESPACES. @rdblue what do you think?

    pattern: Option[String])
    extends LeafExecNode {
  override protected def doExecute(): RDD[InternalRow] = {
    val namespaces = catalog.listNamespaces().flatMap(getNamespaces(catalog, _))
Contributor

This shouldn't list the entire space. It should only call listNamespaces once. If the current namespace is an empty array, then call listNamespaces(); if it is anything else, call listNamespaces(current).

From the SPIP:

For a given operation, Spark will call the corresponding catalog method once. For example, SHOW TABLES will return results from listTables(currentNamespace). Spark will not traverse nested namespaces with multiple calls to listNamespaces and listTables.
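
The difference between a recursive traversal and the single call described in the SPIP quote can be sketched as follows. This is a minimal illustration, not Spark code: `ToyCatalog`, `InMemoryCatalog`, `listRecursively`, and `listOnce` are hypothetical names, and the catalog is a plain in-memory stand-in that counts how often it is queried.

```scala
object SingleCallListing {
  // Hypothetical stand-in for a namespace-aware catalog (not Spark's interface).
  trait ToyCatalog {
    def listNamespaces(): Seq[Seq[String]]
    def listNamespaces(parent: Seq[String]): Seq[Seq[String]]
  }

  // In-memory catalog built from a flat list of fully qualified namespaces.
  final class InMemoryCatalog(all: Seq[Seq[String]]) extends ToyCatalog {
    var calls = 0 // number of listing calls made against this catalog

    def listNamespaces(): Seq[Seq[String]] = listNamespaces(Seq.empty)

    // Returns only the direct children of `parent`.
    def listNamespaces(parent: Seq[String]): Seq[Seq[String]] = {
      calls += 1
      all.filter(ns => ns.length == parent.length + 1 && ns.startsWith(parent))
    }
  }

  // Recursive traversal: one catalog call per namespace visited.
  def listRecursively(c: ToyCatalog, parent: Seq[String]): Seq[Seq[String]] = {
    val children = if (parent.isEmpty) c.listNamespaces() else c.listNamespaces(parent)
    children.flatMap(ns => ns +: listRecursively(c, ns))
  }

  // Single call: list only what sits directly under the given namespace.
  def listOnce(c: ToyCatalog, current: Seq[String]): Seq[Seq[String]] =
    if (current.isEmpty) c.listNamespaces() else c.listNamespaces(current)
}
```

With the three namespaces from the PR description, the recursive version returns all of them at the cost of one call per namespace plus one for the root, while the single-call version returns only `ns1` after one call.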

Contributor

If the current namespace is an empty array then call listNamespaces()

I just realized that this isn't the same behavior as v1. In v1, SHOW DATABASES ignores the current database because databases aren't nested. It always lists all databases (then filters).

The proposed behavior of SHOW NAMESPACES was to respect the current namespace and list namespaces nested in it.

There are a few options to fix this:

  • Add SHOW NAMESPACES that behaves differently than SHOW DATABASES
  • Make SHOW NAMESPACES list all namespaces recursively, like this PR
  • Make SHOW NAMESPACES list the namespace above the current. If current=a.b, then list a and show the results (including b).
  • Change the behavior of SHOW DATABASES to match SHOW NAMESPACES and list the current
  • Change the behavior of SHOW DATABASES to match SHOW NAMESPACES and list the current, but match behavior if the current namespace is "default"

@imback82, @brkyvz, @cloud-fan, @mccheah, any opinion here?

Contributor

Add SHOW NAMESPACES that behaves differently than SHOW DATABASES

I prefer this.

Another idea is: SHOW NAMESPACES should list the root namespaces of the current catalog, no matter what the current namespace is.

Contributor

Another option is to ignore the current namespace entirely. SHOW NAMESPACES would list the root, and SHOW NAMESPACES IN ns1 lists namespaces in ns1. The context is always explicit.

I think I would prefer that option.

Contributor Author

Another option is to ignore the current namespace entirely. SHOW NAMESPACES would list the root, and SHOW NAMESPACES IN ns1 lists namespaces in ns1. The context is always explicit.

I like this idea. @cloud-fan are you OK with this approach?

Contributor

@cloud-fan Sep 2, 2019

Yea, that's exactly what I mentioned before, with the addition of SHOW NAMESPACES IN ns1. +1

Another idea is: SHOW NAMESPACES should list the root namespaces of the current catalog, no matter what the current namespace is.

Contributor Author

@cloud-fan @rdblue Thanks for your suggestions.

    | DROP database (IF EXISTS)? db=errorCapturingIdentifier
        (RESTRICT | CASCADE)?                                      #dropDatabase
    | SHOW DATABASES (LIKE? pattern=STRING)?                       #showDatabases
    | SHOW NAMESPACES ((FROM | IN) multipartIdentifier)?
Contributor Author

I put both FROM and IN similar to SHOW TABLES. Please let me know if FROM is not needed.
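
To make the accepted forms concrete, here is a toy regex-based parser for the `SHOW NAMESPACES` shape discussed above. It is only a sketch and does not reflect the ANTLR grammar in the PR: `ToyShowNamespacesParser` and `parse` are hypothetical names, the case class merely mirrors the shape of `ShowNamespacesStatement` quoted in this review, and the optional `LIKE` pattern is an assumption because the quoted grammar line is cut off.

```scala
object ToyShowNamespacesParser {
  // Same shape as the ShowNamespacesStatement quoted in this review; redefined
  // here so the example is self-contained.
  final case class ShowNamespacesStatement(
      namespace: Option[Seq[String]],
      pattern: Option[String])

  // SHOW NAMESPACES [(FROM | IN) multi.part.name] [[LIKE] 'pattern']
  // The trailing pattern clause is an assumption, not taken from the grammar.
  private val Show =
    """(?i)\s*SHOW\s+NAMESPACES(?:\s+(?:FROM|IN)\s+([A-Za-z0-9_.]+))?(?:\s+(?:LIKE\s+)?'([^']*)')?\s*""".r

  def parse(sql: String): Option[ShowNamespacesStatement] = sql match {
    case Show(ns, pattern) =>
      // Unmatched optional groups come back as null, so wrap them in Option.
      Some(ShowNamespacesStatement(Option(ns).map(_.split('.').toList), Option(pattern)))
    case _ => None
  }
}
```

In this sketch `FROM` and `IN` are interchangeable, so `SHOW NAMESPACES FROM ns1` and `SHOW NAMESPACES IN ns1` produce the same statement.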

    extends LeafExecNode {
  override protected def doExecute(): RDD[InternalRow] = {
    val namespaces = namespace.map{ ns =>
      if (ns.nonEmpty) {
Contributor Author

@rdblue @cloud-fan this is for handling the case SHOW NAMESPACES IN catalogname. In this case, should we list the root namespaces or call listNamespaces with an empty array?

Contributor

should we list the root namespaces or call listNamespaces with an empty array?

I think these 2 are the same?

Contributor Author

From the SPIP, I see the following:

SHOW NAMESPACES IN foo
    Returns the result of sparkSession.catalog("foo").listNamespaces().

Since the behavior of listNamespaces(Array()) depends on the implementation, I think it's safe to check and call listNamespaces(). @rdblue What do you think?

Contributor

@rdblue Sep 4, 2019

Calling listNamespaces() sounds good to me.
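
The agreed handling can be sketched as below: an absent or empty namespace goes to the no-argument `listNamespaces()`, and only a non-empty one is passed as a parent. This is an illustration under stated assumptions, not the code in the PR: `ToyCatalog`, `StrictCatalog`, and `showNamespaces` are hypothetical, and `StrictCatalog` deliberately rejects an empty parent to model the implementation-dependent behavior mentioned above.

```scala
object ShowNamespacesExecSketch {
  // Hypothetical stand-in for a namespace-aware catalog (not Spark's interface).
  trait ToyCatalog {
    def listNamespaces(): Array[Array[String]]
    def listNamespaces(namespace: Array[String]): Array[Array[String]]
  }

  // A catalog that refuses an empty parent namespace.
  object StrictCatalog extends ToyCatalog {
    def listNamespaces(): Array[Array[String]] = Array(Array("ns1"), Array("ns2"))

    def listNamespaces(namespace: Array[String]): Array[Array[String]] = {
      require(namespace.nonEmpty, "parent namespace must not be empty")
      if (namespace.toSeq == Seq("ns1")) Array(Array("ns1", "ns1_1"))
      else Array.empty[Array[String]]
    }
  }

  // None and Some(empty) both mean "list the root namespaces".
  def showNamespaces(catalog: ToyCatalog, namespace: Option[Seq[String]]): List[String] = {
    val found = namespace match {
      case Some(ns) if ns.nonEmpty => catalog.listNamespaces(ns.toArray)
      case _ => catalog.listNamespaces()
    }
    found.map(_.mkString(".")).toList
  }
}
```

Because the empty case never reaches `listNamespaces(Array())`, the result does not depend on how a particular catalog treats an empty parent.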

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110080 has finished for PR 25601 at commit 9974a58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110081 has finished for PR 25601 at commit ba1e7f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110083 has finished for PR 25601 at commit 9f738cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<tr><td>MONTH</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>MONTHS</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
<tr><td>MSCK</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
<tr><td>NAMESPACES</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
Contributor

cc @xianyinxin , we should also add DELETE and UPDATE. Can you open a PR to do it?

ok, will open a pr.

DELETE is already there. UPDATE is included in #25626

      ShowTablesStatement(Some(Seq("tbl")), Some("*dog*")))
  }

  test("show namespaces") {
Contributor

cc @xianyinxin can you add similar parser tests for DELETE/UPDATE as well?

For update, #25626 has added some parser cases. For delete, it will be done in #25652.

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110093 has finished for PR 25601 at commit 672f526.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110105 has finished for PR 25601 at commit 672f526.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val CatalogNamespace(maybeCatalog, ns) = namespace
maybeCatalog match {
  case Some(catalog: SupportsNamespaces) =>
    ShowNamespaces(catalog, Some(ns), pattern)
Contributor

maybe better to write if (ns.nonEmpty) Some(ns) else None

Contributor Author

I put this logic in ShowNamespacesExec. If we move it here, there is an implicit contract that if namespace is Some, it should be nonEmpty (meaning I need to add a require check to make it explicit in ShowNamespacesExec).

 */
case class ShowNamespaces(
    catalog: SupportsNamespaces,
    namespace: Option[Seq[String]],
Contributor

After reading the code, it's actually catalogAndNamespace, right?

Contributor Author

@imback82 Sep 4, 2019

This is just namespace since catalog is already resolved in DataSourceResolution.scala.

/**
 * A SHOW NAMESPACES statement, as parsed from SQL.
 */
case class ShowNamespacesStatement(namespace: Option[Seq[String]], pattern: Option[String])
Contributor

ditto

Contributor Author

Yes, this is catalog + namespace, but I followed the same convention as other statements - i.e., CreateTableStatement has tableName instead of catalogAndTableName. Please let me know if you prefer catalogAndNamespace here.

@cloud-fan
Contributor

LGTM except some code style issues.

@SparkQA

SparkQA commented Sep 4, 2019

Test build #110137 has finished for PR 25601 at commit 6256aeb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class BasicInMemoryTableCatalog extends TableCatalog
  • class InMemoryTableCatalog extends BasicInMemoryTableCatalog with SupportsNamespaces

  case Some(catalog: SupportsNamespaces) =>
    ShowNamespaces(catalog, Some(ns), pattern)
  case _ =>
    throw new AnalysisException(
Contributor

I think this needs to distinguish between the case where the catalog is None and the catalog does not support namespaces. For the second case, this should report that the catalog doesn't support namespaces. You can also add a conversion method, asNamespaceCatalog to CatalogV2Utils like asTableCatalog.

Contributor Author

Using asNamespaceCatalog simplifies the matching. Thanks.
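
A conversion helper of the kind suggested above could look roughly like this. It is a self-contained sketch, not the helper added to Spark: the `CatalogPlugin` and `SupportsNamespaces` traits here are local stand-ins that only borrow the names used in this thread, `TablesOnly`, `WithNamespaces`, and `resolve` are hypothetical, and standard exceptions replace `AnalysisException` so the example runs without Spark.

```scala
object CatalogConversionSketch {
  // Local stand-ins; these are NOT the Spark interfaces of the same name.
  trait CatalogPlugin { def name: String }
  trait SupportsNamespaces extends CatalogPlugin

  final case class TablesOnly(name: String) extends CatalogPlugin
  final case class WithNamespaces(name: String) extends SupportsNamespaces

  // Narrow a catalog to the namespace-aware interface or fail with a specific message.
  def asNamespaceCatalog(plugin: CatalogPlugin): SupportsNamespaces = plugin match {
    case c: SupportsNamespaces => c
    case other =>
      throw new IllegalArgumentException(
        s"Cannot use catalog ${other.name}: does not support namespaces")
  }

  // Distinguish "no catalog at all" from "catalog without namespace support".
  def resolve(maybeCatalog: Option[CatalogPlugin]): SupportsNamespaces = maybeCatalog match {
    case Some(catalog) => asNamespaceCatalog(catalog)
    case None => throw new IllegalStateException("No catalog was specified or configured")
  }
}
```

Keeping the two failure modes separate means each one can report its own error message instead of sharing a single catch-all case.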

Contributor

Why not use the current catalog instead of failing?

Contributor

If the catalog name is specified, but the catalog doesn't support namespaces, I think we should fail instead of falling back to the current catalog.

It's similar to: if the catalog name is specified, but doesn't contain the table we need, we should fail instead of falling back to the current catalog.

@SparkQA

SparkQA commented Sep 5, 2019

Test build #110155 has finished for PR 25601 at commit 2295f91.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

      DeleteFromTable(aliased, delete.condition)

    case ShowNamespacesStatement(None, pattern) =>
      defaultCatalog match {
Contributor

I think this should be currentCatalog instead. @cloud-fan, do you agree?

Contributor

yea

Contributor

Let's implement switching the current catalog first, otherwise we are not able to test it.

Contributor

@imback82 are you working on it?

Contributor Author

yes, I am working on USE NAMESPACE

Contributor Author

I should be able to send out the PR sometime tomorrow.

@rdblue
Contributor

rdblue commented Sep 6, 2019

This looks good to me other than the behavior when the catalog is not included in the IN clause. This uses defaultCatalog, but now the way we track catalogs has changed a bit:

  • The current catalog should be used when no catalog is specified
  • The default catalog is the catalog that the current catalog is initialized to
  • If the default catalog is not set, then it is the built-in Spark session catalog, which I think we intend to call spark_catalog

It looks like these rules are not universally followed, so I've opened SPARK-29014 to track this clean-up. I'm okay merging this and fixing the catalog defaults in a follow-up.
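
The three rules listed above can be sketched with a small state holder. This is only an illustration of the rules as stated in this comment, not Spark's catalog manager: `ToyCatalogManager` and its members are hypothetical names, and `spark_catalog` is used because the comment says that is the intended name of the built-in session catalog.

```scala
object CatalogTrackingSketch {
  // Name the comment above proposes for the built-in session catalog.
  val SessionCatalogName = "spark_catalog"

  // `configuredDefault` models an optional user-configured default catalog.
  final class ToyCatalogManager(configuredDefault: Option[String]) {
    // Rule 3: fall back to the session catalog when no default is configured.
    val defaultCatalog: String = configuredDefault.getOrElse(SessionCatalogName)

    // Rule 2: the current catalog starts out as the default catalog.
    private var current: String = defaultCatalog

    def currentCatalog: String = current

    def setCurrentCatalog(name: String): Unit = {
      current = name
    }

    // Rule 1: use the current catalog when the statement names none.
    def resolve(explicit: Option[String]): String = explicit.getOrElse(current)
  }
}
```

Under these rules, a statement without a catalog follows the current catalog after it is switched, whereas resolving against `defaultCatalog` would keep returning the initial value.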

FYI: @cloud-fan, @brkyvz.

@imback82
Contributor Author

@cloud-fan do you have any other comments other than what @rdblue brought up?

@SparkQA

SparkQA commented Sep 10, 2019

Test build #110378 has finished for PR 25601 at commit 9a55a03.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 10, 2019

Test build #110393 has finished for PR 25601 at commit 9a55a03.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 10, 2019

Test build #110411 has finished for PR 25601 at commit 9a55a03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in bf43541 Sep 10, 2019
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
Closes apache#25601 from imback82/show_databases.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>