-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28856][SQL] Implement SHOW DATABASES for Data Source V2 Tables #25601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc: @rdblue @cloud-fan |
|
Test build #109836 has finished for PR 25601 at commit
|
|
shall we call it |
|
Do we need to support both |
|
I think we have to keep |
| pattern: Option[String]) | ||
| extends LeafExecNode { | ||
| override protected def doExecute(): RDD[InternalRow] = { | ||
| val namespaces = catalog.listNamespaces().flatMap(getNamespaces(catalog, _)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't list the entire space. It should only call listNamespaces once. If the current namespace is and empty array then call listNamespaces() and if it is anything else, call listNamespaces(current).
From the SPIP:
For a given operation, Spark will call the corresponding catalog method once. For example, SHOW TABLES will return results from listTables(currentNamespace). Spark will not traverse nested namespaces with multiple calls to listNamespaces and listTables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the current namespace is and empty array then call listNamespaces()
I just realized that this isn't the same behavior as v1. In v1, SHOW DATABASES ignores the current database because databases aren't nested. It always lists all databases (then filters).
The proposed behavior of SHOW NAMESPACES was to respect the current namespace and list namespaces nested in it.
There are a few options to fix this:
- Add
SHOW NAMESPACESthat behaves differently thanSHOW DATABASES - Make
SHOW NAMESPACESlist all namespaces recursively, like this PR - Make
SHOW NAMESPACESlist the namespace above the current. Ifcurrent=a.b, then listaand show the results (includingb). - Change the behavior of
SHOW DATABASESto matchSHOW NAMESPACESand list the current - Change the behavior of
SHOW DATABASESto matchSHOW NAMESPACESand list the current, but match behavior if the current namespace is "default"
@imback82, @brkyvz, @cloud-fan, @mccheah, any opinion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add SHOW NAMESPACES that behaves differently than SHOW DATABASES
I prefer this.
Another idea is: SHOW NAMESPACES should list the root namespaces of the current catalog, no matter what the current namespace is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to ignore the current namespace entirely. SHOW NAMESPACES would list the root, and SHOW NAMESPACES IN ns1 lists namespaces in ns1. The context is always explicit.
I think I would prefer that option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to ignore the current namespace entirely.
SHOW NAMESPACESwould list the root, andSHOW NAMESPACES IN ns1lists namespaces inns1. The context is always explicit.
I like this idea. @cloud-fan are you OK with this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's exactly what I mentioned before, with addition of SHOW NAMESPACES IN ns1, +1
Another idea is: SHOW NAMESPACES should list the root namespaces of the current catalog, no matter what the current namespace is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan @rdblue Thanks for your suggestions.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
Outdated
Show resolved
Hide resolved
| | DROP database (IF EXISTS)? db=errorCapturingIdentifier | ||
| (RESTRICT | CASCADE)? #dropDatabase | ||
| | SHOW DATABASES (LIKE? pattern=STRING)? #showDatabases | ||
| | SHOW NAMESPACES ((FROM | IN) multipartIdentifier)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put both FROM and IN similar to SHOW TABLES. Please let me know if FROM is not needed.
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTableCatalog.scala
Outdated
Show resolved
Hide resolved
| extends LeafExecNode { | ||
| override protected def doExecute(): RDD[InternalRow] = { | ||
| val namespaces = namespace.map{ ns => | ||
| if (ns.nonEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue @cloud-fan this is for handling the case SHOW NAMESPACES IN catalogname. In this case, should we list the root namespaces or call listNamespaces with an empty array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we list the root namespaces or call listNamespaces with an empty array?
I think these 2 are the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the SPIP, I see the following:
SHOW NAMESPACES IN foo
Returns the result of
sparkSession.catalog("foo").listNamespaces().
Since the behavior of listNamespaces(Array()) depends on the implementation, I think it's safe to check and call listNamespaces(). @rdblue What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling listNamespaces() sounds good to me.
|
Test build #110080 has finished for PR 25601 at commit
|
|
Test build #110081 has finished for PR 25601 at commit
|
|
Test build #110083 has finished for PR 25601 at commit
|
| <tr><td>MONTH</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr> | ||
| <tr><td>MONTHS</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr> | ||
| <tr><td>MSCK</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr> | ||
| <tr><td>NAMESPACES</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @xianyinxin , we should also add DELETE and UPDATE. Can you open a PR to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will open a pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DELETE is already there. UPDATE is included in #25626
| ShowTablesStatement(Some(Seq("tbl")), Some("*dog*"))) | ||
| } | ||
|
|
||
| test("show namespaces") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @xianyinxin can you add similar parser tests for DELETE/UPDATE as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Test build #110093 has finished for PR 25601 at commit
|
|
retest this please |
|
Test build #110105 has finished for PR 25601 at commit
|
| val CatalogNamespace(maybeCatalog, ns) = namespace | ||
| maybeCatalog match { | ||
| case Some(catalog: SupportsNamespaces) => | ||
| ShowNamespaces(catalog, Some(ns), pattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe better to write if (ns.nonEmpty) Some(ns) else None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this logic in ShowNamespacesExec. If we move here, there is an implicit contract that if namespace is Some, it should be nonEmpty (meaning I need to add a require check to make it explicit in ShowNamespaceExec).
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala
Outdated
Show resolved
Hide resolved
| */ | ||
| case class ShowNamespaces( | ||
| catalog: SupportsNamespaces, | ||
| namespace: Option[Seq[String]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the code, it's actually catalogAndNamespace, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just namespace since catalog is already resolved in DataSourceResolution.scala.
| /** | ||
| * A SHOW NAMESPACES statement, as parsed from SQL. | ||
| */ | ||
| case class ShowNamespacesStatement(namespace: Option[Seq[String]], pattern: Option[String]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is catalog + namespace, but I followed the same convention as other statements - i.e., CreateTableStatement has tableName instead of catalogAndTableName. Please let me know if you prefer catalogAndNamespace here.
|
LGTM except some code style issues. |
|
Test build #110137 has finished for PR 25601 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowNamespacesExec.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala
Outdated
Show resolved
Hide resolved
| case Some(catalog: SupportsNamespaces) => | ||
| ShowNamespaces(catalog, Some(ns), pattern) | ||
| case _ => | ||
| throw new AnalysisException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to distinguish between the case where the catalog is None and the catalog does not support namespaces. For the second case, this should report that the catalog doesn't support namespaces. You can also add a conversion method, asNamespaceCatalog to CatalogV2Utils like asTableCatalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using asNamespaceCatalog simplifies the matching. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the current catalog instead of failing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the catalog name is specified, but catalog doesn't support namespace, I think we should fail instead of falling back to the current catalog.
It's similar to: if the catalog name is specified, but doesn't contain the table we need, we should fail instead of falling back to the current catalog.
|
Test build #110155 has finished for PR 25601 at commit
|
| DeleteFromTable(aliased, delete.condition) | ||
|
|
||
| case ShowNamespacesStatement(None, pattern) => | ||
| defaultCatalog match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be currentCatalog instead. @cloud-fan, do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's implement switching the current catalog first, otherwise we are not able to test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 are you working on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I am working on USE NAMESPACE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should be able to send out the PR sometime tomorrow.
|
This looks good to me other than the behavior when the catalog is not included in the
It looks like these rules are not universally followed, so I've opened SPARK-29014 to track this clean-up. I'm okay merging this and fixing the catalog defaults in a follow-up. FYI: @cloud-fan, @brkyvz. |
|
@cloud-fan do you have any other comments other than what @rdblue brought up? |
|
Test build #110378 has finished for PR 25601 at commit
|
|
retest this please |
|
Test build #110393 has finished for PR 25601 at commit
|
|
retest this please |
|
Test build #110411 has finished for PR 25601 at commit
|
|
thanks, merging to master! |
### What changes were proposed in this pull request?
Implement the SHOW DATABASES logical and physical plans for data source v2 tables.
### Why are the changes needed?
To support `SHOW DATABASES` SQL commands for v2 tables.
### Does this PR introduce any user-facing change?
`spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set:
```
+---------------+
| namespace|
+---------------+
| ns1|
| ns1.ns1_1|
|ns1.ns1_1.ns1_2|
+---------------+
```
### How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes apache#25601 from imback82/show_databases.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Implement the SHOW DATABASES logical and physical plans for data source v2 tables.
Why are the changes needed?
To support
SHOW DATABASESSQL commands for v2 tables.Does this PR introduce any user-facing change?
spark.sql("SHOW DATABASES")will return namespaces if the default catalog is set:How was this patch tested?
Added unit tests to
DataSourceV2SQLSuite.