Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to support SQL Standard Catalogs? #281

Closed
superdupershant opened this issue Feb 1, 2022 · 2 comments
Closed

How to support SQL Standard Catalogs? #281

superdupershant opened this issue Feb 1, 2022 · 2 comments
Labels
enhancement New feature or request Stale

Comments

@superdupershant
Copy link

Spark is switching over to a three level catalog hierarchy based on the official ISO/ANSI SQL Standard. This means, instead of
db_name.table_name there will be an additional layer above of organization above what currently exists. This introduces a couple changes:

  • While SCHEMA had long been an alias for the the term DATABASE we'll start to officially use SCHEMA as the official term.
  • The top level namespace will be a collection of SCHEMAs called a CATALOG. So a fully qualified table name becomes catalog_name.schema_name.table_name.
  • Like the current database there will be a default catalog that users connect to and they'll be able to switch between catalogs by running USE CATALOG foo.
  • There will be certain reserved schema names that exist in all catalogs, the most common example being INFORMATION_SCHEMA
==============
= Catalog(s) =
==============
     |
=============
= Schema(s) =
=============
     |
=============================
= Tables(s) or Functions(s) =
=============================

How should we support the catalog namespace in dbt?

There exists a connection configuration option named database that acts similarly to the SQL Standard's definition of a catalog. Historically though the term database was synonymous with schema, and as often been a source of confusion. Some potential options:

  1. Support a new config option catalog that specifies the CATALOG to use for the project, leave the database option currently as is functioning as an alias for schema.
  2. Don't add a catalog config option and just use the value in database to refer to a catalog within the system.
  3. Support a new config option catalog and make database an alias for catalog.

With option (3) above we would want some intermediate period that warns anyone using the database config option without specifying a schema config that its meaning is in the process of changing.

@superdupershant superdupershant added enhancement New feature or request triage labels Feb 1, 2022
@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 1, 2022

@superdupershant Love it! Thanks for opening the issue, and for your work advocating for this change.

Option (3) makes the most sense to me. With a minor inversion: We should make catalog an alias for database, so that users can think in terms of catalogs and dbt's internals can think in terms of databases. This will represent a much lower lift.

Questions from me:

  • What's the anticipated version of Spark that will include this functionality? What's the best way to ascertain which version of Spark someone is using? Should we offer an additional profile config (three_layer_namespace: true), or just assume that if the user specifies catalog/database, they know what they're doing?
  • Is there any functionality (or any limitations) that will be unique to catalogs? I assume it will be possible to query "across" catalogs, in the same way users can query "across" schemas today
  • What will be the implications for metadata queries? We know that the query dbt uses today for caching + cataloging performs poorly at scale, and we've been discussing alternatives ([CT-202] Workaround for some limitations due to list_relations_without_caching method #228). This is a place where I'd especially appreciate your team's input.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 1, 2022
@github-actions github-actions bot closed this as completed Aug 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Stale
Projects
None yet
Development

No branches or pull requests

2 participants