Improved support for "User Defined Catalogs" #5291

Open
1 of 2 tasks
alamb opened this issue Feb 15, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Feb 15, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I think it is currently a bit confusing how to use DataFusion with a custom catalog.

Background

DataFusion is primarily a query engine, rather than a complete database system that also must handle persistence, catalog management, ingest, data lifecycle management, and other things.

Systems like Ballista or GreptimeDB are examples of complete systems that use DataFusion for query but have their own catalog implementations.

However, in order to function, the query engine needs to read catalog information, and DataFusion provides a rich set of APIs for this, such as the following:

The query engine also knows how to plan catalog manipulations, which often need planner support (e.g. for type checking or coercion).

Making things even more confusing, DataFusion does have a basic, ephemeral, in-memory catalog implementation, https://docs.rs/datafusion/18.0.0/datafusion/catalog/catalog/struct.MemoryCatalogList.html, and the methods on SessionContext know how to modify that in-memory catalog.

Challenges

The boundary between the built-in catalog support and the interface for plugging in an external catalog is not very clear. For example, see PR #5277.

Also, as projects like #5130 get under way, it becomes even more important to distinguish between catalog manipulation and read-only catalog access.

Another example is the fact that SessionContext::sql by default modifies the in-memory catalog:

https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.sql

Note: This api implements DDL such as CREATE TABLE and CREATE VIEW with in memory default implementations.

If this is not desirable, consider using SessionState::create_logical_plan() which does not mutate the state based on such statements.
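
To illustrate the distinction in that note, here is a toy sketch in plain Rust. It is not DataFusion code: `ToySession`, `Plan`, `plan_only`, and `plan_and_apply` are all hypothetical names made up for this example, and the "parser" is just a prefix check. It only mirrors the idea that one entry point plans a DDL statement without touching the catalog while the other also applies it.

```rust
use std::collections::BTreeSet;

/// Hypothetical, heavily simplified stand-in for a logical plan.
#[derive(Debug, PartialEq)]
enum Plan {
    CreateTable(String),
    Query(String),
}

/// Hypothetical session whose "catalog" is just a set of table names.
#[derive(Default)]
struct ToySession {
    tables: BTreeSet<String>,
}

impl ToySession {
    /// Plan only: takes `&self`, so it cannot modify the catalog.
    fn plan_only(&self, sql: &str) -> Plan {
        match sql.trim().strip_prefix("CREATE TABLE ") {
            Some(name) => Plan::CreateTable(name.trim().to_string()),
            None => Plan::Query(sql.trim().to_string()),
        }
    }

    /// Plan and apply: DDL statements are executed against the in-memory catalog.
    fn plan_and_apply(&mut self, sql: &str) -> Plan {
        let plan = self.plan_only(sql);
        if let Plan::CreateTable(name) = &plan {
            self.tables.insert(name.clone());
        }
        plan
    }
}

fn main() {
    let mut session = ToySession::default();

    // Planning alone leaves the catalog untouched.
    session.plan_only("CREATE TABLE t");
    assert!(session.tables.is_empty());

    // Planning and applying registers the table.
    session.plan_and_apply("CREATE TABLE t");
    assert!(session.tables.contains("t"));
}
```

A system with its own catalog would presumably want only the first behavior from the query engine and would route the resulting DDL plan to its own catalog implementation.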

Describe the solution you'd like

I would like a clearer interface (or maybe just documentation) that makes it clear which manipulations are allowed and which are not, as well as an example that other people could follow to implement an external catalog. This interface should make it clear what the catalog supports and what it does not (e.g. does it allow creating new tables or views?).
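
One possible shape for such an interface, as a rough sketch only: split read access from manipulation, and have manipulation default to "not supported" so a catalog has to opt in. Everything here (`CatalogReader`, `CatalogWriter`, `CatalogError`, and both catalog structs) is a hypothetical name for illustration and is not DataFusion's actual API.

```rust
use std::collections::HashMap;

/// Hypothetical error returned for operations a catalog does not support.
#[derive(Debug, PartialEq)]
enum CatalogError {
    NotSupported(&'static str),
}

/// Hypothetical read-only interface: what a planner needs to resolve tables.
trait CatalogReader {
    fn table_columns(&self, name: &str) -> Option<Vec<String>>;
}

/// Hypothetical manipulation interface. The default rejects DDL,
/// so a catalog must explicitly opt in to supporting it.
trait CatalogWriter: CatalogReader {
    fn create_table(&mut self, _name: &str, _columns: Vec<String>) -> Result<(), CatalogError> {
        Err(CatalogError::NotSupported("CREATE TABLE"))
    }
}

/// An externally managed catalog that the query engine may only read.
#[derive(Default)]
struct ReadOnlyCatalog {
    tables: HashMap<String, Vec<String>>,
}

impl CatalogReader for ReadOnlyCatalog {
    fn table_columns(&self, name: &str) -> Option<Vec<String>> {
        self.tables.get(name).cloned()
    }
}

// Keeps the default, so CREATE TABLE is rejected.
impl CatalogWriter for ReadOnlyCatalog {}

/// An ephemeral in-memory catalog that accepts DDL.
#[derive(Default)]
struct InMemoryCatalog {
    tables: HashMap<String, Vec<String>>,
}

impl CatalogReader for InMemoryCatalog {
    fn table_columns(&self, name: &str) -> Option<Vec<String>> {
        self.tables.get(name).cloned()
    }
}

impl CatalogWriter for InMemoryCatalog {
    fn create_table(&mut self, name: &str, columns: Vec<String>) -> Result<(), CatalogError> {
        self.tables.insert(name.to_string(), columns);
        Ok(())
    }
}

fn main() {
    let mut external = ReadOnlyCatalog::default();
    let mut memory = InMemoryCatalog::default();
    let cols = vec!["id".to_string()];

    assert_eq!(
        external.create_table("t", cols.clone()),
        Err(CatalogError::NotSupported("CREATE TABLE"))
    );
    assert_eq!(memory.create_table("t", cols.clone()), Ok(()));
    assert_eq!(memory.table_columns("t"), Some(cols));
}
```

With something along these lines, the answer to "does this catalog allow creating tables?" would live in the catalog implementation rather than in the session methods.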

To do this, I suggest:

This project might also help

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
N/A

@jaylmiller
Contributor

I'm actually currently working on figuring out the catalog API and implementing a catalog for my own project. I would be happy to adapt some of my code into an example.

@alamb
Contributor Author

alamb commented Feb 17, 2023

That would be awesome @jaylmiller -- thank you very much

alamb pushed a commit that referenced this issue Feb 26, 2023
* catalog example

* add license and example description at top of file

* ddl example

* comment

* cleanup extra code

* clippy

* remove clippy ignore stmt

* better comment