Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persistent file based catalog #122

Open
alamb opened this issue Sep 9, 2024 · 5 comments
Open

Persistent file based catalog #122

alamb opened this issue Sep 9, 2024 · 5 comments

Comments

@alamb
Copy link
Contributor

alamb commented Sep 9, 2024

This is my own personal aspirations / goals for a "file based catalog"

Usecase

Usecase 1: pre-configured EXTERNAL TABLES

I would like to be able to setup some table definitions in dft and then reuse them from session to session

For example

CREATE EXTERNAL TABLE ... STORED AS DELTA TABLE WITH CREDENTIALS ....

And then have this configuration available to any dft session

I believe this usecase is already partly handled by the config file feature. However, there are some other things I would like:

  1. Issue SQL commands to create these tables in a session and then have those tables available in the session without having to copy/paste the DDL into a new file
  2. Store data that is not DDL (such as parquet metadata fetched from a remote object store to save having to access it again)

Usecase 1: ephemeral data

Today when you run queries like this in dft

CREATE TABLE foo(x int);
INSERT INTO foo VALUES (1);
SELECT * FROM foo;

If you start another session of dft foo is gone:

select * from foo;   
│Error during planning: table 'datafusion.public.foo' not found

The issue is that the default catalog in datafusion is an ephemeral file based one so there is no place to store data such as shown above.

Desired Behavior

What I would like is for dft to operate similarly to sqlite or duckdb.

By default, an ephemeral in memory catalog is used and nothing is saved after the session quits

However a database file can be "opened" and if so then all changes made to the catalog are stored in that file. If the file is reopened on a subsequent invocation of the program all the DDL / catalog information is still present

Something like

.open mycatalog.dat
-- foo is now saves into mycatalog.dat
CREATE TABLE foo(x int)
@alamb alamb changed the title Persistent file based catalog aspirations Persistent file based catalo Sep 9, 2024
@alamb alamb changed the title Persistent file based catalo Persistent file based catalog Sep 9, 2024
@matthewmturner
Copy link
Collaborator

Regarding the point on storing data - i wonder if the CacheManager could be extended / used to serialize the files / file_stats caches.

@alamb
Copy link
Contributor Author

alamb commented Sep 10, 2024

Regarding the point on storing data - i wonder if the CacheManager could be extended / used to serialize the files / file_stats caches.

That is an excellent idea -- it would be really sweet to have an excuse to work on that API (I bet it is not used anywhere near as much as it could be)

@alamb
Copy link
Contributor Author

alamb commented Sep 12, 2024

I have subsequently learned about the dft config file system that has certain overlap with what is described above

However, one difference is that the config file basically applies to all sessions, where this catalog would be very explicitly selected by a user when running

Thus it seem like the config file would be a great place to put credentials, for example, that you wanted to apply to all catalogs / sessions

@matthewmturner
Copy link
Collaborator

There are two different setups that I think are relevant to this feature:

  • The config which is for app configuration options (everything from s3 credentials, flightsql host details, to rendering (i.e. FPS), and I have several ideas for additional configuration options. As a starting point, I think we need to document the all the parameters that are available to be set in the config. I will create a ticket for this.
  • There is a the datafusionrc file which is simply a SQL file at ~/.datafusion/.datafusionrc that is meant to have DDL that users would want available at start up.

For example, this is mine:

CREATE SCHEMA staging;
CREATE SCHEMA production;

CREATE EXTERNAL TABLE staging.min_aggs STORED AS PARQUET LOCATION 'ny2://sip/aggregates/minute_by_ticker_monthly_v2/year_month=2015-08/data.01.parquet';
CREATE EXTERNAL TABLE staging.trades_v2 STORED AS PARQUET LOCATION 'ny2://atlas/sip/trades_v2/date=2024-08-30/2024-08-30.01.parquet';

CREATE EXTERNAL TABLE "production".min_aggs STORED AS PARQUET LOCATION 'ny5://sip/aggregates/minute_by_ticker_monthly/year_month=2015-08/data.01.parquet';

CREATE TABLE wide AS VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);

I think the setup for this could be improved (its still based on the old setup from a couple years ago). I think moving the file to ~/.config/dft (so all config in same location) and renaming to something like ddl.sql in that directory would be better.

One thing to note, I have been having trouble getting the custom catalog / schemas to work (created an issue for it. Maybe just a user error, but i havent had time to look too deeply into it.

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

I agree that there is an importatnt distinction between "configuration" (with e.g. credentials) that potentially apply to all sessions and the DDL/catalog setup part

I think some systems permit creating credentials that are stored as part of the catalog (CREATE CREDENTIALS ....). This kind of makes sense if you think about the catalog / table definitions as application configuration. However this seems like a substantial security risk to me as well. 🤔

I'll have to mess with it to see what is happening

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants