feat: add duckdb as DataSource - Fixes #14563 #19317
needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)
update _time_grain_expressions
Codecov Report
@@ Coverage Diff @@
## master #19317 +/- ##
==========================================
- Coverage 66.53% 66.50% -0.04%
==========================================
Files 1667 1672 +5
Lines 64360 64564 +204
Branches 6493 6493
==========================================
+ Hits 42824 42936 +112
- Misses 19854 19946 +92
Partials 1682 1682
Except for the lint error, LGTM. BTW, I am impressed by DuckDB as a lightweight column-oriented database.
LGTM - a few non-blocking style related comments
except RuntimeError:
    # Catches the equivalent single-threading error from duckdb.
    alive = engine.dialect.do_ping(conn)
nit: could we have these in the same except:
except (sqlite3.ProgrammingError, RuntimeError):
    # SQLite can't run on a separate thread, so ``func_timeout`` fails
    # RuntimeError catches the equivalent single-threading error from duckdb.
    alive = engine.dialect.do_ping(conn)
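The suggestion above can be sketched in isolation. This is a minimal illustration, not the code in `TestConnectionDatabaseCommand.run`: `check_alive`, `timed_ping`, `direct_ping`, and `raises` are hypothetical names, with `timed_ping` standing in for a ping run under a timeout on another thread and `direct_ping` standing in for `engine.dialect.do_ping(conn)`.

```python
import sqlite3
from typing import Callable


def check_alive(timed_ping: Callable[[], bool], direct_ping: Callable[[], bool]) -> bool:
    """Try the timeout-wrapped ping; fall back to a direct ping on threading errors."""
    try:
        return timed_ping()
    except (sqlite3.ProgrammingError, RuntimeError):
        # One handler for both engines: SQLite raises ProgrammingError, and
        # (per the review comment) duckdb raises RuntimeError in the same situation.
        return direct_ping()


def raises(exc: Exception) -> Callable[[], bool]:
    """Hypothetical helper: build a ping that always fails with ``exc``."""
    def _ping() -> bool:
        raise exc
    return _ping


# Simulate the duckdb threading error and confirm the fallback is used.
print(check_alive(raises(RuntimeError("used from another thread")), lambda: True))
```

Merging the two handlers keeps a single fallback path, at the cost of catching every `RuntimeError`, which is why a narrower exception class from the dialect would be preferable.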
superset/db_engine_specs/duckdb.py
@classmethod
def get_table_names(
    cls, database: "Database", inspector: Inspector, schema: Optional[str]
nit: if we add from __future__ import annotations at the beginning of the file, we can remove the quotes. See an example here:
superset/superset/common/query_context.py, line 17 in b7ecb14:
from __future__ import annotations
Suggested change:
- cls, database: "Database", inspector: Inspector, schema: Optional[str]
+ cls, database: Database, inspector: Inspector, schema: Optional[str]
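A minimal, self-contained sketch of why the future import makes the quotes unnecessary. The classes here are simplified stand-ins, not Superset's real `Database` or engine spec: with postponed evaluation, annotations are stored as strings and never evaluated at definition time, so a name that is not yet defined can appear unquoted.

```python
from __future__ import annotations

from typing import Dict, List, Optional


class EngineSpec:
    @classmethod
    def get_table_names(cls, database: Database, schema: Optional[str]) -> List[str]:
        # ``Database`` is unquoted even though the class is defined further down.
        return sorted(database.tables.get(schema, []))


class Database:
    """Hypothetical stand-in holding table names per schema."""

    def __init__(self, tables: Dict[Optional[str], List[str]]) -> None:
        self.tables = tables


db = Database({"main": ["orders", "customers"]})
print(EngineSpec.get_table_names(db, "main"))
```

Without the future import, the `database: Database` annotation would raise `NameError` when `EngineSpec` is defined, which is what the quoted form works around.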
Thanks for the review @villebro @zhaoyongjie - I'll incorporate your feedback and get the linter passing.
OK - I am about 90% confident that I am running the linter properly on my local and that all the tests will pass now - looks like I cannot trigger the CI myself, but I think if someone kicks that off we'll see the build go green.
@rwhaling, it looks like CI is waiting for other tasks to finish. When CI is all green, I will merge it. Thanks for following up!
> Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in

I had one small comment in case it is helpful! DuckDB is indeed embedded in your local process, but it is multi-threaded and can use as many CPU cores as you would like. Thanks for building this connector!
* + duckdb support: needs the forked version of [duckdb-engine](https://github.com/alitrack/duckdb_engine)
* Update duckdb.py: update _time_grain_expressions
* removed superfluous get_all_datasource_names def in duckdb engine spec
* added exception handling for duckdb single-threaded RuntimeError
* fixed linter blips and other stylistic cleanup in duckdb.py
* one last round of linter tweaks in test_connection.py for duckdb support
Co-authored-by: Steven Lee <admin@alitrack.com>
Co-authored-by: Richard Whaling <richardwhaling@Richards-MacBook-Pro.local>
(cherry picked from commit 202e34a)
Hi, testing this and Superset doesn't work when installing
Here's my full Dockerfile
@inakianduaga heya, that's on me - unfortunately sqlalchemy doesn't really document what's expected from a dialect, so it's hard to keep up with the changes on their side. I'll push out a fixed version shortly
@inakianduaga it should work now if you try with duckdb_engine 0.1.11, sorry about that 😅
ok thanks. It actually worked for me by going to the fixed |
I want to access Parquet files on S3 from Superset's SQL Editor. I can do this with DuckDB in my Python shell, but I am wondering how to do it with Superset and/or duckdb-engine. My Python snippet to load Parquet files from S3:
import duckdb
cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id='**************'")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet')")
@Mause Could you please help with loading S3 parquet files with your engine?
if you're having an issue with
PS. @zhaoyongjie any chance you could lock this conversation?
SUMMARY
Adds Duckdb as an embedded, in-process OLAP db engine.
Duckdb can directly query CSV or Parquet files on disk - eventually, we should be able to query Parquet files directly on S3 as well.
Supersedes #19265
Relying on https://github.com/Mause/duckdb_engine for the SQLAlchemy implementation, and building on top of @alitrack's original work.
Since DuckDB is, much like SQLite, an in-process, single-threaded engine, the error handling in TestConnectionDatabaseCommand.run feels a bit weird. Might want to work with @Mause on a narrow exception class for the threading blip.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
pip install duckdb-engine
duckdb:////Users/whoever/path/to/duck.db
duckdb:///:memory: seems to work?
SELECT * from 'test.parquet' etc.
ADDITIONAL INFORMATION
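The two connection URIs in the testing instructions follow the same shape as SQLAlchemy's SQLite URLs: `duckdb:///` followed by the database path, so an absolute path produces four slashes. A small sketch of that rule, where `duckdb_uri` is a hypothetical helper and not part of Superset or duckdb-engine:

```python
def duckdb_uri(path: str = ":memory:") -> str:
    """Build a SQLAlchemy-style DuckDB URI from a file path (hypothetical helper).

    The scheme contributes ``duckdb:///``; an absolute path adds its own
    leading slash, which is where the fourth slash comes from.
    """
    return f"duckdb:///{path}"


print(duckdb_uri("/Users/whoever/path/to/duck.db"))
print(duckdb_uri())
```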