Create a simple plugin system for loading data from external sources #141
Conversation
@JCZuurmond curious for your take here, as my inspiration for this. ;-)
raise Exception("Source config did not indicate a method to open a GSheet to read") | ||
|
||
sheet = None | ||
if "worksheeet" in source_config.meta: |
maybe the triple eee is a typo?
yay, thank you! What I get for not testing this plugin out yet ;-)
I wonder how useful it would be to combine it with Singer taps to support a number of different sources out of the box. It seems feasible by mocking
Yeah, this is a great q b/c I've been wondering about how far I can push this pattern. Right now, I think these plugins run single-threaded as part of the graph-compilation piece of the dbt execution, so I didn't want to make it too easy to inject things that are arbitrarily slow/complicated here (e.g., pulling in a lot of data from the GH API). That kind of work really should be done externally to dbt-duckdb using one of the many excellent EL frameworks out there, which can e.g. run multiple threads/processes, restart from failure, be aware of things like API rate limits, etc., etc.
@jwills: This is some great stuff! Very flexible and powerful.
dbt/adapters/duckdb/environments.py
Outdated
assert df is not None
handle = self.handle()
cursor = handle.cursor()
cursor.execute(f"CREATE OR REPLACE TABLE {source_config.table_name()} AS SELECT * FROM df")
I love this DuckDB trick! Great way of generalising the approach
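For readers who haven't seen it: the DuckDB Python client resolves unknown table names against DataFrames that are in scope (its "replacement scan" mechanism), which is what makes SELECT * FROM df work above. A minimal standalone sketch, with a made-up table name and data:

import duckdb
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

con = duckdb.connect()  # in-memory database
# DuckDB resolves "df" in the SQL below via a replacement scan over the local Python variable
con.execute("CREATE OR REPLACE TABLE my_source AS SELECT * FROM df")
print(con.execute("SELECT COUNT(*) FROM my_source").fetchone())  # (3,)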
def load(self, source_config: SourceConfig):
    ext_location = source_config.meta["external_location"]
    ext_location = ext_location.format(**source_config.as_dict())
This allows dynamically inserting values from the config using a format string. Could you give an example of how this is intended to be used? Does this, for example, contain the seed directory?
Yeah, so the idea here is that you can define these meta properties in one of two places: either a) on the meta tag for the top-level source, or b) on the meta tag for the individual tables listed underneath the source. So a common pattern folks use when they have a bunch of external tables in slightly different locations is to define an f-string template external_location meta property on the top-level source and then use the properties of the individual tables to render the template into a specific file. This issue demonstrates the pattern well:
...and then there's an even more advanced version of the pattern here: #116
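For concreteness, the rendering step itself is just Python string formatting over the table's properties; the bucket path and table name below are made up for illustration:

# source-level meta might define a template like this...
external_location = "s3://my-bucket/{name}/*.parquet"

# ...and each table's properties (what source_config.as_dict() returns) fill it in
table_properties = {"name": "orders"}

print(external_location.format(**table_properties))
# s3://my-bucket/orders/*.parquet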
dbt/adapters/duckdb/plugins/excel.py
Outdated
ext_location = source_config.meta["external_location"]
ext_location = ext_location.format(**source_config.as_dict())
source_location = pathlib.Path(ext_location.strip("'"))
return pd.read_excel(source_location)
Let's add fetching the sheet_name from the config.
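Something along these lines, perhaps, treating sheet_name as an optional meta key (the key name is a suggestion, not something the plugin defines yet):

import pathlib
import pandas as pd

def load(self, source_config):
    ext_location = source_config.meta["external_location"]
    ext_location = ext_location.format(**source_config.as_dict())
    source_location = pathlib.Path(ext_location.strip("'"))
    # sheet_name is a hypothetical meta key here; fall back to the first sheet
    sheet_name = source_config.meta.get("sheet_name", 0)
    return pd.read_excel(source_location, sheet_name=sheet_name)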
@@ -42,5 +46,16 @@ def submit_python_job(self, handle, parsed_model: dict, compiled_code: str) -> A
        handle.cursor().execute(json.dumps(payload))
        return AdapterResponse(_message="OK")

    def get_binding_char(self) -> str:
        return "%s"

    def load_source(self, plugin_name: str, source_config: utils.SourceConfig):
Could you explain what the buenavista class does?
So a common dev-workflow problem that folks run into with dbt-duckdb is that they want to be able to simultaneously update and run their dbt models in an IDE while querying the relations that the dbt run generates using a BI/DB query tool (DBeaver, Superset, etc.). DuckDB's execution model makes this pattern hard to do b/c if any process has the write lock on the underlying DB file, then no other process is allowed to read the file, so you end up with this awkwardness where you need to switch back and forth between dbt owning the file and the query tool owning the file.
Buena Vista started out as my attempt to solve this problem: it's a Python server that speaks the Postgres (and, more recently, Presto) protocols and takes ownership of the DuckDB file so that multiple processes can operate on it at the same time. To make BV work with dbt-duckdb, I created a notion of environments, so that dbt-duckdb could distinguish between cases where it should execute everything against the local Python process (i.e., how things work normally) and cases where it should execute the code against a remote Python process that speaks the Postgres protocol (i.e., BV). I suspect that in the not-too-distant future there will be more types of remote server environments that folks will want to execute their DuckDB queries against (an early example: https://github.com/boilingdata/boilingdata-http-gw), and I want them to be able to use dbt-duckdb as their adapter without having to write their own. /cc @dforsber
Hey, sounds really good 👍🏻. We would be happy to integrate/create/help write a plugin for BoilingData too.
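To make the local-vs-remote split a little more concrete, here is a rough sketch of the shape of the "environment" idea described above; the class and method names are illustrative, not the adapter's actual interfaces:

import abc

import duckdb

class Environment(abc.ABC):
    """Something that can run SQL for the adapter: an in-process DuckDB
    connection, or a remote server (e.g. Buena Vista) that owns the DB file."""

    @abc.abstractmethod
    def execute(self, sql: str):
        ...

class LocalEnvironment(Environment):
    def __init__(self, path: str = ":memory:"):
        self.con = duckdb.connect(path)

    def execute(self, sql: str):
        return self.con.execute(sql).fetchall()

# A remote environment would instead implement execute() by sending the SQL over
# the Postgres protocol (e.g. with psycopg) to the server that holds the DuckDB file.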
Fixes #137 and addresses some longstanding roadmap items (esp. to make it easier to use dbt-duckdb with data in Iceberg/delta tables.)
The idea here is that there are certain kinds of external sources (e.g., Excel files, Google Sheets, or Iceberg tables), where we need to execute a little bit of custom Python code to transform the data that the source contains into something that DuckDB can consume as a table (like a pandas/polars DataFrame, or a PyArrow table/dataset.)
One way to include data from these sources in a dbt-duckdb project is to use Python models, but I've noted in my own work that this is fairly tedious to do, because among other things it means putting configuration into model code instead of the profiles.yml file that is designed for exactly that sort of thing.

The idea here is to create a plugin system for dbt-duckdb (which is, somewhat awkwardly, itself a plugin for dbt-core) which lets us define the code we need to extract data from these external systems in a way that makes them accessible to DuckDB and our downstream transformation logic. To support this, we use some special meta tags on the source (a la the external_location trick we use now for CSV/Parquet files) as arguments to the plugin's load function, which then returns a data object that DuckDB knows how to convert into a table for use by the rest of the pipeline.

You can think of these plugins as a way to support "elT" use cases (as opposed to dbt's standard "ELT" workloads, where you really do want a high-powered extracting/loading system at your disposal), where the extraction/loading work is simple and safe enough that a small Python script is all we want/need.
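For illustration, a plugin under this proposal might look roughly like the following; the class name, meta keys, and method shape are a sketch of the idea rather than the exact interface in this PR:

import pandas as pd

class CSVPlugin:
    # Toy example: turn a source's meta config into something DuckDB can read as a table.

    def load(self, source_config):
        # meta properties defined on the source (or its tables) become the plugin's arguments
        location = source_config.meta["external_location"]
        location = location.format(**source_config.as_dict())
        # return a DataFrame (or an Arrow table); dbt-duckdb registers it as the source's table
        return pd.read_csv(location)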