
Create a simple plugin system for loading data from external sources #141

Merged
jwills merged 19 commits into master from jwills_plugins on Apr 14, 2023

Conversation

jwills (Collaborator) commented Apr 8, 2023

Fixes #137 and addresses some longstanding roadmap items (especially making it easier to use dbt-duckdb with data in Iceberg/Delta tables).

The idea here is that there are certain kinds of external sources (e.g., Excel files, Google Sheets, or Iceberg tables) where we need to execute a little bit of custom Python code to transform the data the source contains into something that DuckDB can consume as a table (like a pandas/polars DataFrame, or a PyArrow table/dataset).

One way to include data from these sources in a dbt-duckdb project is to use Python models, but I've noted in my own work that this is fairly tedious to do because it involves:

  1. a lot of trivial, boilerplate Python that isn't much fun to write, and
  2. some tricky and/or non-obvious code to let the Python models access and operate on shared resources (e.g., a Google client or an Iceberg catalog), which requires configuration info that is really best kept inside the profiles.yml file that is designed for exactly that sort of thing.

The idea here is to create a plugin system for dbt-duckdb (which is, somewhat awkwardly, itself a plugin for dbt-core) that lets us define the code we need to extract data from these external systems in a way that makes them accessible to DuckDB and our downstream transformation logic. To support this, we use some special meta tags on the source (à la the external_location trick we use now for CSV/Parquet files) as arguments to the plugin's load function, which then returns a data object that DuckDB knows how to convert into a table for use by the rest of the pipeline.

You can think of these plugins as a way to support "elT" use cases (as opposed to dbt's standard "ELT" workloads, where you really do want a high-powered extract/load system at your disposal): cases where the extraction/loading work is simple and safe enough that a small Python script is all we want or need.
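To make the shape of this concrete, here is a minimal sketch of what a plugin might look like under this design. The load(source_config) signature and the SourceConfig helpers mirror the snippets reviewed below; the class name, import path, and registration details are assumptions, not the final API.

import pandas as pd
from dbt.adapters.duckdb.utils import SourceConfig  # import path assumed

class ExcelPlugin:
    # Called once per source table; whatever it returns (a pandas/polars
    # DataFrame or a PyArrow table/dataset) is handed to DuckDB as a table.
    def load(self, source_config: SourceConfig) -> pd.DataFrame:
        # meta entries on the source/table act as the plugin's arguments
        path = source_config.meta["external_location"]
        path = path.format(**source_config.as_dict())
        return pd.read_excel(path)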

jwills (Collaborator, Author) commented Apr 10, 2023

@JCZuurmond curious for your take here, as my inspiration for this. ;-)

raise Exception("Source config did not indicate a method to open a GSheet to read")

sheet = None
if "worksheeet" in source_config.meta:

maybe triple eee is a typo?

jwills (Collaborator, Author):

yay, thank you! What I get for not testing this plugin out yet ;-)

buremba commented Apr 10, 2023

I wonder how useful it would be to combine this with Singer taps to support a number of different sources out of the box. It seems feasible by mocking singer.write_schema and singer.write_records and converting the data to a DataFrame internally, but I didn't try it myself.
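A rough, untested sketch of that mocking idea (assuming the tap calls the module-level singer.write_schema/singer.write_records functions rather than importing them directly; the tap invocation itself is elided):

import pandas as pd
import singer

_records = []

def _capture_schema(stream_name, schema, key_properties, **kwargs):
    # The schema could be used to build column dtypes; ignored in this sketch.
    pass

def _capture_records(stream_name, records, **kwargs):
    _records.extend(records)

# Redirect the tap's output from stdout into an in-memory buffer.
singer.write_schema = _capture_schema
singer.write_records = _capture_records

# ... run the tap's sync here ...

df = pd.DataFrame(_records)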

jwills (Collaborator, Author) commented Apr 10, 2023

I wonder how useful it would be to combine it with Singer taps to support a number of different sources out of the box. It seems feasible by mocking singer.write_schema and singer.write_records and converting the data to DataFrame internally but I didn't try it myself.

Yeah, this is a great question, because I've been wondering how far I can push this pattern. Right now, I think these plugins run single-threaded as part of the graph-compilation phase of the dbt execution, so I didn't want to make it too easy to inject arbitrarily slow or complicated work here (e.g., pulling a lot of data from the GitHub API). That kind of thing really should be done externally to dbt-duckdb using one of the many excellent EL frameworks out there, which can run multiple threads/processes, restart from failure, be aware of things like API rate limits, etc.

JCZuurmond left a comment

@jwills : This is some great stuff! Very flexible and powerful.

assert df is not None
handle = self.handle()
cursor = handle.cursor()
cursor.execute(f"CREATE OR REPLACE TABLE {source_config.table_name()} AS SELECT * FROM df")


I love this DuckDB trick! Great way of generalising the approach
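For readers unfamiliar with the trick: DuckDB's Python replacement scans let a SQL statement refer to an in-scope Python object (a pandas/polars DataFrame or a PyArrow table) by its variable name, which is what lets the adapter materialize whatever a plugin's load() returns as a plain table. A standalone illustration (table and column names are made up):

import duckdb
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

con = duckdb.connect()
# "df" in the SQL below resolves to the local DataFrame via replacement scan
con.execute("CREATE OR REPLACE TABLE my_source AS SELECT * FROM df")
print(con.execute("SELECT count(*) FROM my_source").fetchone())  # (3,)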


def load(self, source_config: SourceConfig):
    ext_location = source_config.meta["external_location"]
    ext_location = ext_location.format(**source_config.as_dict())


This allows dynamically inserting values from the config using a format string. Could you give an example of how this is intended to be used? Does this, for example, contain the seed directory?

jwills (Collaborator, Author):

Yeah, so the idea here is that you can define these meta properties in one of two places: either a) on the meta tag for the top-level source, or b) on the meta tag for the individual tables listed underneath the source. A common pattern folks use when they have a bunch of external tables in slightly different locations is to define an f-string template as the external_location meta property on the top-level source and then use the properties of the individual tables to render the template into a specific file path. This issue demonstrates the pattern well:

#127

...and then there's an even more advanced version of the pattern here: #116
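A small, self-contained illustration of the rendering step (the bucket path and field names here are made up; in practice the values come from the table's SourceConfig):

# source-level meta might define:  external_location: "s3://my-bucket/raw/{name}.parquet"
template = "s3://my-bucket/raw/{name}.parquet"
table_fields = {"name": "customers", "schema": "raw"}  # stand-in for source_config.as_dict()
print(template.format(**table_fields))  # s3://my-bucket/raw/customers.parquet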

ext_location = source_config.meta["external_location"]
ext_location = ext_location.format(**source_config.as_dict())
source_location = pathlib.Path(ext_location.strip("'"))
return pd.read_excel(source_location)


Let's add fetching the sheet_name from the config
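One possible shape for that change, sketched against the load() shown above; the meta key name "sheet_name" is an assumption, and 0 (pandas' default, the first sheet) is used as the fallback:

import pathlib
import pandas as pd

def load(self, source_config):
    ext_location = source_config.meta["external_location"]
    ext_location = ext_location.format(**source_config.as_dict())
    source_location = pathlib.Path(ext_location.strip("'"))
    # assumed meta key; defaults to the first sheet if not provided
    sheet_name = source_config.meta.get("sheet_name", 0)
    return pd.read_excel(source_location, sheet_name=sheet_name)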

@@ -42,5 +46,16 @@ def submit_python_job(self, handle, parsed_model: dict, compiled_code: str) -> A
    handle.cursor().execute(json.dumps(payload))
    return AdapterResponse(_message="OK")

def get_binding_char(self) -> str:
    return "%s"
def load_source(self, plugin_name: str, source_config: utils.SourceConfig):


Could you explain what the buenavista class does?

jwills (Collaborator, Author):

So a common dev-workflow problem that folks run into using dbt-duckdb is that they want to be able to simultaneously update and run their dbt models in an IDE while querying the relations that the dbt run generates with a BI/DB query tool (DBeaver, Superset, etc.). DuckDB's execution model makes this pattern hard, because if any process holds the write lock on the underlying DB file, no other process is allowed to read it, so you end up with this awkwardness where you need to switch back and forth between dbt owning the file and the query tool owning the file.

Buena Vista started out as my attempt to solve this problem: it's a Python server that speaks the Postgres (and, more recently, Presto) protocols and takes ownership of the DuckDB file so that multiple processes can operate on it at the same time. To make BV work with dbt-duckdb, I created a notion of environments, so that dbt-duckdb could distinguish between cases where it should execute everything against the local Python process (i.e., how things work normally) and cases where it should execute the code against a remote Python process that speaks the Postgres protocol (i.e., BV). I suspect that in the not-too-distant future there will be more types of remote server environments that folks will want to execute their DuckDB queries against (an early example here: https://github.com/boilingdata/boilingdata-http-gw), and I want them to be able to use dbt-duckdb as their adapter without having to write their own. /cc @dforsber


Hey, sounds really good 👍🏻. We would be happy to integrate, create, or help write a plugin for BoilingData too.

@jwills jwills merged commit f665487 into master Apr 14, 2023
@jwills jwills deleted the jwills_plugins branch April 14, 2023 04:08
Successfully merging this pull request may close these issues: dbt-duckdb should know about Excel.

4 participants