
Create a simple plugin system for loading data from external sources #141

Merged
jwills merged 19 commits into master from jwills_plugins on Apr 14, 2023

Conversation

jwills (Collaborator) commented Apr 8, 2023

Fixes #137 and addresses some longstanding roadmap items (especially making it easier to use dbt-duckdb with data in Iceberg/Delta tables).

The idea here is that there are certain kinds of external sources (e.g., Excel files, Google Sheets, or Iceberg tables) where we need to execute a little bit of custom Python code to transform the data the source contains into something that DuckDB can consume as a table (like a pandas/polars DataFrame, or a PyArrow table/dataset).

One way to include data from these sources in a dbt-duckdb project is to use Python models, but I've noted in my own work that this is fairly tedious to do because it involves:

  1. a lot of trivial, boilerplate Python that isn't much fun to write, and
  2. some tricky and/or non-obvious code to let the Python models access and operate on shared resources (e.g., a Google client or an Iceberg catalog), which requires configuration info that is really best kept inside the profiles.yml file that is designed for exactly that sort of thing.

The idea here is to create a plugin system for dbt-duckdb (which is, somewhat awkwardly, itself a plugin for dbt-core) that lets us define the code we need to extract data from these external systems in a way that makes them accessible to DuckDB and our downstream transformation logic. To support this, we use some special meta tags on the source (à la the external_location trick we use now for CSV/Parquet files) as arguments to the plugin's load function, which then returns a data object that DuckDB knows how to convert into a table for use by the rest of the pipeline.

You can think of these plugins as a way to support "elT" use cases (as opposed to dbt's standard "ELT" workloads, where you really do want a high-powered extract/load system at your disposal): cases where the extraction/loading work is simple and safe enough that a small Python script is all we want or need.
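To make the shape of this concrete, here is a minimal sketch of what a plugin might look like under this design. The load(source_config) signature and the SourceConfig helpers mirror the snippets reviewed below; the class name, import path, and registration details are assumptions, not the final API.

import pandas as pd
from dbt.adapters.duckdb.utils import SourceConfig  # import path assumed

class ExcelPlugin:
    # Called once per source table; whatever it returns (a pandas/polars
    # DataFrame or a PyArrow table/dataset) is handed to DuckDB as a table.
    def load(self, source_config: SourceConfig) -> pd.DataFrame:
        # meta entries on the source/table act as the plugin's arguments
        path = source_config.meta["external_location"]
        path = path.format(**source_config.as_dict())
        return pd.read_excel(path)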

jwills (Collaborator, Author) commented Apr 10, 2023

@JCZuurmond curious for your take here, as my inspiration for this. ;-)

raise Exception("Source config did not indicate a method to open a GSheet to read")

sheet = None
if "worksheeet" in source_config.meta:

maybe triple eee is a typo?

jwills (Collaborator, Author):

yay, thank you! What I get for not testing this plugin out yet ;-)

buremba commented Apr 10, 2023

I wonder how useful it would be to combine this with Singer taps to support a number of different sources out of the box. It seems feasible by mocking singer.write_schema and singer.write_records and converting the data to a DataFrame internally, but I didn't try it myself.
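A rough, untested sketch of that mocking idea (assuming the tap calls the module-level singer.write_schema/singer.write_records functions rather than importing them directly; the tap invocation itself is elided):

import pandas as pd
import singer

_records = []

def _capture_schema(stream_name, schema, key_properties, **kwargs):
    # The schema could be used to build column dtypes; ignored in this sketch.
    pass

def _capture_records(stream_name, records, **kwargs):
    _records.extend(records)

# Redirect the tap's output from stdout into an in-memory buffer.
singer.write_schema = _capture_schema
singer.write_records = _capture_records

# ... run the tap's sync here ...

df = pd.DataFrame(_records)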

jwills (Collaborator, Author) commented Apr 10, 2023

I wonder how useful it would be to combine it with Singer taps to support a number of different sources out of the box. It seems feasible by mocking singer.write_schema and singer.write_records and converting the data to DataFrame internally but I didn't try it myself.

Yeah, this is a great question, because I've been wondering how far I can push this pattern. Right now, I think these plugins run single-threaded as part of the graph-compilation phase of the dbt execution, so I didn't want to make it too easy to inject arbitrarily slow or complicated work here (e.g., pulling a lot of data from the GitHub API). That kind of thing really should be done externally to dbt-duckdb using one of the many excellent EL frameworks out there, which can run multiple threads/processes, restart from failure, be aware of things like API rate limits, etc.

JCZuurmond left a comment

@jwills : This is some great stuff! Very flexible and powerful.

assert df is not None
handle = self.handle()
cursor = handle.cursor()
cursor.execute(f"CREATE OR REPLACE TABLE {source_config.table_name()} AS SELECT * FROM df")


I love this DuckDB trick! Great way of generalising the approach
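For readers unfamiliar with the trick: DuckDB's Python replacement scans let a SQL statement refer to an in-scope Python object (a pandas/polars DataFrame or a PyArrow table) by its variable name, which is what lets the adapter materialize whatever a plugin's load() returns as a plain table. A standalone illustration (table and column names are made up):

import duckdb
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

con = duckdb.connect()
# "df" in the SQL below resolves to the local DataFrame via replacement scan
con.execute("CREATE OR REPLACE TABLE my_source AS SELECT * FROM df")
print(con.execute("SELECT count(*) FROM my_source").fetchone())  # (3,)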


def load(self, source_config: SourceConfig):
    ext_location = source_config.meta["external_location"]
    ext_location = ext_location.format(**source_config.as_dict())


This allows dynamically inserting values from the config using a format string. Could you give an example of how this is intended to be used? Does this, for example, contain the seed directory?

jwills (Collaborator, Author):

Yeah, so the idea here is that you can define these meta properties in one of two places: either a) on the meta tag for the top-level source, or b) on the meta tag for the individual tables listed underneath the source. A common pattern folks use when they have a bunch of external tables in slightly different locations is to define an f-string template as the external_location meta property on the top-level source and then use the properties of the individual tables to render the template into a specific file path. This issue demonstrates the pattern well:

#127

...and then there's an even more advanced version of the pattern here: #116
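A small, self-contained illustration of the rendering step (the bucket path and field names here are made up; in practice the values come from the table's SourceConfig):

# source-level meta might define:  external_location: "s3://my-bucket/raw/{name}.parquet"
template = "s3://my-bucket/raw/{name}.parquet"
table_fields = {"name": "customers", "schema": "raw"}  # stand-in for source_config.as_dict()
print(template.format(**table_fields))  # s3://my-bucket/raw/customers.parquet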

ext_location = source_config.meta["external_location"]
ext_location = ext_location.format(**source_config.as_dict())
source_location = pathlib.Path(ext_location.strip("'"))
return pd.read_excel(source_location)


Let's add fetching the sheet_name from the config
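One possible shape for that change, sketched against the load() shown above; the meta key name "sheet_name" is an assumption, and 0 (pandas' default, the first sheet) is used as the fallback:

import pathlib
import pandas as pd

def load(self, source_config):
    ext_location = source_config.meta["external_location"]
    ext_location = ext_location.format(**source_config.as_dict())
    source_location = pathlib.Path(ext_location.strip("'"))
    # assumed meta key; defaults to the first sheet if not provided
    sheet_name = source_config.meta.get("sheet_name", 0)
    return pd.read_excel(source_location, sheet_name=sheet_name)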

@@ -42,5 +46,16 @@ def submit_python_job(self, handle, parsed_model: dict, compiled_code: str) -> A
    handle.cursor().execute(json.dumps(payload))
    return AdapterResponse(_message="OK")

def get_binding_char(self) -> str:
    return "%s"
def load_source(self, plugin_name: str, source_config: utils.SourceConfig):


Could you explain what the buenavista class does?

jwills (Collaborator, Author):

So a common dev-workflow problem that folks run into using dbt-duckdb is that they want to be able to simultaneously update and run their dbt models in an IDE while querying the relations that the dbt run generates with a BI/DB query tool (DBeaver, Superset, etc.). DuckDB's execution model makes this pattern hard, because if any process holds the write lock on the underlying DB file, no other process is allowed to read it, so you end up with this awkwardness where you need to switch back and forth between dbt owning the file and the query tool owning the file.

Buena Vista started out as my attempt to solve this problem: it's a Python server that speaks the Postgres (and, more recently, Presto) protocols and takes ownership of the DuckDB file so that multiple processes can operate on it at the same time. To make BV work with dbt-duckdb, I created a notion of environments, so that dbt-duckdb could distinguish between cases where it should execute everything against the local Python process (i.e., how things work normally) and cases where it should execute the code against a remote Python process that speaks the Postgres protocol (i.e., BV). I suspect that in the not-too-distant future there will be more types of remote server environments that folks will want to execute their DuckDB queries against (an early example here: https://github.com/boilingdata/boilingdata-http-gw), and I want them to be able to use dbt-duckdb as their adapter without having to write their own. /cc @dforsber


Hey, sounds really good 👍🏻. We would be happy to integrate, create, or help write a plugin for BoilingData too.

@jwills jwills merged commit f665487 into master Apr 14, 2023
@jwills jwills deleted the jwills_plugins branch April 14, 2023 04:08
Successfully merging this pull request may close these issues: dbt-duckdb should know about Excel.

4 participants