Replies: 1 comment 3 replies
I work on the fal tool, and I think this is great work toward letting other tools build on top of dbt. Really appreciate the effort! I have a question regarding the decision not to expose the DAG:
What this means is that someone using the SDK would not be able to modify the DAG to be run, right? dbt would just have the dbt project itself (the refs inside a model, etc.) as the reference for how to build said DAG. If so, I would agree this is the right approach. I wonder about adding "dynamic" nodes to the DAG, perhaps just by adding models with refs (and adding extra refs to existing models) and then letting dbt build and run the resulting DAG.
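For concreteness, here is a rough sketch of what I mean by refs defining the DAG (the model and column names are made up, and this assumes a dbt Python model on an adapter that supports them): the `dbt.ref()` calls are what declare the node's dependencies, dbt derives the graph from them at parse time, and there is no imperative API for adding edges.

```python
# models/enriched_orders.py — hypothetical dbt Python model.
# Dependencies are declared via dbt.ref(); dbt parses these calls and
# builds the DAG from them, rather than exposing the graph for mutation.
def model(dbt, session):
    orders = dbt.ref("stg_orders")        # adds edge: stg_orders -> enriched_orders
    customers = dbt.ref("stg_customers")  # adds edge: stg_customers -> enriched_orders
    return orders.join(customers, orders["customer_id"] == customers["id"])
```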
Hi all,
Historically, `dbt-core` has supported two ways of providing input: files containing "dbt code," and CLI commands to execute them. Today, these are its exclusive officially supported APIs.

Over the past few months, the Core team has been working on a programmatic (Python) entry point for `dbt-core` that looks & feels like the CLI. It will be possible to call top-level dbt commands from a Python program, and pass in parameters, at parity with what you can do with dbt on the command line. We'll be releasing the first cut of this API for programmatic invocations in v1.5 next month.

Through that process, we've discussed what a second programmatic interface to dbt beyond the CLI could feel like—something that goes beyond what the CLI is capable of.
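For illustration, here is roughly what such a programmatic invocation looks like with the `dbtRunner` entry point that dbt-core 1.5 ended up shipping (a hedged sketch: the selector is hypothetical, and exact result shapes may differ between versions).

```python
# Sketch of a programmatic invocation at parity with the CLI.
# Assumes dbt-core >= 1.5; the selector "my_model+" is hypothetical.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Same arguments you would type after `dbt` on the command line.
res: dbtRunnerResult = dbt.invoke(["run", "--select", "my_model+"])

if res.success:
    for r in res.result:  # per-node results for `dbt run`
        print(f"{r.node.name}: {r.status}")
else:
    raise res.exception or RuntimeError("dbt invocation did not succeed")
```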
Why now? Why not before? There are tangible risks in evolving dbt-core from an opinionated workflow into a more-flexible library — we risk exposing the wrong interfaces, or removing guardrails that have been put in place for good reason. We haven’t seen a demonstrated need for it either, apart from requests from “superusers” in the community.
A few reasons emerged from our discussion about why now is the right time to start working on a second programmatic interface for `dbt-core`:

- Metadata about dbt invocations is currently available via artifacts, such as `manifest.json` or `run_results.json`, and via real-time information in structured logging—these aren't flexible enough to support future growth. Reading a hundred-MB JSON file at the end of a run is inherently less scalable than incrementally yielding a few KB when a node finishes running.
- There is functionality that would be useful to the large majority of `dbt-core` users², and that cannot be written in "dbt code"—it requires more power than what's currently available via packages.

First, we're interested in exposing more flexible inputs for documented & contracted interfaces, which could support providing them as Python objects, in addition to source (SQL/Python) and config (YAML) file contents. This would include:
We’re interested in unbundling large monolithic objects, such as “manifest” and “runtime configuration,” in favor of smaller composable pieces where possible. For example, yielding each node’s result as it finishes running.
Over time, we’re interested in potentially exposing new interfaces, as polymorphic Python objects:
As part of this effort, the major interface we do not plan to expose in its entirety is the DAG. There is more power in declaratively defining what nodes dbt is running, versus imperatively telling dbt (or another tool) how to run them. This is a similar pattern to frameworks like React, which allow users to define their components but manage how those components are run.
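To make the distinction concrete: in the declarative pattern, a caller describes which nodes should run using node selection syntax, and dbt derives the execution graph and order from the project's refs, rather than offering an API for mutating the graph. A minimal sketch (the selectors below are hypothetical):

```python
# Sketch: declare *what* to run via selection syntax; dbt decides *how*,
# using the DAG it built from the project's refs. Selectors are hypothetical.
from dbt.cli.main import dbtRunner

dbt = dbtRunner()

# "stg_orders and everything downstream of it, plus anything tagged nightly."
dbt.invoke(["run", "--select", "stg_orders+", "tag:nightly"])
```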
Ask from you, the reader:
Thanks, Kshitij (with help from @jtcohen6)
Footnotes
1. See Who Let the DAGs out by Tim Leonard of Flyte as a representative example that has me both impressed and horrified. ↩
2. Broad strokes, I'm thinking about whether the functionality is useful to 80%-90% of dbt users. ↩