-
Notifications
You must be signed in to change notification settings - Fork 1.6k
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make it possible for DBT to generate models from a piece of JINJA code #3469
Comments
@bashyroger Thanks for opening! Apologies for the delay in my response. There's a lot here to chew on. Today, one of the foundational assumptions underlying dbt is that 1 model file = 1 model node = 1 object (view or table) in the database. What's the problem here? Whenever we want a new dbt-managed database object, it requires a PR to add a new model file. That's hardly an automated approach to data warehousing. I've seen three quite different approaches that try to solve this problem:
At the risk of writing much too much here, I want to discuss each approach in some detail. 1. One model, many objectsIn the most common use case, someone wants a large model result—same logic, same columns—to be split in the the database across multiple tables, with perhaps a few parametrized differences across them. There may be a good reason (such as PII) that each subset needs to be isolated from other records, with the same column schema, transformed in the exact same manner. We see this sometimes with marketing or consulting agencies who offer modeled data as a service to their clients. There are parts about this that are tricky for dbt today:
In the meantime, while this functionality isn't natively supported, I've seen valid approaches here go in two different directions:
I wanted to bring up this approach because these have often been part of the same conversation. It's not exactly the issue you describe here, and I don't think it would apply nearly as well to unstructured data that doesn't share a common column schema. 2. One file, many modelsThis is an old, old issue: #184. There was a long time when we were convinced that model blocks the ultimate way to go. They've fallen out of fashion recently, but this vein of thinking was especially popular around the time that we created snapshots (dbt v0.14.0, summer 2019). -- models/many_models.sql
{% model stg_stripe_payments %}
select * from {{ source('stripe', 'payments') }}
where is_deleted = false
{% endmodel %}
{% model net_revenue %}
{{ config(materialized = 'table') }}
select
orders.order_id,
payments.amount - orders.cogs as net_revenue
from {{ ref('stg_stripe_payments') }} as payments
join {{ ref('stg_orders') }} as orders
on payments.order_id = orders.order_id
{% endmodel %} There are a few things that model blocks have going for them:
This may still be a direction we head in, ultimately. We won't get there before dbt v1.0, but who's to say it couldn't be an essential component of dbt v2.0? I bring up model blocks here because the prospect of decoupling 1 model-per-file also suggests the possibility of taking this further, by treating the larger file as a single Jinja template that could produce many model blocks dynamically: -- models/many_models.sql
{% for source in graph.sources.values() | selectattr('source_name', 'equalto', 'stripe') | list %}
{% model stg_stripe_{{ source.name }} %} {# how would this work? #}
select * from {{ source(source.source_name, source.name) }}
where is_deleted = false
{% endmodel %}
{% endfor %} If you try to do this with snapshots today, you'll get an error message:
For good reason: This gets really complex to parse, and it makes us even more reliant on Jinja at a time when we're investing in static analysis to speed up parse-time manifest construction. There's an alternative syntax toyed with in #1374 (comment). That issue was about dynamic columns, to avoid repeating the same yaml properties over and over. (That's ground we've since retrod in #2995, among other places, and definitely interested in thinking about more.) There, Drew mentioned the theoretical idea of a -- models/many_models.sql
{% for source in graph.sources.values() | selectattr('source_name', 'equalto', 'stripe') | list %}
{% set model_sql %}
select * from {{ source(source_name, name) }}
{% endset %}
{% set model_name = 'stg_' + source.source_name + '_' + source.name %}
{% do manifest.add_model(name=model_name, sql=model_sql) %}
{% endfor %} While it's easier to imagine writing code like this, it would still be fiendishly complex to handle at parse time. This approach quickly runs into a big limitation in how dbt works today, given these two premises:
So the SQL for each individual model could be dynamically generated, just as it can today, based on the latest content or metadata in the database. But it would not be possible to create more or fewer models on the basis on information stored in the database—exactly the sort of use case that your issue is getting at. Fuller capabilities would require big foundational changes to the way that dbt works. 3. Code generationAfter all, maybe dbt cannot be—ought not be—a data warehouse automater itself, but merely the best-possible substrate for "true" DWH automation? I feel this gets at the heart of the matter. In my early years using dbt, I thought the extensibility and flexibility of its tooling were its strongest features; with Jinja, all things are possible, certainly compared to SQL. Nowadays, I'm inclined to think that dbt's greatest strenght is the rigidity of its opinionated framework, its demand for decisiveness and verbosity in the places that matter. The guardrails are there for a reason. So, let's say we keep living in a world where one database object has to be one model has to be one file (or at least one Jinja block). An automated process could still be of tremendous help by generating those files; open PRs; run CI checks; require a human reviewer (or not); merge; revert if necessary. In that world, data warehouse automation is a time-lapse photograph: a double-speed verson of the current git-based dbt development workflow, with human intervention at only the appropriate moments. The key difference is that data warehouse automation is decoupled from data warehouse execution—their handoff point is dbt model code, explicitly defined. What should that automated process look like? Should the premier executor of dbt code also try to be its own code generator? Or should there be a separate tool? There's fair arguments on either side:
Why not both? tl;drEventually, I could see us doing pieces of all three of the things mentioned above:
Ultimately, I find the possibilities offered by the third approach most compelling. Having written 1500 words on dbt and model automation, I'm coming around to the idea that a native dbt task (#1082), or even just a way to plug together |
A masterful play in three acts ! Thank you Jeremy. |
We built workflows like this at Aula and we'll be talking about it at Coalsece 2021! Jeremy's description, above, is an elegant summary of what our automation does. Source schema files, snapshots, and the first staging layer are all auto-generated and auto-maintaining. The automation relies on the following components:
I didn't know about dbt-helper until just now, so I'm interested to take a look and see how importing dbt-core directly might help standardize the methods and make the code more usable for others. Assuming my coworker from Aula is ok with it, I'd also love to contribute it somewhere in an open-source way. I see somewhat frequent posts over in the dbt slack from people looking for similar functionality. |
Great discussion. If I get it right, approaches 1 and 2 above do not currently work, correct? Even if we end up with just approach 3, a "template" or reference implementation would be handy. |
My team would also like this feature! |
I'm curious if the ability from Jinja2 to extend from another template and replace predefined blocks of code (template inheritance) was considered for this feature? Then a user could define something like |
We would still really love this feature! |
would love this feature! |
Hi @Gwildor, there is an old feature request about jinja2 template inheritance : #1337 |
My use case would be for a multi-tenant platform where we create an access layer (view) for every tenant. I would like to configure a list of tenant identifiers and generate a view for each tenant using one generic jinja template. |
Thanks for your extensive comment @jtcohen6 , I have been giving your writings a bit more thoughs
I still think it can / should be partially. Partially by indeed introducing 'template models': models with a recursive (Jinja) loop that can spawn the creation of multiple models files. From what I can infer about your writing, this initial request would largely be fulfilled if #3428 were implemented. Regarding you comment on parse time / vs execution time problem: those template models would not have to run at execution time initially. As a developer, I would purely see them as a way to automate the creation of multiple model files from a template, something that the current codegen package indeed cannot do. Practically, I would expect a command to exists like: Then later, further along the road, I would see them be added to the execution context:
|
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Describe the feature
While I like DBT a lot, it us currently quite limited in supporting true metadata based 'data warehouse automation'. Sure, a lot of things are automated for you in DBT:
However, what is lacking is the possibility for DBT to:
Describe alternatives you've considered
At my current client we store our raw data in a snowflake variant column, with the schema metadata of that column in a separate variant column.
We do this to keep loading data simple, to never ever be affected by source schema changes anymore in the raw data zone.
To actually use this data however, we have written a piece of python code that parses the stored schema metadata to generate a series of DBT models (materialized as views) on top of the raw data tables.
While this all works fine, this code is obviously disconnected from DBT which currently hinders development as it requires code / environment / context switching.
Additional context
Who will this benefit?
Everyone that wants to do more extensive data warehouse automation with / using DBT.
Are you interested in contributing this feature?
I am not a solid programmer, but definitely would like to contribute in the requirements, use cases and beta test this kind of functionality
The text was updated successfully, but these errors were encountered: