Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic Model Reference and depends_on hint #2716

Closed
brittianwarner opened this issue Aug 19, 2020 · 14 comments
Closed

Dynamic Model Reference and depends_on hint #2716

brittianwarner opened this issue Aug 19, 2020 · 14 comments
Labels
enhancement New feature or request

Comments

@brittianwarner
Copy link

brittianwarner commented Aug 19, 2020

Describe the feature

I want to be able to dynamically union models depending on runtime variables. For example, we have Customers A, B, C. Customer A has models 1,2,3. Customer B has models 1,2. Customer C has models 2,3. I need to be able to dynamically union these models together, so when I call dbt run for Customer A I want to union models 1,2,3; Customer B union models 1,2; and Customer C union models 2,3. I am currently running into an issue where DBT is asking for a depends_on hint. It is important that I can dynamically change the dependencies on a model given variables. Open to any suggestions/alternatives!

dbt was unable to infer all dependencies for the model "combined_performance".
This typically happens when ref() is placed within a conditional block.

To fix this, add the following hint to the top of the model "combined_performance":

-- depends_on: {{ ref('stg_google_performance') }}

> in macro get_ad_relation (macros/get_customer_attributes.sql)
> called by model combined_performance (models/ads/combined_performance.sql)
> called by model combined_performance (models/ads/combined_performance.sql)

Example of dbt run:
dbt run --vars '{"Customer":"A"}'
dbt run --vars '{"Customer":"B"}'
dbt run --vars '{"Customer":"C"}'

Describe alternatives you've considered

None

Additional context

Using Snowflake.

Who will this benefit?

Anyone who is building a dbt model for many customers/variations.

Are you interested in contributing this feature?

Yes

@brittianwarner brittianwarner added enhancement New feature or request triage labels Aug 19, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 19, 2020

Thanks for the walkthrough @brittianwarner. I don't think this is going to be possible today. dbt needs to be able to construct the dependency graph without stepping through Jinja conditional statements, hence the -- depends_on hint. While we may someday do the work to support single-pass Jinja rendering, that's a massive undertaking and we'll track progress there separately.

In the meantime, here are the options I see:

  1. Changing your data model: In general, I recommend creating unioned versions of models 1, 2, and 3 that include data for all customers. This saves a lot of duplicated transformation logic, and the complexity of separate var-driven invocations. Then, you filter at the very last step by the customer that you need in the moment. By that point, you likely don't need a dbt var; it would be much more powerful as a BI parameter, or logic encoded into a data sharing layer.

  2. Placeholder tables: Create empty source tables + models for customers who do not have a given dataset. This could be as simple as

create customer_b_source.google_performance as (select * from customer_a_source.google_performance limit 0);

I know I recommended something similar in #2673.

Does either of those make sense in your case?

@jtcohen6 jtcohen6 removed the triage label Aug 19, 2020
@brittianwarner
Copy link
Author

Thanks, @jtcohen6! Unfortunately, it would be a huge headache to build different versions of the model for each Customer as we could in theory have 20 models and we would have to support any combination of those depending on the customer. Let me think a bit more about your second recommendation and I will get back to you.

In general, I feel it would be extremely powerful to be able to set dependencies based on a set of variables at runtime. Our use case involves having a dbt job for 100+ customers that all have their own "gotchas" so having this flexibility would be awesome.

Another idea... Is there a way to manually add a dependency via a macro or something?

@jtcohen6
Copy link
Contributor

Is there a way to manually add a dependency via a macro or something?

What have you got in mind?

@brittianwarner
Copy link
Author

I haven't fully thought through it, but it would be nice to insert/update/delete dependencies before any models execute. So I am picturing some sort of pre-run operation where I can change a dependency based on a variable or lookup table. So for my example, I would pass a customer_id to 'dbt run' and then I would have a macro that would lookup configuration data and set dependencies on the fly. Thoughts?

@brittianwarner
Copy link
Author

brittianwarner commented Aug 19, 2020

I think I may have figured it out... I went ahead and added all of the depends_on hints at the top of the model and built a macro to build the sql needed... Here is an example:

{% macro get_relation()  %}
{% set sql %}
select distinct 
	concat('stg_',replace(model,'-','_'),'_performance')   as model_type
from model_lookup a 
where a.model ilike any ('model1','model2','model3','model4','model5')
and customer_id = {{ var('customer_id') }}
{% endset %}

{% set relations = run_query(sql) %}

{% set relations_list = [] %}
{% if execute %}
{% for relation in relations.columns[0].values() %}
  select 
      *
  from {{ ref(relation) }}
{% if not loop.last -%} union all {%- endif %}
{% endfor %}
{% endif %}

{% endmacro %}

{{ 
	config(materialized='table') 
}}
-- depends_on: {{ ref('stg_model1_performance') }}
-- depends_on: {{ ref('stg_model2_performance') }}
-- depends_on: {{ ref('stg_model3_performance') }}
-- depends_on: {{ ref('stg_model4_performance') }}
-- depends_on: {{ ref('stg_model5_performance') }}

{{ get_relation()}}

@brittianwarner
Copy link
Author

@jtcohen6 Making sure you saw this.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 20, 2020

That's a neat approach! I think the critical piece is that you're including the following, outside of any conditional logic:

-- depends_on: {{ ref('stg_model1_performance') }}
-- depends_on: {{ ref('stg_model2_performance') }}
-- depends_on: {{ ref('stg_model3_performance') }}
-- depends_on: {{ ref('stg_model4_performance') }}
-- depends_on: {{ ref('stg_model5_performance') }}

Even if you don't end up selecting from all these models each time, dbt is able to build the dependency graph and establish them as parents (well, potential parents). Then, your macro can tell you which ones you actually need for a given customer.

The database lookup feels like an unnecessary complexity, though that depends on how many potential customer-models variations you have and how often those mappings change. If the number of mappings is relatively stable and limited in number, I'd think about an approach like:

{% macro customer_model_mapping()  %}

  {% set mappings = [
    {"customer": "A", "model_types": ('model1', 'model2', 'model3')},
    {"customer": "B", "model_types": ('model1', 'model2')},
    {"customer": "C", "model_types": ('model2', 'model3')} 
  ] %}

  {% do return(mappings) %}

{% endmacro %}

{% macro get_models_for_customer(customer_id) %}

  {% set customer_dict = customer_model_mapping()|selectattr("customer", customer_id) %}
  {% set models_for_customer = customer_dict['model_types'] %}

  {% do return(models_for_customer) %}

{% endmacro %}

Whether you encode the variable into your macros, or pass it as an argument from your model, is totally up to you:

{{ 
	config(materialized='table') 
}}
-- depends_on: {{ ref('stg_model1_performance') }}
-- depends_on: {{ ref('stg_model2_performance') }}
-- depends_on: {{ ref('stg_model3_performance') }}
-- depends_on: {{ ref('stg_model4_performance') }}
-- depends_on: {{ ref('stg_model5_performance') }}

{% set model_list = get_models_for_customer(var('customer_id')) %}

{% for model_name in model_list %}
{% set relation = `stg_` ~ model_name ~ 'performance' %}

  select 
      *
  from {{ ref(relation) }}
{% if not loop.last -%} union all {%- endif %}
{% endfor %}
{% endif %}

I'm going to close this issue. It's a complex problem, and there are (a number of) complex approaches already possible today :)

@john-joseph-noonan
Copy link

Hi folk. I'm still facing this challenge.

I need to be able to dynamically generate a list of depends_on models as my use-case scales rather quickly.

Does anyone have any updates on better ways to do this?

@ivoalbrecht
Copy link

@john-joseph-noonan

Did you find a solution?

I have a model where I union all models that start with some_prefix dynamically:

WITH unioned_messages AS (

    {% for model in graph.nodes.values() if model.name.startswith('some_prefix_') %}
        SELECT * FROM {{ref(model.name)}}
        {% if not loop.last %} 
        UNION ALL
        {% endif %}
    {% endfor %}

)

You can force dependencies, but I haven't found a way to do that dynamically as well

@jtcohen6 Perhaps you know how to approach that?

@john-joseph-noonan
Copy link

@john-joseph-noonan

Did you find a solution?

I have a model where I union all models that start with some_prefix dynamically:

WITH unioned_messages AS (

    {% for model in graph.nodes.values() if model.name.startswith('some_prefix_') %}
        SELECT * FROM {{ref(model.name)}}
        {% if not loop.last %} 
        UNION ALL
        {% endif %}
    {% endfor %}

)

You can force dependencies, but I haven't found a way to do that dynamically as well

@jtcohen6 Perhaps you know how to approach that?

Unfortunately, it seems like it's not possible. I searched through the dbt Slack channels and every arrives on the same conclusion, that it's not yet possible.

@ivoalbrecht
Copy link

@john-joseph-noonan
Did you find a solution?
I have a model where I union all models that start with some_prefix dynamically:

WITH unioned_messages AS (

    {% for model in graph.nodes.values() if model.name.startswith('some_prefix_') %}
        SELECT * FROM {{ref(model.name)}}
        {% if not loop.last %} 
        UNION ALL
        {% endif %}
    {% endfor %}

)

You can force dependencies, but I haven't found a way to do that dynamically as well
@jtcohen6 Perhaps you know how to approach that?

Unfortunately, it seems like it's not possible. I searched through the dbt Slack channels and every arrives on the same conclusion, that it's not yet possible.

Thanks

I did not solve it either, I used a workaround where I do not use the ref function and reference the models that finish the latest in the run via -- depends_on:.

This makes sure that the model runs after the unioned models (at least for now and likely as well if new models are added)

@stevenconnorg
Copy link

stevenconnorg commented Feb 3, 2023

I found a way around this by adding this to the top of my file to dynamically generate the depends_on comment:


{% for model in[<list_of_model_names] %}
{% set depends_on = "--depends_on: {{  ref( '" ~  model ~ "' )  }}" %}
{{  depends_on  }}
{% endfor %}

compiled as:


--depends_on: {{  ref( '<model name 1>' )  }}


--depends_on: {{  ref( '<model name 2>' )  }}


--depends_on: {{  ref( '<model name 3>' )  }}

It's not the prettiest but it works

@NicolasPA
Copy link
Contributor

NicolasPA commented Mar 14, 2023

I have a similar issue with a dynamic dependency, but in a generic test that takes as a parameter a macro referring to other tables.
The error is triggered by the dependency on those tables.

This adds the complexity that graph.nodes doesn't exist at compilation but does at execution, so I have to add a condition for that to not fail at test compilation.

{% test is_count_inferior_to_all_N_M_versions(model, all_versions_macro_name) %}

{% if execute %}
    {% for model in graph.nodes.values() if '_all_versions_v' in model.name %}
        {{ log("-- depends_on: {{ ref('" + model.name + "') }}") }}
        {% set depends_on = "-- depends_on: {{ ref( '" + model.name + "' ) }}" %}
        {{ depends_on }}
    {% endfor %}
{% endif %}

with all_versions as (

    select
        1 as pk,
        count(*) as all_versions_count
    from (
        -- We can't pass a macro directly in the YML so we get it from the context, which only exists during execution
        {% if execute %}
            {% set all_versions_macro = context.get(all_versions_macro_name) %}
            {{ all_versions_macro() }}
        {% else %}
            select 1  -- default value for compile to succeed outside of execution
        {% endif %}
    ) as a
    where a.action_type in ('N', 'M')

),

last_version as (

    select
        1 as pk,
        count(*) as last_version_count
    from {{ model }}

)

select
    all_versions_count,
    last_version_count,
    all_versions_count - last_version_count as diff
from last_version
inner join all_versions
    on last_version.pk = all_versions.pk
where last_version_count > all_versions_count

{% endtest %}

Sadly, it doesn't work. I still get the compilation error.

I know the string generation is correct because if I simply copy-past what the lines printed by the log function and add them to my test definition, it works. But I don't want to do that because the concerned models will evolve in the future, and I don't want to have to change those comments manually every time (and probably forget about it at some point).

The macro call also uses a context trick, but it works fine.

@jelstongreen
Copy link

We've run into this issue again and again over the last few years - dynamic dependency seems like such a useful feature, can we reopen this please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants