Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More flexible control over selecting from sources #1592

Closed
bastienboutonnet opened this issue Jul 6, 2019 · 5 comments
Closed

More flexible control over selecting from sources #1592

bastienboutonnet opened this issue Jul 6, 2019 · 5 comments
Labels
stale Issues that have gone stale

Comments

@bastienboutonnet
Copy link
Contributor

bastienboutonnet commented Jul 6, 2019

Feature

Feature description

This is a follow up on a related issue (#564)

The idea would be to be able to bring dry-run like behaviour to Snowflake which does not have a proper dry-run option, unlike BQ, nor an explain statement that could be leveraged to do very fast SQL run verifications.
This could potentially be achieved by applying a select from {{source('foo', 'bar')}} where 1=0 which is blazing fast with probably next to no execution cost.

Another use case would be to have a way to limit or sample sources for quicker runs during development where users may benefit validating their SQL and checking some underlying data without having to wait for potentially large table to generate and drive compute costs through the roof.
This could potentially be achieved by using the TABLESAMPLE funtionality of Snowflake or a straightforward LIMIT wrapping the call to source.

Who will this benefit?

People who use Snowflake and want to run fast CI tests to validate SQL in their models and who have large tables which would cause CI/CD to run for very long and/or be very costly.

Very happy to discuss a few things around approaches and to get cracking with a PR down the line.

@drewbanin
Copy link
Contributor

drewbanin commented Jul 20, 2019

Hey @bastienboutonnet - wanted to take some time to think through how we could accomplish this! Thanks for your patience :)

Some options:

1. Support filters on source definitions

See also: #1495

dbt could support the specification of a target-aware filter on sources. When a source is referenced using {{ source(...) }}, dbt could auto-include this filter. This might look like:

# schema.yml

version: 2

sources:
  - name: snowplow
    tables:
      - name: event
         filter: |
            {% if target.name == 'dev' %}
               where event_time > current_timestamp - interval '3 days'
            {% elif target.name == 'ci' %}
               where 1 = 0
            {% endif %}

Pros:

Cons:

  • unsure if it makes sense to bake this logic into dbt's definition of what a "source" is
    • This feels like more of a "modelling" task than a source definition task IMO
  • you'd need to specify a filter like this for every source table (potentially very many)
    • I'm disinclined to make some sort of "global" source filter -- that sounds messy

2. Make it possible to augment the return value of source()

See also: #1603

dbt could make it possible to tap into the definition of the source macro. This macro could intelligently return different SQL based on the source that's being referenced. We could do something similar for ref, which could possibly address #1603

I can imagine that the logic in this macro could reference the target, or the config of the source, (eg. we could add a source config like sample: <bool>, or similar.

{% macro render_source(source_relation, source_config) %}
  {% if source_config.sample %}
    (select * from {{ source_relation }} where 1 = 0) as __dbt_source_{{ source_relation.name }}
  {% else %}
    {{ source_relation }}
  {% endif %}
{% endmacro %}

Pros:

  • Also very flexible - the user can control how source relations are interpolated in fine-grained way
  • In addition to filtering, users could include/exclude certain columns, add metadata columns (like the git sha, invocation id, etc)
  • Partially addresses Allow configurable include_policy for snowflake #1603

Cons:

  • I imagine this macro could become pretty complex if it embeds logic for all of the sources in a project
  • This feels pretty brittle - should users have the ability to override something as core to dbt as source() and ref()?
  • We'd need to determine if the way that dbt compiles projects would support a workflow like this. You couldn't reference a source() in render_source, as that would probably recurse forever. Are there other subtleties like this which would make implementing this macro challenging?

3. Defer to userspace

Users can already create a macro which calls the source() macro! You could do something like:

{% macro smart_source(source_name, table_name) %}
  {% if target.name == 'ci' %}
    (select * from {{ source(source_name, table_name }} where 1 = 0) as __dbt_source_{{ table_name }}
  {% else %}
    {{ source(source_name, table_name) }}
  {% endif %}
{% endmacro %}

Then in all of your relevant models, you can do:

select * 
from {{ smart_source(source_name, table_name) }}

In a CI environment, this would add the 1=0 filter to all references to dbt sources. One thing we could do is make it possible to override source directly. I don't think you can do that currently because the definition of the source macro masks dbt's internal source definition, but that shouldn't be too hard to work around.

Pros:

Cons:

  • should we encourage users to override the source macro? While flexible, sounds like it could lead to really subtle and hard to identify bugs!

4. Add a mechanism for "wrapping" SQL (#1096)

Last, we could implement something like #1096 to "wrap" the SQL generated for models & sources. This feels more well-defined for models than for sources IMO, but there's probably some good analog. I can imagine:

# dbt_project.yml

sources:
  wrap_sql: [filter_ci, filter_dev]

These filter_ci and filter_dev macros would look similar to the examples shown above. Ideally, sources and models would be tagged with metadata that would let these "middleware" macros operate on columns. The filter_dev macro would probably benefit from some model/source config which defines the timestamp field to filter the model/source on, for instance. Adding the 1=0 filter is a lot easier, as we only need to know about the target name.

Pros:

Cons:

  • Probably the most involved/difficult to implement option shown here
  • Might be tricky to figure out how to filter on a timestamp (eg. filter for the last 3 days of data in dev)
  • This might generate some particularly gnarly SQL, but unsure how important that is
    • My one concern is that db optimizers might have a hard time with all these subqueries

I just outlined a bunch of different options here. There's a lot of overlap between all of them, but they all have their own pros and cons. Curious to know:

  1. if you agree with the stated pros/cons (or if i missed anything)
  2. if you have any alternative ideas for how this might work
  3. which (if any) of these you think is the best way to proceed

Looking forward to hearing what you think!

@bastienboutonnet
Copy link
Contributor Author

Hey @drewbanin, sorry it took so long to get back to you on these. I had to think quite a lot about this and then a few things took over and I had too much on my plate.

Anyway, in short "fuck, there's no real clear easy winner". In long:

In general, I appreciate your concern with potentially arriving at a situation where you don't really want to let the definition of source() slip. Personally, I think I am very ok with it given good documentation and clear "Best Practices" on it. Something like adding warning or argumentation not to abuse source-level filters for doing very modelling specific stuff, but, rather, that it should be used to leverage things like partitioning, clustering, of filtering to ease with "dev" and CI runs. I think no matter which ends up being picked I think that that would apply and I'm not really worried because from my experience my team and I like following best practices and adhering to things that make sense to do in certain places. That being said I can imagine that it opens the door for a lot of slippage.

Option 1:

I like the flexibility, I think the fact that this lives in documentation/yml space is nice and clear. Yes it could be abused to become more of a modelling thing which is riskier but I like the transparency and ease of implementation. In many ways it's very close to option 4 except option 4 is quite a bit more involved. I feel that it would indeed help with issues like #1495 and actually partially #1096 although it would not be wrapping per se nor be applicable to any kind of SQL. But I like the separation, the modularity. I don't really like the fact that it would have to be added to all sources in the yml. I also like that it can work for ref() ideally whatever we pick should be able to append such filter to both sources and refs otherwise it would feel incomplete and unbalanced.

Option 2:

At first, sounded nice, but I really don't like it. The macro would get potentially complex, it offers very little modularity it's not great.

Option 3:

Thats possible of course, but requires all users to refactor all code. And I really think there ought to be some "official" more core built in support for things like being friendly with CI/DEV or partitioning runs. I feel actually that while this defers the responsibility, it can be even more dangerous and lead to pretty unconventional and even not nice exploitations.

Option 4:

I think this one is the winner in terms of how nice and clean it looks and how explicit modular and flexible it is. And it potentially helps with all the issues I have a feeling, if we make it be possible for sources and refs. The only worry which may demand some testing would be how optimisers feel about all this wrapping. Do you have some experience with testing things like this?

In conclusion:

I think I would feel most happy with implementing 1 or 4 (given some benchmarking that so much wrapping isn't really going to terribly upset optimisers). I would need some pointers for both with regards to adding some support for the yml parsing. For 4 maybe more pointers but ultimately I think they're both very good because of how clearly things live in the yml.

Let me know what you think and if someone could point me to a few things so that I could get started. I'm really eager to implement it I think this could be a really relevant feature given the trend towards bigger and bigger data in most companies.

@pedromachados
Copy link

I also lean towards 1 and 4.

What I don't like about 1 is the complex Jinja in the yml file. Could it accept simply a macro call with a predefined signature (expects certain arguments such as target and source)? The user would then implement the logic in each macro that returns the filter condition.

l think it's important to be able to define it at the global level and have it apply to multiple sources.

For example, I have a project where we read from many tables that come from a multi tenant SaaS app. Each tenant has a database_id. It would be useful to be able to define a global filter that would keep a subset of the IDs in order to speed up the development run time.

@marcellovictorino
Copy link

marcellovictorino commented Jul 7, 2021

Sharing this to help others that might find this Issue, but remain without an actionable solution/approach:

After doing some digging in the Slack channel, I found this thread where @jtcohen6 comes to the rescue once again!

For my use case (limit data usage when target.name != 'prod'), I was able to just extend the source functionality, by using the bultin.source functionality. Link to documentation

@github-actions
Copy link
Contributor

github-actions bot commented Jan 4, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Jan 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

4 participants