More flexible control over selecting from sources #1592

bastienboutonnet · 2019-07-06T15:56:40Z

Feature

Feature description

This is a follow up on a related issue (#564)

The idea would be to be able to bring dry-run like behaviour to Snowflake which does not have a proper dry-run option, unlike BQ, nor an explain statement that could be leveraged to do very fast SQL run verifications.
This could potentially be achieved by applying a select from {{source('foo', 'bar')}} where 1=0 which is blazing fast with probably next to no execution cost.

Another use case would be to have a way to limit or sample sources for quicker runs during development where users may benefit validating their SQL and checking some underlying data without having to wait for potentially large table to generate and drive compute costs through the roof.
This could potentially be achieved by using the TABLESAMPLE funtionality of Snowflake or a straightforward LIMIT wrapping the call to source.

Who will this benefit?

People who use Snowflake and want to run fast CI tests to validate SQL in their models and who have large tables which would cause CI/CD to run for very long and/or be very costly.

Very happy to discuss a few things around approaches and to get cracking with a PR down the line.

The text was updated successfully, but these errors were encountered:

drewbanin · 2019-07-20T18:46:07Z

Hey @bastienboutonnet - wanted to take some time to think through how we could accomplish this! Thanks for your patience :)

Some options:

1. Support filters on source definitions

2. Make it possible to augment the return value of `source()`

3. Defer to userspace

Users can already create a macro which calls the source() macro! You could do something like:

{% macro smart_source(source_name, table_name) %}
  {% if target.name == 'ci' %}
    (select * from {{ source(source_name, table_name }} where 1 = 0) as __dbt_source_{{ table_name }}
  {% else %}
    {{ source(source_name, table_name) }}
  {% endif %}
{% endmacro %}

Then in all of your relevant models, you can do:

select * 
from {{ smart_source(source_name, table_name) }}

In a CI environment, this would add the 1=0 filter to all references to dbt sources. One thing we could do is make it possible to override source directly. I don't think you can do that currently because the definition of the source macro masks dbt's internal source definition, but that shouldn't be too hard to work around.

Pros:

strictly in userspace
already possible, with minimal dbt code changes required to allow users to override source directly
applicable to refs too (Allow configurable include_policy for snowflake #1603, Allow partition filters to be passed to source freshness checks #1495)
We can package up helper macros into dbt-utils, for instance, which provide compelling functionality out of the box (though I suppose this is true for both options above too)

Cons:

should we encourage users to override the source macro? While flexible, sounds like it could lead to really subtle and hard to identify bugs!

4. Add a mechanism for "wrapping" SQL (#1096)

Last, we could implement something like #1096 to "wrap" the SQL generated for models & sources. This feels more well-defined for models than for sources IMO, but there's probably some good analog. I can imagine:

# dbt_project.yml

sources:
  wrap_sql: [filter_ci, filter_dev]

These filter_ci and filter_dev macros would look similar to the examples shown above. Ideally, sources and models would be tagged with metadata that would let these "middleware" macros operate on columns. The filter_dev macro would probably benefit from some model/source config which defines the timestamp field to filter the model/source on, for instance. Adding the 1=0 filter is a lot easier, as we only need to know about the target name.

Pros:

Implements Add a mechanism for "wrapping" SQL across materializations #1096 which is compelling in its own right
You can have multiple "wrapping" macros, so you can encapsulate particular logic for filtering in dev or CI envs separately

Cons:

Probably the most involved/difficult to implement option shown here
Might be tricky to figure out how to filter on a timestamp (eg. filter for the last 3 days of data in dev)
This might generate some particularly gnarly SQL, but unsure how important that is
- My one concern is that db optimizers might have a hard time with all these subqueries

I just outlined a bunch of different options here. There's a lot of overlap between all of them, but they all have their own pros and cons. Curious to know:

if you agree with the stated pros/cons (or if i missed anything)
if you have any alternative ideas for how this might work
which (if any) of these you think is the best way to proceed

Looking forward to hearing what you think!

bastienboutonnet · 2019-08-10T09:38:51Z

Hey @drewbanin, sorry it took so long to get back to you on these. I had to think quite a lot about this and then a few things took over and I had too much on my plate.

Anyway, in short "fuck, there's no real clear easy winner". In long:

In general, I appreciate your concern with potentially arriving at a situation where you don't really want to let the definition of source() slip. Personally, I think I am very ok with it given good documentation and clear "Best Practices" on it. Something like adding warning or argumentation not to abuse source-level filters for doing very modelling specific stuff, but, rather, that it should be used to leverage things like partitioning, clustering, of filtering to ease with "dev" and CI runs. I think no matter which ends up being picked I think that that would apply and I'm not really worried because from my experience my team and I like following best practices and adhering to things that make sense to do in certain places. That being said I can imagine that it opens the door for a lot of slippage.

Option 1:

I like the flexibility, I think the fact that this lives in documentation/yml space is nice and clear. Yes it could be abused to become more of a modelling thing which is riskier but I like the transparency and ease of implementation. In many ways it's very close to option 4 except option 4 is quite a bit more involved. I feel that it would indeed help with issues like #1495 and actually partially #1096 although it would not be wrapping per se nor be applicable to any kind of SQL. But I like the separation, the modularity. I don't really like the fact that it would have to be added to all sources in the yml. I also like that it can work for ref() ideally whatever we pick should be able to append such filter to both sources and refs otherwise it would feel incomplete and unbalanced.

Option 2:

At first, sounded nice, but I really don't like it. The macro would get potentially complex, it offers very little modularity it's not great.

Option 3:

Thats possible of course, but requires all users to refactor all code. And I really think there ought to be some "official" more core built in support for things like being friendly with CI/DEV or partitioning runs. I feel actually that while this defers the responsibility, it can be even more dangerous and lead to pretty unconventional and even not nice exploitations.

Option 4:

I think this one is the winner in terms of how nice and clean it looks and how explicit modular and flexible it is. And it potentially helps with all the issues I have a feeling, if we make it be possible for sources and refs. The only worry which may demand some testing would be how optimisers feel about all this wrapping. Do you have some experience with testing things like this?

In conclusion:

I think I would feel most happy with implementing 1 or 4 (given some benchmarking that so much wrapping isn't really going to terribly upset optimisers). I would need some pointers for both with regards to adding some support for the yml parsing. For 4 maybe more pointers but ultimately I think they're both very good because of how clearly things live in the yml.

Let me know what you think and if someone could point me to a few things so that I could get started. I'm really eager to implement it I think this could be a really relevant feature given the trend towards bigger and bigger data in most companies.

pedromachados · 2019-08-15T12:46:42Z

I also lean towards 1 and 4.

What I don't like about 1 is the complex Jinja in the yml file. Could it accept simply a macro call with a predefined signature (expects certain arguments such as target and source)? The user would then implement the logic in each macro that returns the filter condition.

l think it's important to be able to define it at the global level and have it apply to multiple sources.

For example, I have a project where we read from many tables that come from a multi tenant SaaS app. Each tenant has a database_id. It would be useful to be able to define a global filter that would keep a subset of the IDs in order to speed up the development run time.

marcellovictorino · 2021-07-07T14:24:17Z

Sharing this to help others that might find this Issue, but remain without an actionable solution/approach:

After doing some digging in the Slack channel, I found this thread where @jtcohen6 comes to the rescue once again!

For my use case (limit data usage when target.name != 'prod'), I was able to just extend the source functionality, by using the bultin.source functionality. Link to documentation

github-actions · 2022-01-04T01:51:30Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

drewbanin mentioned this issue Jul 24, 2019

Allow configurable include_policy for snowflake #1603

Closed

github-actions bot added the stale Issues that have gone stale label Jan 4, 2022

github-actions bot closed this as completed Jan 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More flexible control over selecting from sources #1592

More flexible control over selecting from sources #1592

bastienboutonnet commented Jul 6, 2019 •

edited

Loading

drewbanin commented Jul 20, 2019 •

edited

Loading

bastienboutonnet commented Aug 10, 2019

pedromachados commented Aug 15, 2019

marcellovictorino commented Jul 7, 2021 •

edited

Loading

github-actions bot commented Jan 4, 2022

More flexible control over selecting from sources #1592

More flexible control over selecting from sources #1592

Comments

bastienboutonnet commented Jul 6, 2019 • edited Loading

Feature

Feature description

Who will this benefit?

drewbanin commented Jul 20, 2019 • edited Loading

1. Support filters on source definitions

2. Make it possible to augment the return value of source()

3. Defer to userspace

4. Add a mechanism for "wrapping" SQL (#1096)

bastienboutonnet commented Aug 10, 2019

Option 1:

Option 2:

Option 3:

Option 4:

In conclusion:

pedromachados commented Aug 15, 2019

marcellovictorino commented Jul 7, 2021 • edited Loading

github-actions bot commented Jan 4, 2022

bastienboutonnet commented Jul 6, 2019 •

edited

Loading

drewbanin commented Jul 20, 2019 •

edited

Loading

2. Make it possible to augment the return value of `source()`

marcellovictorino commented Jul 7, 2021 •

edited

Loading