More flexible control over selecting from sources #1592
Hey @bastienboutonnet - wanted to take some time to think through how we could accomplish this! Thanks for your patience :) Some options:

**1. Support filters on source definitions**

See also: #1495

dbt could support the specification of a target-aware filter on sources. When a source is referenced using

```yaml
# schema.yml
version: 2
sources:
  - name: snowplow
    tables:
      - name: event
        filter: |
          {% if target.name == 'dev' %}
          where event_time > current_timestamp - interval '3 days'
          {% elif target.name == 'ci' %}
          where 1 = 0
          {% endif %}
```

Pros:
Cons:
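For illustration, option 1 could mean that a model selecting from the source compiles roughly like this under a dev target (a sketch of hypothetical behaviour, not current dbt output; the database and schema names are placeholders):

```sql
-- model SQL: select * from {{ source('snowplow', 'event') }}
-- hypothetical compiled SQL when target.name == 'dev':
select *
from (
    select *
    from raw.snowplow.event
    where event_time > current_timestamp - interval '3 days'
) as event
```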
2. Make it possible to augment the return value of
Hey @drewbanin, sorry it took so long to get back to you on these. I had to think quite a lot about this, and then a few things took over and I had too much on my plate. Anyway, in short: "fuck, there's no real clear easy winner". In long:

In general, I appreciate your concern about potentially arriving at a situation where you don't really want to let the definition of

**Option 1:** I like the flexibility, and I think the fact that this lives in documentation/yml space is nice and clear. Yes, it could be abused to become more of a modelling thing, which is riskier, but I like the transparency and ease of implementation. In many ways it's very close to option 4, except option 4 is quite a bit more involved. I feel that it would indeed help with issues like #1495 and partially #1096, although it would not be wrapping per se, nor be applicable to any kind of SQL. But I like the separation, the modularity. I don't really like the fact that it would have to be added to all sources in the yml. I also like that it can work for

**Option 2:** At first it sounded nice, but I really don't like it. The macro would get potentially complex, and it offers very little modularity; it's not great.

**Option 3:** That's possible, of course, but it requires all users to refactor all their code. And I really think there ought to be some "official", more core, built-in support for things like being friendly with CI/dev or partitioning runs. I actually feel that while this defers the responsibility, it can be even more dangerous and lead to pretty unconventional and even unpleasant workarounds.

**Option 4:** I think this one is the winner in terms of how nice and clean it looks, and how explicit, modular, and flexible it is. And I have a feeling it could potentially help with all the issues, if we make it possible for both sources and refs. The only worry, which may demand some testing, is how query optimisers feel about all this wrapping. Do you have some experience with testing things like this?

**In conclusion:** I think I would feel most happy implementing 1 or 4 (given some benchmarking showing that so much wrapping isn't going to terribly upset optimisers). I would need some pointers for both with regards to adding support for the yml parsing; for 4, maybe more pointers. Ultimately I think they're both very good because of how clearly things live in the yml. Let me know what you think, and whether someone could point me to a few things so that I could get started. I'm really eager to implement it; I think this could be a really relevant feature given the trend towards bigger and bigger data in most companies.
I also lean towards 1 and 4. What I don't like about 1 is the complex Jinja in the yml file. Could it accept simply a macro call with a predefined signature (expecting certain arguments, such as target and source)? The user would then implement the logic in each macro that returns the filter condition. I think it's important to be able to define it at the global level and have it apply to multiple sources. For example, I have a project where we read from many tables that come from a multi-tenant SaaS app. Each tenant has a
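A sketch of the macro-based variant proposed here (the macro name, its signature, and the filter column are all assumptions; this is not an existing dbt API):

```sql
-- macros/default_source_filter.sql (hypothetical)
{% macro default_source_filter(source_node, target) %}
    {% if target.name == 'dev' %}
        where loaded_at > current_timestamp - interval '3 days'
    {% elif target.name == 'ci' %}
        where 1 = 0
    {% endif %}
{% endmacro %}
```

A source's `filter:` key could then invoke this one macro, or the project could register it globally, so the same logic applies to many sources without repeating Jinja in the yml.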
Sharing this to help others that might find this issue but remain without an actionable solution/approach: after doing some digging in the Slack channel, I found this thread where @jtcohen6 comes to the rescue once again! For my use case (limiting data usage when target.name != 'prod'), I was able to just extend the
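For anyone landing here: the workaround alluded to is typically implemented by shadowing the built-in `source()` macro in your own project. A minimal sketch, assuming a dbt version that exposes the original implementation as `builtins.source` (check your version's docs) and an arbitrary row limit:

```sql
-- macros/source.sql: project-level override of the built-in source() macro
{% macro source(source_name, table_name) %}
    {% if target.name != 'prod' %}
        {# In non-prod targets, wrap the source in a limited subquery. #}
        (select * from {{ builtins.source(source_name, table_name) }} limit 1000)
    {% else %}
        {{ builtins.source(source_name, table_name) }}
    {% endif %}
{% endmacro %}
```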
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Feature

**Feature description**

This is a follow-up on a related issue (#564).

The idea would be to bring `dry-run`-like behaviour to Snowflake, which, unlike BQ, has neither a proper `dry-run` option nor an `explain` statement that could be leveraged for very fast SQL run verification. This could potentially be achieved by applying a `select from {{ source('foo', 'bar') }} where 1=0`, which is blazing fast with probably next to no execution cost.

Another use case would be a way to limit or sample sources for quicker runs during development, where users may benefit from validating their SQL and checking some underlying data without having to wait for a potentially large table to generate and driving compute costs through the roof. This could potentially be achieved by using the `TABLESAMPLE` functionality of Snowflake, or a straightforward `LIMIT` wrapping the call to source.

**Who will this benefit?**

People who use Snowflake and want to run fast CI tests to validate the SQL in their models, and who have large tables which would cause CI/CD to run for a very long time and/or be very costly.

Very happy to discuss a few things around approaches and to get cracking with a PR down the line.
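The wrapping strategies mentioned above can be sketched as plain Snowflake SQL (the table name is a placeholder, and the sampling percentage and row limit are arbitrary assumptions):

```sql
-- Dry-run-style validation: the query is compiled and type-checked,
-- but the "where 1 = 0" predicate means no rows are returned.
select * from (select * from raw.foo.bar) where 1 = 0;

-- Development sampling: Snowflake's TABLESAMPLE returns roughly 1% of rows.
select * from raw.foo.bar tablesample (1);

-- Or a simple LIMIT wrapper around the source.
select * from (select * from raw.foo.bar limit 1000);
```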