Configuring a model's TTL #1472
Comments
hey @mlavoie-sm360 - thanks for the feature request! We typically recommend handling something like this from an orchestrator. You might be able to approximate this by adding tags to your models and running each group of models on its own schedule.
You also mentioned that you have source tables that are loaded with varying frequencies. Maybe another option is to use sources to denote the link between your source tables and dbt models? Then you could do:
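The original examples were lost in this thread, but the two suggestions might look something like this (tag and source names here are illustrative, and selector syntax varies across dbt versions):

```
# Run only the models tagged for a given refresh cadence
dbt run --models tag:nightly

# Or run everything downstream of a given source
dbt run --models source:raw_events+
```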
I think the `ttl` config raises a couple of questions: first, what happens when the model logic changes? Do consumers just get the old version of the table until the ttl expires? And second, dbt doesn't store any state outside of the database -- where should we record the timeout timestamp?

Of these, I think the second issue is the bigger concern -- we could figure out some way to store state if it proved valuable for a use case like this. I have a harder time wrapping my head around how the ttl should behave in typical workflow settings. Curious to hear your thoughts on whether this logic can be pushed into the orchestrator, or if you think there are other benefits of using a `ttl` config in dbt. Thanks again!
Hey @drewbanin - thanks for the feedback! You are right that this behaviour can be approximated by using tags and an orchestrator. In fact, that is our current approach to solving this issue, but I know it could be better. Bear with me here.

Our orchestrator is Airflow. In our SQL, we use tags to identify when things should be refreshed, and we have a handful of tables that we have identified as candidates for a `ttl`-style treatment. To avoid refreshing those tables more often than necessary, we schedule them separately from the rest of the pipeline.

It works! But I feel like it could be DRYer. I shared all of this to add more context to the conversation; hopefully it helps paint a clearer picture for pinning down the need I am hoping this feature would cover. In regards to your two questions, I think they are right on, so let me play ball a little with you and give it a go.

Questions

I'll start with this one: what happens when the model logic changes? Do consumers just get the old version of the table until the ttl expires? Absolutely, things will keep running as they are until the time to live expires. What if I need it to expire right away? Perhaps an explicit override (something like a full refresh) could force it. No matter the approach taken, I feel like that question has the same answer; the underlying question is the one I feel we should ask: how do you refresh the table earlier if required?

Second: dbt doesn't store any state outside of the database -- where should we record the timeout timestamp? Great question. I don't know enough about the inner workings of dbt to make a statement on that. Since we are bouncing ideas, I am going to go in a completely different direction than the one I intended in the first place: what if dbt didn't handle any of this other than to expose the information? What if I could ask dbt to list every model carrying a given tag, along with its configuration?
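The example that originally accompanied this comment was lost; a sketch of what exposing that information in a model file might look like (hypothetical -- the `ttl` key, `meta` usage, and model name are illustrative, not an existing dbt convention):

```
-- models/tableA.sql (illustrative)
{{ config(
    materialized='table',
    tags=['ttl'],
    meta={'ttl': 3600}  -- proposed: refresh at most once per hour
) }}

select * from {{ ref('some_upstream_model') }}
```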
It would then be easy to create a much DRYer task in the orchestrator: ask for a list of tables with a given tag, then pull out the required info to derive the schedule for that task. We would then have increased the flexibility of dbt without increasing its complexity. dbt's role here would be to provide enough flexibility within its own framework to be leveraged by other systems to do more complex things, without adding more bells and whistles. What are your thoughts?
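The orchestrator side of that idea could be sketched like this (assumptions: a hypothetical machine-readable `dbt ls` output with one JSON object per line, and a made-up `ttl` config key -- neither is guaranteed by dbt):

```python
import json
from collections import defaultdict

def schedules_from_dbt_ls(ls_output):
    """Group model names by their configured ttl, one schedule per ttl.

    ls_output: the stdout of a hypothetical `dbt ls --output json`,
    one JSON object per line with `name` and `config` keys.
    """
    schedules = defaultdict(list)
    for line in ls_output.splitlines():
        if not line.strip():
            continue
        node = json.loads(line)
        ttl = node.get("config", {}).get("ttl")  # hypothetical config key
        if ttl is not None:
            schedules[ttl].append(node["name"])
    return dict(schedules)

sample = "\n".join([
    '{"name": "tableA", "config": {"ttl": 86400}}',
    '{"name": "tableB", "config": {"ttl": 60}}',
    '{"name": "tableC", "config": {"ttl": 60}}',
])
print(schedules_from_dbt_ls(sample))
# → {86400: ['tableA'], 60: ['tableB', 'tableC']}
```

Each `ttl` bucket could then map to one Airflow task (or DAG) with a matching schedule interval.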
Two other ideas regarding the question
Hey @mlavoie-sm360 - thanks for the measured and thoughtful response! I think you made an incredibly good point here. What do you think about this?
We're adding a `dbt ls` command for v0.14.0. This command accepts model selectors, then returns all of the matching resources (e.g. models). The feature spec shows some options (docs to come!). You could do something like:
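The command example originally shown here was lost; it would plausibly have been along these lines (tag name illustrative, flags per the feature spec and subject to change):

```
# List every model carrying a given tag
dbt ls --models tag:nightly
```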
The output is one line per matched resource, so you could pipe this into some process that plucks out the relevant configuration for each model.
I feel confident that would allow us to be DRY and ATOMIC in our setup. I am looking forward to giving this approach a try.
Hey @mlavoie-sm360 -
Feature
Feature description
I would like to propose the idea of adding `ttl` as one of a model's optional configurations.

Simply put, this would give an indicator of the expected life expectancy of the data. If a `run` is called that involves the said model and its life expectancy has not expired, the run job can simply skip the materialization of this table to save time. `ttl` could be defined as a unit of time in seconds; for example, with a `ttl` of 3600, the model would be refreshed at most once every 3600 seconds (1 hour).

Who will this benefit?
I think this feature would be most useful for larger data pipelines that may have cross dependencies. As a project grows, it is more likely that different datasets require different refresh rates.
For example, you may have one table, `tableA`, that is pulling info from a source table that updates once a day. Another table, `tableB`, pulls info from a source table that updates every minute. Assuming both of these tables are required to materialize yet another table, `tableC`, we understand logically that it is less than optimal to refresh `tableA` at the same rate as `tableB` and `tableC`. In fact, we could leave `tableA` alone for a while and still keep `tableC` 'fresh' for most of the day.

`ttl` would allow data engineers to provide extra information to prevent unnecessary computation and reduce the time required to refresh entire data pipelines.
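The skip logic described above can be made concrete with a minimal sketch (assuming dbt, or an orchestrator, could record a last-run timestamp per model; all names and values here are illustrative):

```python
import time

def models_to_refresh(models, last_run, now=None):
    """Return the names of models whose ttl has expired (or that never ran).

    models:   dict of model name -> ttl in seconds (None = always refresh)
    last_run: dict of model name -> unix timestamp of the last materialization
    """
    now = time.time() if now is None else now
    due = []
    for name, ttl in models.items():
        last = last_run.get(name)
        # No recorded run, or no ttl configured: always materialize.
        if last is None or ttl is None or now - last >= ttl:
            due.append(name)
    return due

models = {"tableA": 86400, "tableB": 60, "tableC": 60}
last_run = {"tableA": 1_000_000, "tableB": 1_000_000, "tableC": 1_000_000}

# 120 seconds after the last run: tableA is still inside its 24h ttl,
# while tableB and tableC are due again.
print(models_to_refresh(models, last_run, now=1_000_120))
# → ['tableB', 'tableC']
```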