
Configuring a model's TTL #1472

Closed
mlavoie-sm360 opened this issue May 17, 2019 · 6 comments

@mlavoie-sm360

Feature

Feature description

I would like to propose the idea of adding ttl as part of a model's optional configurations.

Simply put, this would indicate the expected life expectancy of the data. If a run involves that model and its life expectancy has not expired, the job can simply skip materializing the table to save time.

ttl could be defined as a unit of time in seconds. In the example below, the model would be refreshed at most once every 3600 seconds (1h):

{{-
  config(
    ttl=3600,
  )
-}}

...

Who will this benefit?

I think this feature would be most useful for larger data pipelines with cross-dependencies. As a project grows, it becomes more likely that different datasets require different refresh rates.

For example, you may have one table, tableA, that pulls info from a source table that updates once a day. Another table, tableB, pulls info from a source table that updates every minute. Assuming both of these tables are required to materialize yet another table, tableC, it is clearly suboptimal to refresh tableA at the same rate as tableB and tableC. In fact, we could leave tableA alone for a while and still keep tableC 'fresh' for most of the day.

ttl would allow data engineers to provide extra information to prevent unnecessary computation and reduce the time required to refresh entire data pipelines.

@drewbanin
Contributor

hey @mlavoie-sm360 - thanks for the feature request!

We typically recommend handling something like this from an orchestrator. You might be able to approximate this by adding tags to your models (like hourly, and nightly), then just re-running those models on a specified interval. That might look like:

# Run this hourly
dbt run --models tag:hourly+

# Run this nightly
dbt run

You also mentioned that you have source tables that are loaded with varying frequencies. Maybe another option is to use sources to denote the link between your source tables and dbt models? Then you could do:

# Run models that depend on Source A hourly
dbt run --models source:sourceA+

# Run models that depend on Source B more frequently
dbt run --models source:sourceB+

I think the ttl you've described here is pretty interesting, but there are a couple of issues I'd like to think through further:

  1. dbt doesn't store any state outside of the database - where should we record the timeout timestamp?
  2. what happens when the model logic changes? Do consumers just get the old version of the table until the ttl expires?

Of these, I think the second issue is the bigger concern -- we could figure out some way to store state if it proved valuable for a use case like this. I have a harder time wrapping my head around how the ttl should behave in typical workflow settings. Curious to hear your thoughts on whether this logic can be pushed into the orchestrator, or if you think there are other benefits of using a ttl here.

Thanks again!

@mlavoie-sm360
Author

Hey @drewbanin - thanks for the feedback!

You are right when you talk about approximating this behaviour by using tags and an orchestrator. In fact, that is our current approach to solving this issue, but I know it could be better. Bear with me here.

Our orchestrator is Airflow. In our SQL, we use tags to identify when things should be refreshed (daily, twice_a_day, every_3_hours, hourly, near_rt, etc.). These tags are only applied to 'leaf' queries, the ones that don't have children.

We have a handful of tables that we have identified as candidates for TTLing; these are usually part of a schedule that refreshes more often than is really necessary for that table. For example, tableA is part of a flow that refreshes every hour because data in tableB refreshes every hour and they are both required to generate tableC, as I explained earlier.

To avoid refreshing tableA, we apply a ttl tag to tableA, and our orchestrator applies --exclude tag:ttl whenever it runs its normal scheduled jobs (daily, twice_a_day, etc.). The orchestrator then has a separate task for tableA with its own custom schedule. A new task is created manually for each new ttl-tagged table (see the sketch below).
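To make the pattern concrete, here is a minimal sketch of the two pieces, assuming Airflow 1.x's BashOperator; the DAG ids and schedules are illustrative, not our actual configuration.

# Minimal sketch of the tag/exclude pattern described above (assumptions:
# Airflow 1.x, illustrative DAG ids and schedules).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Normal scheduled job: run the hourly models, skipping anything tagged ttl.
hourly_dag = DAG(
    dag_id="dbt_hourly",
    schedule_interval="@hourly",
    start_date=datetime(2019, 5, 1),
)
BashOperator(
    task_id="dbt_run_hourly",
    bash_command="dbt run --models tag:hourly+ --exclude tag:ttl",
    dag=hourly_dag,
)

# Manually created companion task: tableA refreshes on its own slower schedule.
ttl_dag = DAG(
    dag_id="dbt_tableA_ttl",
    schedule_interval="0 */6 * * *",
    start_date=datetime(2019, 5, 1),
)
BashOperator(
    task_id="dbt_run_tableA",
    bash_command="dbt run --models tableA",
    dag=ttl_dag,
)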

It works!

But I feel like it could be simpler and DRYer. Every time we identify one of these TTL candidates, two things need to happen: the SQL needs to be tagged with ttl, and a new task needs to be created in the orchestrator to run that table on its own schedule. This creates a risk: there is no way to know that every ttl tag has its matching task. On a small scale that is not an issue, but as your pipeline grows, the concerns grow with it. If both of these things could happen in an atomic kind of way, the risk would be reduced.

I shared all of this to add more context to the conversation; hopefully it helps paint a clearer picture of the need I am hoping this feature would cover.

In regards to your two questions, I think they are right on, so let me play ball a little with you and give it a go.

Questions

I'll start with this one: what happens when the model logic changes? Do consumers just get the old version of the table until the ttl expires?

Absolutely, things will keep running as they are until the time to live expires. What if I need it to expire right away? Perhaps using force-refresh could ignore the ttl flag, like so: dbt run -m +tag:daily --force-refresh?

No matter the approach taken, I feel like that question has the same answer; the underlying question is the one we should really ask: how do you refresh the table earlier when required?

Second: dbt doesn't store any state outside of the database - where should we record the timeout timestamp?

Great question. I don't know enough about the inner workings of dbt to make a statement on that. Since we are bouncing ideas, I am going to go in a completely different direction than the one I intended in the first place: what if dbt didn't handle any of this other than to expose the information?

What if I could do something like the following, and ask dbt to list every query tagged ttl, with an elegant way of also getting the configuration values back (in JSON format, for example)?

{{-
  config(
    tags=['ttl'],
    my_custom_var='0 0 * * *'
  )
-}}

select * from ...

It would then be easy to create a much DRYer task in the orchestrator that asks for the list of tables with a given tag and pulls out the required info to derive the schedule for that task. We would then have increased the flexibility of dbt without increasing its complexity.

dbt's role here would be to provide enough flexibility within its own framework to be leveraged by other systems for more complex things, without adding more bells and whistles.

What are your thoughts?

@mlavoie-sm360
Author

Two other ideas regarding the question of where to record the timeout timestamp:

  • Using a mechanism similar to source snapshot-freshness
  • Using a mechanism similar to the example given to create audit tables

@drewbanin
Contributor

Hey @mlavoie-sm360 - thanks for the measured and thoughtful response! I think you made an incredibly good point here. What do you think about this?

What if I could do something like the following, and ask dbt to list every query tagged ttl, with an elegant way of also getting the configuration values back (in JSON format, for example)?

We're adding a dbt ls command for v0.14.0. This command accepts model selectors, then returns all of the matching resources (e.g. models).

The feature spec shows some options (docs to come!). You could do something like:

$ dbt ls --models tag:ttl --output json
{
    "name": "my_model",
    "resource_type": "model",
    "package_name": "debug",
    ...
    "config": {
        "enabled": true,
        "materialized": "incremental",
        "my_custom_var": "0 0 * * *"
    }
}

The output is one line per matched resource, so you could pipe this into some process that plucked out a config field and operated on it accordingly.
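For instance, here is a minimal sketch of that piping step, assuming the one-JSON-object-per-line shape shown above; my_custom_var is the hypothetical config key from earlier in this thread.

# Minimal sketch of consuming `dbt ls` output (assumptions: one JSON object
# per line as shown above; my_custom_var is the hypothetical key from earlier).
import json
import subprocess

result = subprocess.run(
    ["dbt", "ls", "--models", "tag:ttl", "--output", "json"],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.splitlines():
    resource = json.loads(line)
    schedule = resource.get("config", {}).get("my_custom_var")
    if schedule:
        # Hand the model name and cron string to your orchestrator's task factory.
        print(resource["name"], schedule)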

@mlavoie-sm360
Author

I feel confident that would allow us to be DRY and ATOMIC in our setup. I am looking forward to giving this approach a try.

@drewbanin
Contributor

Hey @mlavoie-sm360 - dbt ls is merged and going live in dbt v0.14.0! Check out the (prerelease) docs here: https://docs.getdbt.com/v0.14/reference#list
