[CT-2967] [Feature] run dbt in --sample mode #8378

MichelleArk · 2023-08-12T04:10:21Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Introduce a new boolean flag, --sample/--no-sample, that informs whether dbt should run in 'sample' mode.

In sample mode, dbt would resolve ref and source identifiers as select * from <identifier> limit <sample-size>.

A default sample size could be used (perhaps 100?), or set in config at the model-level to inform the size of the sample to select when referencing the model.

models:
  - name: my_model
     config: 
       sample: 
         size: 10

This would also enable setting a sample size using the hierarchical config syntax in dbt_project.yml:

--- dbt_project.yml
models: 
  <resource-path>:
    + sample: {<dictionary>}

It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from <identifier> where <filters> limit <sample-size>. This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.

We may also want to create an equivalent env_var, so folks could set a sample for a given environment

Describe alternatives you've considered

It's possible to implement something like this with overrides of ref & source macros, but that makes dbt upgrades more difficult.

Also considered a simpler version of this where --sample is an integer argument representing the sample size in #8337. This doesn't leave design space for configuring filters on model/source samples however.

Who will this benefit?

any dbt user looking for speedier (at the cost of completeness/accuracy) development & CI runs!

Are you interested in contributing this feature?

yep!

Anything else?

Took a first pass at this in a spike here: #8337

The text was updated successfully, but these errors were encountered:

jaypeedevlin · 2023-08-14T01:47:16Z

Hey @MichelleArk, excited about this enhancement, as it intersects with some planned work that we were going to execute using ref/source overrides!

We definitely would love to see this in a really configurable form, like the where config you've suggested, or even additionally an order by (ie order by random()). Essentially, the more customizable the better from our perspective!

jtcohen6 · 2023-08-14T10:00:44Z

Thanks for opening @MichelleArk! :)

I love how simple the --sample boolean flag is, with all the actual configuration defined in code.

Question: Is this sufficient in practice? Will users need the ability to define multiple different sampling strategies for the same models, to run in different environment (dev vs. CI vs. QA vs. ...)?

If so: Should that look like lots of vars / env vars within the sample config block? Or, should it look like different "named" sample strategies that can be referenced on the CLI?

I'd offer a few more arguments in favor of supporting column-specific filters:

Pick a few representative segments (customers/regions/etc) in development & CI → sense-check in downstream dashboarding ("last-month revenue for Bouches-du-Rhône isn't looking quite right")
For time-series data, limit based on recency (e.g. "last 3 days"), rather than data scattered across time
Benefit from partition pruning on DWHs (like BQ) where that's the only sure way to achieve better time/cost savings

This could look like, defining additional attributes on columns in a model's properties:

# models/properties.yml
models:
  - name: my_model
    columns:
      - name: customer_id
        sample: [10, 30]  # array -> 'in' filter
      - name: created_at
        sample: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}"  # string -> filter as is

Then, anywhere you write:

-- models/another_model.sql
with my_model as (
    select * from {{ ref('my_model') }}
), ...

dbt would actually resolve that as a subquery like:

-- models/another_model.sql
with my_model as (
    select * from (
       select * from <database>.<schema>.my_model
       where customer_id in (10, 30)
          and created_at > '2023-08-11'  # writing this on Aug 14
    ) subq
), ...

Or, it could also be a config that's set for many models at once (or even every model in the project):

# dbt_project.yml
models:
  my_project:
    +sample:
      customer_id: [10, 30]  # array -> 'in' filter
      created_at: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}"  # string -> filter as is

The latter option would require dbt to check the set of columns first, to see if fields named customer_id and created_at actually exist (case-insensitive of course). As dbt is templating out each source / ref call, it would figure out if the model/source being referenced has those columns. I think we could do that in one of two ways:

If those models have enforced contracts → save an introspective DWH query
If those models already exist in the development schema (or defer to prod) → introspective DWH query

Bonus considerations:

Making sure this works for tests. For generic tests, it's like a default set of where configs—maybe even better!—but the two features should work happily together.
Making sure this works for Python models (!), where the filters might need to be applied via filter(...).filter(...) (snowpark, pyspark, pandas) instead of where ... and ....

@jaypeedevlin I'm happy to hear that this piques your interest! I'm curious - what would be your motivation for order by random()? I'm thinking it's the exact opposite of what I'm describing above: give me a truly random smattering of data, which is unlikely to match up with other models in downstream joins / dashboarding.

QuentinCoviaux · 2023-08-14T11:40:04Z

Love this topic!

I also agree that column-specific filters predicates might be more interesting for consistency across models or even just on its own. If you have years of data in your model, I feel it might be more relevant to get everything from the past month rather than random data points scattered - although probably depends on the use cases and what you are trying to test!

I'm wondering if we could customize where to apply this filter though. As I understand it, if one sets this in a .yml file then I assume it would be executed at the end of the model whereas it might be better performance-wise to run it early - especially if you have big tables and a lot of transformations in a model.

We have been doing something similar, on the first group of CTEs of certain models where we simply select from source or ref, we added this block:

select * from {{ ref('stage') }}
{% if target.name != 'prod' and var("dev_load_full") != true %}
where to_date("timestamp") >= dateadd('day', -{{ var("dev_load_days") }}, current_date())
{% endif %}

And we define those variables in the dbt_project.yml as default:

vars:
  dev_load_full: false
  dev_load_days: 7

This way we can sample the data early in the query rather than later.

jaypeedevlin · 2023-08-15T01:58:56Z

@jtcohen6 that was a reasonably arbitrary example of truly random sampling. Most of our models are incremental so we'd be (in theory) randomly sampling just the most recent data anyway. Definitely not the most sophisticated method for my example though, I'll grant you that, but just arguing to make this as configurable as possible.

graciegoheen · 2023-08-23T00:40:32Z

Really love the conversation going here - I have personally written a handful of hack-y work-arounds to accomplish this exact thing, so would be excited to see a native approach in dbt-core, especially for folks trying to optimize time and warehouse spend! Curious about the name "sample" vs. something like "limit" - I know we already have a limit config for test outputs and a limit for the dbt show command. Thinking more about it, I like that the name "sample" is distinguished from "limiting the outputs" (which it's not!).

I can also imagine folks wanting to configure different samples per environment, which is something we could enable using environment variables.

# dbt_project.yml
models:
  my_project:
    +sample: {{ env_var('sample') }}

I'm definitely in favor of folks being able to define a filter, which we recommend in our best practices guide and has come up in a handful of solutions.

@QuentinCoviaux my understanding here is that running dbt in --sample mode actually affects what {{ ref() }} and {{ source() }} compile to, rather than adding a limit at the end of a model. So, for example, if we had a model with many CTEs:

with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

payments as (

    select * from {{ ref('stg_payments') }}

),

customer_orders as (

        select
        customer_id,

        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders

    group by customer_id

),

customer_payments as (

    select
        orders.customer_id,
        sum(amount) as total_amount

    from payments

    left join orders on
         payments.order_id = orders.order_id

    group by orders.customer_id

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value

    from customers

    left join customer_orders
        on customers.customer_id = customer_orders.customer_id

    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id

)

select * from final

when executing in --sample mode, the {{ ref() }} statements here would compile to their raw location plus their configured limits. So, if we had configured samples as such:

models:
  - name: stg_customers
     config: 
       sample: 
         size: 10
  - name: stg_orders
     config: 
       sample: 
         size: 50
  - name: stg_payments
     config: 
       sample: 
         size: 75

the compiled code would look something like this:

with customers as (

    select * from my_warehouse.my_schema.stg_customers limit 10

),

orders as (

    select * from my_warehouse.my_schema.stg_orders limit 50

),

payments as (

    select * from my_warehouse.my_schema.stg_payments limit 75

),

customer_orders as (

        select
        customer_id,

        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders

    group by customer_id

),

customer_payments as (

    select
        orders.customer_id,
        sum(amount) as total_amount

    from payments

    left join orders on
         payments.order_id = orders.order_id

    group by orders.customer_id

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value

    from customers

    left join customer_orders
        on customers.customer_id = customer_orders.customer_id

    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id

)

select * from final

In that case, we would actually be filtering as early as possible.

I have seen this sort of problem solved in 2 ways:

overriding the ref & source macros (this solution)
adding a limit macro to the end of relevant model (there's lots of old discussions on this)

I'd be curious @jtcohen6 and @MichelleArk for y'all's thoughts on the pros / cons of each approach - and why "sampling" is better than "limiting".

dbeatty10 · 2023-11-13T23:28:53Z

Insights about sampling

"Why not just take samples?"

Sampling is actually pretty hard. Data developers will sometimes use sample-ish hacks (like testing with the last 3 days’ worth of data, or using a sample function built into the database) but these can often come back to bite you. Samples need to take into account referential integrity and unevenly distributed data if they are to be any use when testing.

One solution

SnowShu is a sampling engine designed to support testing in data development.

Here's one solution for sampling by Health Union:

https://snowshu.readthedocs.io/en/latest/

Caveat: I haven't tried it out personally.

Interesting links

Here's some other interesting links from Health Union:

Medium: https://medium.com/@ethan.m.knox/introducing-snowshu-217d5bf7548a
PyPI: https://pypi.org/project/snowshu/
SnowShu: https://github.com/Health-Union/snowshu
Rest of their repos: https://github.com/Health-Union
Cross-database macros: https://github.com/Health-Union/dbt-xdb

graciegoheen · 2023-11-14T14:24:02Z

It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from where limit . This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.

Just adding a +1 here to being able to sample via adding a filter!

aranke · 2023-11-15T17:30:57Z

Great discussion here, just wanted to add my 2c stemming from a conversation with @dbeatty10:

Could we sample with an optional join_key config?

Here's an example of how I could see this working:

Sample selects n random rows from the customers table
join_key on transactions table allows us to find all transactions for those customers instead of randomly sampling from customers

This could be a natural extension to the work in #8652 (unit-test rows from seeds) to enable unit-testing on sample rows from models.

To wrap it all up, we could create a new “hybrid” test that takes this sample as an input but checks for data assertions as output (e.g., check that customer status is churned if they haven't made a purchase in the last 12 months).

This could help alleviate mocking fatigue in unit testing for fct/dim tables that depend on dozens of int/stg tables.

moseleyi · 2024-03-28T08:39:38Z

Sampling sounds like a great idea! I would also second the ask for full customisation of this configuration. Not just limit but ordering and even using a SQL filtering for it. For example in BigQuery, I would use one partition for example. However it could be expanded with time. Having it to begin with is a great start!

joellabes · 2024-10-31T17:42:26Z

Bonus considerations:

Making sure this works for tests. For generic tests, it's like a default set of where configs—maybe even better!—but the two features should work happily together.

Just checking whether this will be included? (🥺)

MichelleArk added enhancement New feature or request triage labels Aug 12, 2023

github-actions bot changed the title ~~[Feature] run dbt in --sample mode~~ [CT-2967] [Feature] run dbt in --sample mode Aug 12, 2023

jtcohen6 removed the triage label Aug 14, 2023

jtcohen6 added the Refinement Maintainer input needed label Aug 14, 2023

graciegoheen mentioned this issue Nov 7, 2023

[CT-3309] --empty flag to to provide a cost-effective way to run a selection of empty models for dbt build and run #8980

Closed

1 task

graciegoheen added this to the v1.8 milestone Nov 9, 2023

This was referenced Nov 29, 2023

[CT-3431] [Feature] Add support for sql_footer #9167

Closed

Support --empty flag for schema-only dry runs #8971

Merged

graciegoheen mentioned this issue Jan 30, 2024

[Feature] event time config #9490

Closed

3 tasks

martynydbt modified the milestones: v1.8, v1.9 Feb 12, 2024

graciegoheen removed this from the v1.9 milestone Aug 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-2967] [Feature] run dbt in --sample mode #8378

[CT-2967] [Feature] run dbt in --sample mode #8378

MichelleArk commented Aug 12, 2023 •

edited by graciegoheen

Loading

jaypeedevlin commented Aug 14, 2023

jtcohen6 commented Aug 14, 2023 •

edited

Loading

QuentinCoviaux commented Aug 14, 2023

jaypeedevlin commented Aug 15, 2023

graciegoheen commented Aug 23, 2023 •

edited

Loading

dbeatty10 commented Nov 13, 2023

graciegoheen commented Nov 14, 2023

aranke commented Nov 15, 2023 •

edited

Loading

moseleyi commented Mar 28, 2024

joellabes commented Oct 31, 2024

[CT-2967] [Feature] run dbt in --sample mode #8378

[CT-2967] [Feature] run dbt in --sample mode #8378

Comments

MichelleArk commented Aug 12, 2023 • edited by graciegoheen Loading

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

jaypeedevlin commented Aug 14, 2023

jtcohen6 commented Aug 14, 2023 • edited Loading

QuentinCoviaux commented Aug 14, 2023

jaypeedevlin commented Aug 15, 2023

graciegoheen commented Aug 23, 2023 • edited Loading

dbeatty10 commented Nov 13, 2023

Insights about sampling

One solution

Interesting links

graciegoheen commented Nov 14, 2023

aranke commented Nov 15, 2023 • edited Loading

moseleyi commented Mar 28, 2024

joellabes commented Oct 31, 2024

MichelleArk commented Aug 12, 2023 •

edited by graciegoheen

Loading

jtcohen6 commented Aug 14, 2023 •

edited

Loading

graciegoheen commented Aug 23, 2023 •

edited

Loading

aranke commented Nov 15, 2023 •

edited

Loading