using partition in model config causes state:modified false positive #3645

justinwagg · 2021-07-29T02:21:57Z

Describe the bug

Using partitions= with static partitions as part of a model config can cause false positives when using state:modified.

Steps To Reproduce

I have the following model

# bugtest.sql
{%- set now = modules.datetime.datetime.utcnow() -%}
{%- set backfill_days = 2 -%}
{%- set start_date_str = (now - modules.datetime.timedelta(days=backfill_days)).strftime("%Y-%m-%d %H:%M:%S") -%}
{%- set end_date_str = now.strftime("%Y-%m-%d %H:%M:%S") -%}

{%- set partitions_to_replace = dates_in_range(
  start_date_str,
  end_date_str,
  in_fmt="%Y-%m-%d %H:%M:%S",
  out_fmt="'%Y-%m-%d %H:%M:%S'"
)-%}

{{
  config(
    materialized = 'incremental',
    partition_by = {
      'field': 'created_at', 
      'data_type': 'timestamp'
    },
    incremental_strategy = 'insert_overwrite',
    partitions = partitions_to_replace
  )
}}


select created_at
,user_id
from {{ source('schema', 'table') }}
limit 1

I run

mkdir ../targetbackup
dbt docs generate --profiles-dir .
cp -r ./target/ ../targetbackup/

to compile the project and backup the target folder (as would be done to store artifacts). Immediately after with no model changes, I run

dbt ls -m state:modified --profiles-dir . --state ../targetbackup
# and/or run
dbt docs generate --profiles-dir .
dbt ls -m state:modified --profiles-dir . --state ../targetbackup

and am told my model has been modified.

Expected behavior

I expect the file to not be identified as changed. As the checksum indicates, it has not been changed.

Screenshots and log output

I compare the two manifest.json files, to see what has changed.

I can see the checksum of the file is the same, it has not changed, but the partitions have changed.

My outputs:

System information

Which database are you using dbt with?

The output of dbt --version:

(venv) ➜  bugtest  dbt --version
installed version: 0.20.0
   latest version: 0.20.0

Up to date!

Plugins:
  - bigquery: 0.20.0
  - snowflake: 0.20.0
  - redshift: 0.20.0
  - postgres: 0.20.0

The operating system you're using:

OSX Big Sur 11.3.1

The output of python --version:

(venv) ➜  bugtest  python --version
Python 3.8.0

Additional context

In the Slim CI scenario I have seen many false positives occurring despite no models changing PR to PR. When digging into why, the models that are said to have changed have the usage of static partitions= in common.

I have read the issue #2941 and checked the macro dates_in_range, which appears to return a list, not a tuple. https://github.com/dbt-labs/dbt/blob/1dd4187cd016bcf2ace5518eb7304645a2394655/core/dbt/include/global_project/macros/etc/datetime.sql#L13

This doesn't seem to happen when using the example docs example here. I'm assuming that's because the dates are rendered in the database not in dbt.

I did not double check this, but I recognized this back in 0.18.0 as well.

Thank you for looking at this! Cheers.

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2021-08-03T02:51:38Z

Thanks for opening @justinwagg, and for the detailed reproduction case!

This is a known limitation of state comparison, under the category of false positives. The linked documentation focuses in greater depth on env-specific logic, since that's much more common, but the same principle applies: a config is rendered, stored in the manifest.json, and then used as comparison against the same Jinja expression rendered in a different environment/time/context, with different results.

It's exactly as you say—if you take a look at the manifests produced in each run, you'll see a different values of partitions in unrendered_config. So why call it unrendered_config, when those values are clearly rendered? In v0.19, we introduced the ability to store and compare unrendered Jinja expressions passed to configs in dbt_project.yml. Truthfully, though, dbt is not yet clever enough to do the same for configs set in the Jinja config() macro. It has to render the entire Jinja template to store the final results, and it can't pause to store the intermediate results halfway through.

There are a few proposed resolutions to this:

Medium-term: We could split up state:modified into sub-selectors (Subselectors for state:modified #2704), and offer users finer control—the ability to account for some modification reasons but not others. In this case, the bugtest model is being selected by state:modified.configs (which compares unrendered_configs), but it would not be selected by state:modified.body (which compares only checksum).
Long-term: We're slowly but surely building out the constructs to eventually analyze, extract, and store the unrendered config values, and use those for state:modified comparisons instead (Use static analyzer to extract unrendered_config from config() #3680).

short-term workaround

In the short term, could you get away without these being timestamps, but instead being dates? If you adjust the datetime formatting to exclude time values:

{%- set now = modules.datetime.datetime.utcnow() -%}
{%- set backfill_days = 2 -%}
{%- set start_date_str = (now - modules.datetime.timedelta(days=backfill_days)).strftime("%Y-%m-%d") -%}
{%- set end_date_str = now.strftime("%Y-%m-%d") -%}

{%- set partitions_to_replace = dates_in_range(
  start_date_str,
  end_date_str,
  in_fmt="%Y-%m-%d",
  out_fmt="'%Y-%m-%d"
)-%}

Then it returns:

"partitions": [
    "'2021-08-01",
    "'2021-08-02",
    "'2021-08-03"
],

So long as you're comparing against a manifest produced on the same UTC day, dbt wouldn't detect a false-positive modification.

Given that the medium- and long-term resolutions to this issue are well understood, and explained in further detail elsewhere, I'm going to close this issue for now.

justinwagg · 2021-08-03T03:37:59Z

Thank you @jtcohen6 for such detail, I appreciate it! This all makes complete sense. I was suspicious this fell into the known limitation category, and hoped the extra detail would help make it quick to determine if so!

unrendered_config what a name and place for rendered configs 🤣...

I'm already using dates, and came across this as manifests went stale over the weekend in a bucket waiting to be used on Monday in CI. For now I've solved by checking out both main and the feature branch, generating both manifests at once, and then comparing, tricking the config into being the same. It would not resolve the timestamp situation, but for now dates are all I need!

Thanks again! 😎

justinwagg added bug Something isn't working triage labels Jul 29, 2021

jtcohen6 added state Stateful selection (state:modified, defer) and removed triage labels Aug 3, 2021

jtcohen6 closed this as completed Aug 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

using partition in model config causes state:modified false positive #3645

using partition in model config causes state:modified false positive #3645

justinwagg commented Jul 29, 2021

jtcohen6 commented Aug 3, 2021

justinwagg commented Aug 3, 2021

using partition in model config causes state:modified false positive #3645

using partition in model config causes state:modified false positive #3645

Comments

justinwagg commented Jul 29, 2021

Describe the bug

Steps To Reproduce

Expected behavior

Screenshots and log output

System information

Additional context

jtcohen6 commented Aug 3, 2021

short-term workaround

justinwagg commented Aug 3, 2021