Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using partition in model config causes state:modified false positive #3645

Closed
1 of 5 tasks
justinwagg opened this issue Jul 29, 2021 · 2 comments
Closed
1 of 5 tasks
Labels
bug Something isn't working state Stateful selection (state:modified, defer)

Comments

@justinwagg
Copy link

Describe the bug

Using partitions= with static partitions as part of a model config can cause false positives when using state:modified.

Steps To Reproduce

I have the following model

# bugtest.sql
{%- set now = modules.datetime.datetime.utcnow() -%}
{%- set backfill_days = 2 -%}
{%- set start_date_str = (now - modules.datetime.timedelta(days=backfill_days)).strftime("%Y-%m-%d %H:%M:%S") -%}
{%- set end_date_str = now.strftime("%Y-%m-%d %H:%M:%S") -%}

{%- set partitions_to_replace = dates_in_range(
  start_date_str,
  end_date_str,
  in_fmt="%Y-%m-%d %H:%M:%S",
  out_fmt="'%Y-%m-%d %H:%M:%S'"
)-%}

{{
  config(
    materialized = 'incremental',
    partition_by = {
      'field': 'created_at', 
      'data_type': 'timestamp'
    },
    incremental_strategy = 'insert_overwrite',
    partitions = partitions_to_replace
  )
}}


select created_at
,user_id
from {{ source('schema', 'table') }}
limit 1

I run

mkdir ../targetbackup
dbt docs generate --profiles-dir .
cp -r ./target/ ../targetbackup/

to compile the project and backup the target folder (as would be done to store artifacts). Immediately after with no model changes, I run

dbt ls -m state:modified --profiles-dir . --state ../targetbackup
# and/or run
dbt docs generate --profiles-dir .
dbt ls -m state:modified --profiles-dir . --state ../targetbackup

and am told my model has been modified.

Expected behavior

I expect the file to not be identified as changed. As the checksum indicates, it has not been changed.

Screenshots and log output

I compare the two manifest.json files, to see what has changed.

Screen Shot 2021-07-28 at 9 53 55 PM

I can see the checksum of the file is the same, it has not changed, but the partitions have changed.

My outputs:

Screen Shot 2021-07-28 at 10 04 36 PM

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

(venv) ➜  bugtest  dbt --version
installed version: 0.20.0
   latest version: 0.20.0

Up to date!

Plugins:
  - bigquery: 0.20.0
  - snowflake: 0.20.0
  - redshift: 0.20.0
  - postgres: 0.20.0

The operating system you're using:

OSX Big Sur 11.3.1

The output of python --version:

(venv) ➜  bugtest  python --version
Python 3.8.0

Additional context

In the Slim CI scenario I have seen many false positives occurring despite no models changing PR to PR. When digging into why, the models that are said to have changed have the usage of static partitions= in common.

I have read the issue #2941 and checked the macro dates_in_range, which appears to return a list, not a tuple. https://github.com/dbt-labs/dbt/blob/1dd4187cd016bcf2ace5518eb7304645a2394655/core/dbt/include/global_project/macros/etc/datetime.sql#L13

This doesn't seem to happen when using the example docs example here. I'm assuming that's because the dates are rendered in the database not in dbt.

I did not double check this, but I recognized this back in 0.18.0 as well.

Thank you for looking at this! Cheers.

@justinwagg justinwagg added bug Something isn't working triage labels Jul 29, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 3, 2021

Thanks for opening @justinwagg, and for the detailed reproduction case!

This is a known limitation of state comparison, under the category of false positives. The linked documentation focuses in greater depth on env-specific logic, since that's much more common, but the same principle applies: a config is rendered, stored in the manifest.json, and then used as comparison against the same Jinja expression rendered in a different environment/time/context, with different results.

It's exactly as you say—if you take a look at the manifests produced in each run, you'll see a different values of partitions in unrendered_config. So why call it unrendered_config, when those values are clearly rendered? In v0.19, we introduced the ability to store and compare unrendered Jinja expressions passed to configs in dbt_project.yml. Truthfully, though, dbt is not yet clever enough to do the same for configs set in the Jinja config() macro. It has to render the entire Jinja template to store the final results, and it can't pause to store the intermediate results halfway through.

There are a few proposed resolutions to this:

  • Medium-term: We could split up state:modified into sub-selectors (Subselectors for state:modified #2704), and offer users finer control—the ability to account for some modification reasons but not others. In this case, the bugtest model is being selected by state:modified.configs (which compares unrendered_configs), but it would not be selected by state:modified.body (which compares only checksum).
  • Long-term: We're slowly but surely building out the constructs to eventually analyze, extract, and store the unrendered config values, and use those for state:modified comparisons instead (Use static analyzer to extract unrendered_config from config() #3680).

short-term workaround

In the short term, could you get away without these being timestamps, but instead being dates? If you adjust the datetime formatting to exclude time values:

{%- set now = modules.datetime.datetime.utcnow() -%}
{%- set backfill_days = 2 -%}
{%- set start_date_str = (now - modules.datetime.timedelta(days=backfill_days)).strftime("%Y-%m-%d") -%}
{%- set end_date_str = now.strftime("%Y-%m-%d") -%}

{%- set partitions_to_replace = dates_in_range(
  start_date_str,
  end_date_str,
  in_fmt="%Y-%m-%d",
  out_fmt="'%Y-%m-%d"
)-%}

Then it returns:

"partitions": [
    "'2021-08-01",
    "'2021-08-02",
    "'2021-08-03"
],

So long as you're comparing against a manifest produced on the same UTC day, dbt wouldn't detect a false-positive modification.

Given that the medium- and long-term resolutions to this issue are well understood, and explained in further detail elsewhere, I'm going to close this issue for now.

@jtcohen6 jtcohen6 added state Stateful selection (state:modified, defer) and removed triage labels Aug 3, 2021
@jtcohen6 jtcohen6 closed this as completed Aug 3, 2021
@justinwagg
Copy link
Author

Thank you @jtcohen6 for such detail, I appreciate it! This all makes complete sense. I was suspicious this fell into the known limitation category, and hoped the extra detail would help make it quick to determine if so!

unrendered_config what a name and place for rendered configs 🤣...

I'm already using dates, and came across this as manifests went stale over the weekend in a bucket waiting to be used on Monday in CI. For now I've solved by checking out both main and the feature branch, generating both manifests at once, and then comparing, tricking the config into being the same. It would not resolve the timestamp situation, but for now dates are all I need!

Thanks again! 😎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working state Stateful selection (state:modified, defer)
Projects
None yet
Development

No branches or pull requests

2 participants