
[CT-1196] [Bug] Inconsistency between dbt_valid_from and dbt_valid_to #5867

Closed
2 tasks done
mats-wetterlund-pierce opened this issue Sep 16, 2022 · 10 comments
Labels: bug, snapshots, Team:Adapters, stale, user docs

Comments

@mats-wetterlund-pierce

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I found an issue (or several) with snapshots using the timestamp strategy. It produces inconsistent data, and the documentation is missing information about the timestamps and how they work.
I have tried searching for bug reports but haven't found much in this area except issue #4347, which I suspect may be the reason for the problem I have, but as I haven't tested this before I don't know whether this is a regression.

More specifically, on hard deletes I get a dbt_valid_to that is before dbt_valid_from.
The main reason, in my opinion, is that dbt_valid_from comes straight from the source data with no regard to time zone, while dbt_valid_to is set to the UTC time on hard deletes; yet neither timestamp in the snapshot table carries a time zone.
There is also no validation that dbt_valid_to is after or equal to dbt_valid_from.
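
For example, a singular dbt test along these lines (hypothetical file name, referencing the snapshot defined below) would catch the inconsistency:

-- tests/assert_valid_to_not_before_valid_from.sql
-- Returns the offending rows; the test fails if any row is returned
select *
from {{ ref('data_snapshot') }}
where dbt_valid_to < dbt_valid_from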

Expected Behavior

dbt_valid_to should never be before dbt_valid_from for the same record.
The documentation should be updated to state that the field used for updated_at needs to be in the UTC time zone and that dbt_valid_to will be in the UTC time zone on hard deletes; alternatively, an attribute could be added to the snapshot config where one can specify which time zone should be used for the timestamps.

Steps To Reproduce

Snapshot code - data_snapshot.sql:

{% snapshot data_snapshot %}

{{
    config(
      strategy='timestamp',
      unique_key='data_key',
      updated_at='last_modified',
      invalidate_hard_deletes = true
    )
}}

SELECT
    data_key,
    last_modified
FROM
    raw_table

{% endsnapshot %}

Create a table with data:

CREATE TABLE raw_table AS
SELECT 'abc' AS data_key, CAST(CONVERT_TIMEZONE('Europe/Stockholm', CURRENT_TIMESTAMP()) AS DATETIME) AS last_modified;

| DATA_KEY | LAST_MODIFIED |
| --- | --- |
| abc | 2022-09-16 15:20:25.977000000 |

Run snapshot:

dbt snapshot --select data_snapshot

| DATA_KEY | LAST_MODIFIED | DBT_SCD_ID | DBT_UPDATED_AT | DBT_VALID_FROM | DBT_VALID_TO |
| --- | --- | --- | --- | --- | --- |
| abc | 2022-09-16 15:20:25.977000000 | 213a5f3f2f3a37846d0467d7964eb7bb | 2022-09-16 15:20:25.977000000 | 2022-09-16 15:20:25.977000000 | |

Remove source data:

DELETE FROM raw_table WHERE data_key = 'abc';

Run snapshot:

dbt snapshot --select data_snapshot

| DATA_KEY | LAST_MODIFIED | DBT_SCD_ID | DBT_UPDATED_AT | DBT_VALID_FROM | DBT_VALID_TO |
| --- | --- | --- | --- | --- | --- |
| abc | 2022-09-16 15:20:25.977000000 | 213a5f3f2f3a37846d0467d7964eb7bb | 2022-09-16 15:20:25.977000000 | 2022-09-16 15:20:25.977000000 | 2022-09-16 13:40:48.814000000 |

dbt_valid_to is before dbt_valid_from.


To complicate things further (perhaps not that common a real-world scenario):

Add data whose timestamp is before the last_modified of the previous data but after the dbt_valid_to value in the snapshot.
INSERT INTO raw_table VALUES ('abc', '2022-09-16 14:20:00.000000000');

Run snapshot:

dbt snapshot --select data_snapshot

| DATA_KEY | LAST_MODIFIED | DBT_SCD_ID | DBT_UPDATED_AT | DBT_VALID_FROM | DBT_VALID_TO |
| --- | --- | --- | --- | --- | --- |
| abc | 2022-09-16 15:20:25.977000000 | 213a5f3f2f3a37846d0467d7964eb7bb | 2022-09-16 15:20:25.977000000 | 2022-09-16 15:20:25.977000000 | 2022-09-16 13:40:48.814000000 |
| abc | 2022-09-16 14:20:00.000000000 | 7e0dfe7d09cc4895c551c6a4b1944c06 | 2022-09-16 14:20:00.000000000 | 2022-09-16 14:20:00.000000000 | |

The dbt_valid_from on the second set of data is before the dbt_valid_from in the first set of data.

Update the data again, this time with a timestamp that is after the timestamp of the first record:

UPDATE raw_table SET last_modified = '2022-09-16 16:20:00.000000000' WHERE data_key = 'abc';

Run snapshot:

dbt snapshot --select data_snapshot

| DATA_KEY | LAST_MODIFIED | DBT_SCD_ID | DBT_UPDATED_AT | DBT_VALID_FROM | DBT_VALID_TO |
| --- | --- | --- | --- | --- | --- |
| abc | 2022-09-16 15:20:25.977000000 | 213a5f3f2f3a37846d0467d7964eb7bb | 2022-09-16 15:20:25.977000000 | 2022-09-16 15:20:25.977000000 | 2022-09-16 13:40:48.814000000 |
| abc | 2022-09-16 14:20:00.000000000 | 7e0dfe7d09cc4895c551c6a4b1944c06 | 2022-09-16 14:20:00.000000000 | 2022-09-16 14:20:00.000000000 | 2022-09-16 16:20:00.000000000 |
| abc | 2022-09-16 16:20:00.000000000 | dbddbe4ac1fab10d820aa6d4ab956674 | 2022-09-16 16:20:00.000000000 | 2022-09-16 16:20:00.000000000 | |

Relevant log output

No response

Environment

dbt cloud: v1.0

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@mats-wetterlund-pierce added the bug and triage labels Sep 16, 2022
@jtcohen6
Contributor

One thing I quickly caught:

> dbt cloud: v1.0

> I have tried searching for bug reports but haven't found much in this area except issue #4347, which I suspect may be the reason for the problem I have, but as I haven't tested this before I don't know whether this is a regression.

The changes fixing that issue (#4513 + #5077) were included in v1.1. They were not included in v1.0.

@jtcohen6 added the snapshots and Team:Adapters labels Sep 19, 2022
@dbeatty10 self-assigned this Sep 24, 2022
@dbeatty10
Contributor

@mats-wetterlund-pierce Thanks for opening this! 📅 🕙

Thanks to your detailed examples, I was able to reproduce everything you described.

What is going on

When using the timestamp strategy, there are two different things you highlighted that can definitely affect the output of snapshots:

  1. if updated_at isn't monotonically increasing for each unique_key value for each successive snapshot
  2. if updated_at has the timestamp_ntz data type, but it represents a local time zone other than UTC

Where to go from here

I don't think this is a bug, but rather just the result of the timestamp strategy and hard-deletes expecting updated_at to have certain properties.

(I don't think that the check strategy can even have something like this happen since all of its timestamps are guaranteed to use the same source timezone and be monotonically increasing.)

You proposed a few things that we can (and should!) take action on, namely:

  1. Update the docs that dbt_valid_to will be in UTC time zone on hard deletes
  2. Update the docs that the field used for updated_at needs to be in UTC time zone

There are also some things you can do (that you might have already discovered on your own):

  1. Transform any timestamps that are in a local time zone into UTC before taking the snapshot.
    • You retain the option to transform the output of the snapshot into some other business-relevant time zone in downstream transformation steps
  2. Ensure that the updated_at timestamp reliably gets larger or stays the same for each key (monotonically increases)
    • if not, you'll either need to use the check strategy instead or produce your Type 2 SCD via a method other than snapshots (like using lead or lag within a window function)
  3. Ensure that the updated_at timestamp is <= convert_timezone('UTC', current_timestamp()) in your source data
    • Using dbt_utils.expression_is_true is one way to do that test (see here and here for examples); sketches of items 1 and 3 follow this list.
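
For item 1, a minimal sketch of the UTC conversion inside the snapshot query, assuming the source column holds Europe/Stockholm wall-clock time (Snowflake's three-argument convert_timezone takes and returns timestamp_ntz):

SELECT
    data_key,
    -- reinterpret Stockholm wall-clock time as UTC wall-clock time (still timestamp_ntz)
    convert_timezone('Europe/Stockholm', 'UTC', last_modified) AS last_modified
FROM
    raw_table

And for item 3, the same assertion written as a singular test (an alternative to dbt_utils.expression_is_true; hypothetical file name):

-- tests/assert_updated_at_not_in_future.sql
-- Fails if any source row claims to be updated after the current UTC time
select *
from raw_table
where last_modified > to_timestamp_ntz(convert_timezone('UTC', current_timestamp()))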

What you could do, but we don't recommend

⚠️ The following is not a recommendation, just transparency ⚠️

You asked about an attribute being added to the "snapshot config where one can specify which time zone should be used for the timestamps".

Although it's not recommended, I'm sharing the following for completeness' sake.

Rather than transforming all timestamps into UTC, you can override the following macro in your local project:

{% macro snowflake__snapshot_get_time() -%}
  {{ log("snowflake__snapshot_get_time", info=True) }}
  to_timestamp_ntz({{ current_timestamp() }})
{%- endmacro %}

I didn't actually test this out, but I think you'd just replace to_timestamp_ntz({{ current_timestamp() }}) with convert_timezone('Europe/Stockholm', current_timestamp()). Then when you do snapshots, they'd use this macro for the timestamp for hard deletes (and also for insert times if you switched to the check strategy).
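
Untested, but that override would presumably look something like this (the hard-coded time zone is just for illustration):

{% macro snowflake__snapshot_get_time() -%}
  {# -- untested sketch: render hard-delete timestamps in Europe/Stockholm instead of NTZ UTC -- #}
  convert_timezone('Europe/Stockholm', {{ current_timestamp() }})
{%- endmacro %}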

Why is this not recommended?

One reason is that (to the best of my knowledge) we consider this a "private" macro that is an implementation detail rather than being an intentional interface for user-configurable behavior. As such, it's liable to change in a future release without us considering it a breaking change.

Next steps

I'm planning on closing this in favor of making some documentation updates instead. I'll create an issue in https://github.com/dbt-labs/dbt-docs/issues and tag you in it so you can track along.

Thank you for highlighting where there are crucial details we should communicate better!

@dbeatty10 removed the triage label Nov 23, 2022
@jtcohen6 added the user docs label Nov 23, 2022
@mats-wetterlund-pierce
Author

mats-wetterlund-pierce commented Nov 24, 2022

> Update the docs that the field used for updated_at needs to be in UTC time zone

Why not change the snapshot code so that updated_at, dbt_valid_from, and dbt_valid_to are all stored as time-zone-aware timestamps in UTC, instead of every developer needing to take this into consideration and make sure that the incoming data is transformed?

Also, when using dbt_valid_to and dbt_valid_from from snapshots, it causes problems that they depend on the incoming data used for the timestamps, since some are in NTZ and some in TZ formats; as a developer you then need to remember which are in NTZ format and convert them when using the data from snapshots, especially when combining data from multiple snapshots.

So all in all, I think there would be a lot fewer issues for developers if all timestamps in snapshots were in UTC TZ format, with dbt-core ensuring that they are.

@RobbertDM
Contributor

RobbertDM commented Dec 29, 2022

Hi, I just ran into this issue too while making dbt snapshots with the 'check' strategy. While the problem likely wouldn't result in a failing system for us, it was confusing to see the snapshots' updated_at fields an hour back in time (we're in the Brussels time zone). I think the essence is that updated_at (and the other timestamp fields created when making a dbt snapshot) is a timestamp that should be interpreted as a UTC timestamp.

Therefore, I fully agree with @mats-wetterlund-pierce that it should also be stored as UTC TZ timestamp.

@mats-wetterlund-pierce
Author

mats-wetterlund-pierce commented Jan 30, 2023

I don't know if what I found now would be considered a separate bug or just a variation of what I submitted before.

In short, when an attribute in timestamp_tz format is used for updated_at, the timestamp attributes in the snapshot are created as timestamp_tz, but the code doesn't take that into consideration when updating the data in the snapshot.
The problem is mainly in the deletes subquery on subsequent runs:
to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_valid_to,
which expressly converts the value for dbt_valid_to into an NTZ timestamp that is then stored in a TZ column.

In our case the default time zone set in Snowflake is "America/Los_Angeles" (-08:00), so on hard delete we get e.g. "2023-01-30 03:19:10.863000000 -08:00" instead of "2023-01-30 03:19:10.863000000 +00:00", causing overlapping data if the same key comes back in the data within the 8 hours this adds to the valid_to.

Currently the only workaround I see is to ensure that the default time zone in the database is UTC.
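
The -08:00 offset appears because casting an NTZ value to TZ interprets it in the session time zone; a minimal sketch of the effect and the workaround, assuming a Snowflake session:

-- with the session time zone at America/Los_Angeles, the NTZ value produced for
-- hard deletes is re-interpreted with a -08:00 offset when stored in a TZ column
alter session set timezone = 'America/Los_Angeles';
select to_timestamp_ntz(convert_timezone('UTC', current_timestamp()))::timestamp_tz;
-- e.g. 2023-01-30 03:19:10.863 -0800  (an instant 8 hours later than the intended UTC time)

-- workaround: make the session default time zone UTC so the re-interpretation is a no-op
alter session set timezone = 'UTC';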

The snapshot model is:

{% snapshot raw_stock_snapshot %}

{{
    config(
      strategy = 'timestamp',
      unique_key = 'article_id',
      updated_at = 'last_modified',
      invalidate_hard_deletes = true
    )
}}

SELECT
    article_id,
    internal_stock,
    external_stock,
    sold_items,
    f_timestamp_ntz_to_timestamp_tz(last_modified, '+00:00') AS last_modified,
    _fivetran_synced,
    f_get_utc_timestamp()                                    AS insert_datetime
FROM
    {{ source('source_system', 'stock') }}

{% endsnapshot %}

The output of the first-time build:

create or replace transient table ANALYTICS_DEV.dbt_raw_snapshots.raw_stock_snapshot  as
        (
    select *,
        md5(coalesce(cast(article_id as varchar ), '')
         || '|' || coalesce(cast(last_modified as varchar ), '')
        ) as dbt_scd_id,
        last_modified as dbt_updated_at,
        last_modified as dbt_valid_from,
        nullif(last_modified, last_modified) as dbt_valid_to
    from (
SELECT
    article_id,
    internal_stock,
    external_stock,
    sold_items,
    f_timestamp_ntz_to_timestamp_tz(last_modified, '+00:00') AS last_modified,
    _fivetran_synced,
    f_get_utc_timestamp()                                    AS insert_datetime
FROM
    raw.source_system.stock
    ) sbq
        );

Output on subsequent build:

create or replace temporary table "ANALYTICS_DEV"."DBT_RAW_SNAPSHOTS"."RAW_STOCK_SNAPSHOT__dbt_tmp"  as
        (with snapshot_query as (
SELECT
    article_id,
    internal_stock,
    external_stock,
    sold_items,
    f_timestamp_ntz_to_timestamp_tz(last_modified, '+00:00') AS last_modified,
    _fivetran_synced,
    f_get_utc_timestamp()                                    AS insert_datetime
FROM
    raw.source_system.stock
    ),
    snapshotted_data as (
        select *,
            article_id as dbt_unique_key
        from "ANALYTICS_DEV"."DBT_RAW_SNAPSHOTS"."RAW_STOCK_SNAPSHOT"
        where dbt_valid_to is null
    ),
    insertions_source_data as (
        select
            *,
            article_id as dbt_unique_key,
            last_modified as dbt_updated_at,
            last_modified as dbt_valid_from,
            nullif(last_modified, last_modified) as dbt_valid_to,
            md5(coalesce(cast(article_id as varchar ), '')
         || '|' || coalesce(cast(last_modified as varchar ), '')
        ) as dbt_scd_id
        from snapshot_query
    ),
    updates_source_data as (
        select
            *,
            article_id as dbt_unique_key,
            last_modified as dbt_updated_at,
            last_modified as dbt_valid_from,
            last_modified as dbt_valid_to
        from snapshot_query
    ),
    deletes_source_data as (
        select
            *,
            article_id as dbt_unique_key
        from snapshot_query
    ),
    insertions as (
        select
            'insert' as dbt_change_type,
            source_data.*
        from insertions_source_data as source_data
        left outer join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_unique_key is null
           or (
                snapshotted_data.dbt_unique_key is not null
            and (
                (snapshotted_data.dbt_valid_from < source_data.last_modified)
            )
        )
    ),
    updates as (
        select
            'update' as dbt_change_type,
            source_data.*,
            snapshotted_data.dbt_scd_id
        from updates_source_data as source_data
        join snapshotted_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where (
            (snapshotted_data.dbt_valid_from < source_data.last_modified)
        )
    ),
    deletes as (
        select
            'delete' as dbt_change_type,
            source_data.*,
            to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_valid_from,
            to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_updated_at,
            to_timestamp_ntz(convert_timezone('UTC', current_timestamp())) as dbt_valid_to,
            snapshotted_data.dbt_scd_id
        from snapshotted_data
        left join deletes_source_data as source_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where source_data.dbt_unique_key is null
    )
    select * from insertions
    union all
    select * from updates
    union all
    select * from deletes
        );

@Fleid assigned Fleid and unassigned dbeatty10 Feb 14, 2023
@dbeatty10
Contributor

@mats-wetterlund-pierce and @RobbertDM I understand where you are coming from with wanting to standardize on TZ timestamps with an explicit UTC offset of +0000 in Snowflake (something like 2023-02-14 20:00:00.000 +0000).

As you know, our current implementation is standardized on NTZ, with an implied UTC offset of +0000 (so 2023-02-14 20:00:00.000 instead).

Changing that current default implementation would be a breaking change, so it's not something we are willing to do.

But I can show you how to use the TZ data type for snapshots using the timestamp strategy in dbt-snowflake (given the caveats called out below).

Using timestamp_tz for snapshots

⚠️ Caveat emptor
I'd consider everything that follows "Free Solo" mode. It relies on internal implementation details that we haven't written about in other venues or declared as public interfaces. Although I'm guessing we won't, we could make a change in any version that would affect the functionality of the example below.

Example code for you to try


models/data_versions.sql

{{
    config(
        materialized='table'
    )
}}

    -- 1 hour in the past
    select
        1 as table_version,
        'abc' AS data_key,
        cast(convert_timezone('UTC',                 dateadd(hour, -1, date_trunc('hour', current_timestamp())) ) as datetime) as last_modified_UTC_ntz,
        cast(convert_timezone('America/Los_Angeles', dateadd(hour, -1, date_trunc('hour', current_timestamp())) ) as datetime) as last_modified_LAX_ntz,
             convert_timezone('America/Los_Angeles', dateadd(hour, -1, date_trunc('hour', current_timestamp())) )              as last_modified_tz,
                                                     dateadd(hour, -1, date_trunc('hour', current_timestamp()))                as last_modified_ltz

    union all

    -- current hour
    select
        2 as table_version,
        'abc' AS data_key,
        cast(convert_timezone('UTC',                 date_trunc('hour', current_timestamp())) as datetime) as last_modified_UTC_ntz,
        cast(convert_timezone('America/Los_Angeles', date_trunc('hour', current_timestamp())) as datetime) as last_modified_LAX_ntz,
             convert_timezone('America/Los_Angeles', date_trunc('hour', current_timestamp()))              as last_modified_tz,
                                                     date_trunc('hour', current_timestamp())               as last_modified_ltz
    
    union all

    -- "delete" the previous data by not having any rows for table_version = 3

    -- 1 hour into the future
    select
        4 as table_version,
        'abc' AS data_key,
        cast(convert_timezone('UTC',                 dateadd(hour, 1, date_trunc('hour', current_timestamp())) ) as datetime) as last_modified_UTC_ntz,
        cast(convert_timezone('America/Los_Angeles', dateadd(hour, 1, date_trunc('hour', current_timestamp())) ) as datetime) as last_modified_LAX_ntz,
             convert_timezone('America/Los_Angeles', dateadd(hour, 1, date_trunc('hour', current_timestamp())) )              as last_modified_tz,
                                                     dateadd(hour, 1, date_trunc('hour', current_timestamp()))                as last_modified_ltz

snapshots/my_snapshot_tz.sql

{% snapshot my_snapshot_tz %}

{{
    config(
      strategy='timestamp',
      unique_key='data_key',
      updated_at='last_modified_tz',
      invalidate_hard_deletes=true,
      target_schema=target.schema
    )
}}

select
    table_version,
    data_key,
    last_modified_LAX_ntz,
    last_modified_UTC_ntz,
    last_modified_ltz,
    last_modified_tz

from {{ ref("data_versions") }} as source

where
    source.table_version = {{ var("table_version", 1) }}

{% endsnapshot %}

macros/drop_my_snapshot_tz.sql

{# --This is only included so the example can be re-run repeatedly from the beginning #}
{% macro drop_my_snapshot_tz() %}

  {% set query %}
    drop table if exists {{ ref("my_snapshot_tz") }}
  {% endset %}

  {{ log(query, True) }}
  {% do run_query(query) %}

{% endmacro %}

This is the piece that allows you to use the timestamp_tz data type and render according to the local time zone of your session when you do hard deletes.

⚠️ it overrides the default implementation here, which is not currently documented as a user-facing public interface.

macros/snapshot_get_time.sql

{% macro snapshot_get_time() -%}
    current_timestamp()::timestamp_tz
{%- endmacro %}

Then run the following in your terminal:

dbt run -s data_versions
dbt run-operation drop_my_snapshot_tz
dbt snapshot --select my_snapshot_tz --vars '{"table_version": "2"}'
dbt snapshot --select my_snapshot_tz --vars '{"table_version": "3"}'
dbt snapshot --select my_snapshot_tz --vars '{"table_version": "4"}'

In between snapshots, you can query the table if you like. At the end, the valid_from/to values look like this:

[screenshot of the resulting dbt_valid_from/dbt_valid_to values]

Summary

The crucial thing is that the data types and semantics match between the snapshot_get_time() macro and whatever is specified as the updated_at column in your snapshot config.

The most recent thing you ran into, @mats-wetterlund-pierce, is that dbt trusts that updated_at has the NTZ data type (representing UTC), and its default implementation of the snapshot_get_time() macro used by hard deletes matches it with an NTZ data type.

@dbeatty10 assigned dbeatty10 and unassigned Fleid Feb 16, 2023
@mats-wetterlund-pierce
Author

The first "flaw" in your comment is:

> As you know, our current implementation is standardized on NTZ

If it were as stated, there wouldn't be any issues, since all timestamps would be converted with the same parameters and would still be correct in relation to each other. The current implementation is a mix of TZ and NTZ attributes, the requirements are undocumented, and the implementation is flawed because of this; and now the approach is to not correct the flaws.

> Changing that current default implementation would be a breaking change, so it's not something we are willing to do.

So if there are bugs, you prefer to keep them rather than implement breaking changes that would improve the product?
I don't believe that is a good way to develop a product. I was trying to report issues with the implementation to help improve the product, but that seems to have just been a waste of time.

@dbeatty10
Contributor

I'm sorry @mats-wetterlund-pierce -- it was not my intent for you to feel like your time has been wasted!

We do take your feedback seriously and I agree that there are things we can do to improve the experience of using snapshots in dbt.

I did some writing offline and tried to outline three categories of things:

  1. key details to add to the documentation for snapshots
  2. criticisms of the current snapshot implementations (by you or others)
  3. potential ideas to ease the use of snapshots

The 1st category can be implemented with improvements to the docs here, and I've opened a new issue to cover the key information:

For the 2nd and 3rd categories, I created a new GitHub Discussion to allow for further open-ended dialogue and exploration:

I tried to capture and summarize the things you've run into and reported. I'd definitely invite your thoughts in that discussion on how snapshots could behave and how to accomplish it.

@dbeatty10 removed the triage label Feb 21, 2023
@dbeatty10 removed their assignment Feb 21, 2023
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions bot added the stale label Aug 21, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions bot closed this as not planned Aug 28, 2023