[Bug] Unit test infers the wrong data type when the first record has some NULL values #821
Comments
Thanks for reporting this @leesiongchan! I wasn't able to replicate this on my first try. Could you try out the files below and make any tweaks needed to trigger the error you saw?
Project files and commands

select
    cast('2024-01-01 10:01:00' as timestamp) as created_at,
    1 as user_id,
    true as is_closed,
    false as is_validated,
    'created' as user_action

select 1 as id

-- {{ ref("stg_mycfg__user_snapshot") }}
select * from {{ ref("stg_mycfg__user_events") }}

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 3,
              is_closed: true,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 4,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 5,
              is_closed: false,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 6,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: true, is_validated: true }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: null, is_validated: null }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }

Run this command:
dbt build -s +stg_mycfg__user_events+1
|
You have to change your first record so that it has some NULL values, e.g.:

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
          ...
|
@leesiongchan This worked for me -- is it not working for you?

unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        rows:
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 1,
              is_closed: null,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:02:00",
              user_id: 2,
              is_closed: null,
              is_validated: true,
              user_action: validated,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 3,
              is_closed: true,
              is_validated: null,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 4,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 5,
              is_closed: false,
              is_validated: true,
              user_action: created,
            }
          - {
              created_at: "2024-01-01 10:01:00",
              user_id: 6,
              is_closed: true,
              is_validated: true,
              user_action: created,
            }
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      rows:
        - { user_id: 1, is_closed: null, is_validated: null }
        - { user_id: 2, is_closed: null, is_validated: true }
        - { user_id: 3, is_closed: true, is_validated: null }
        - { user_id: 4, is_closed: true, is_validated: true }
        - { user_id: 5, is_closed: false, is_validated: true }
        - { user_id: 6, is_closed: true, is_validated: true }
|
Ah! I can get the same error as you now. It runs fine in dbt-postgres, but gives an error in dbt-redshift. See details below for a simplified example.
|
Thanks for reporting an additional error scenario @jnapoleon 👍 Could you share the relevant example code as well so we can make sure it gets covered by any potential solution? |
I tested the dbt project described in #821 against a variety of data platforms, and only Redshift failed. So depending on the nature of the solution, we may end up transferring this issue to dbt-redshift. These all worked:
|
Hey @dbeatty10, I've attached the logs. Let me know if you would rather I create a new issue. Thanks, John |
Thanks for attaching your logs @jnapoleon 👍 Could you also provide a minimal code example like the "reprex" within #821 (comment)? That would allow us to determine if you are having the same issue or a different issue. |
@dbeatty10 I managed to resolve the issue; I don't think there was a bug with dbt. There was some strange behaviour when getting the data types from the information schema for the created temporary table. It thought the data type was varchar for the temporary table, but in the real final output model it was numeric. The way I solved it was by explicitly casting the fields as numeric in the final output model. |
I am also seeing this. Because I'm doing cross-platform unit tests for the audit helper package, I have a diff of what Snowflake generated vs. Redshift:
When nulls are present alongside numeric values, dbt is casting everything as a varchar(1), which makes unions between expected and actual impossible.
Here's the full file that ran in this CI run
In my case, the input rows lead with the null but the expected output has a non-null value first.

unit_tests:
  - name: reworked_compare_identical_tables_multiple_null_pk
    model: unit_reworked_compare
    given:
      - input: ref('unit_test_model_a')
        rows:
          - { "id": , "col1": "abc", "col2": "def" }
          - { "id": , "col1": "hij", "col2": "klm" }
          - { "id": 3, "col1": "nop", "col2": "qrs" }
      - input: ref('unit_test_model_b')
        rows:
          - { "id": , "col1": "abc", "col2": "def" }
          - { "id": , "col1": "hij", "col2": "klm" }
          - { "id": 3, "col1": "nop", "col2": "qrs" }
    expect:
      rows:
        - {"dbt_audit_row_status": 'identical', 'id': 3, dbt_audit_num_rows_in_status: 1}
        - {"dbt_audit_row_status": 'nonunique_pk', 'id': , dbt_audit_num_rows_in_status: 2}
        - {"dbt_audit_row_status": 'nonunique_pk', 'id': , dbt_audit_num_rows_in_status: 2}
    overrides:
      vars:
        reworked_compare__columns: ['id', 'col1', 'col2']
        reworked_compare__event_time:
        reworked_compare__primary_key_columns: ['id']
|
I made a little reprex for Redshift 💟 This will serve as a test in the PR to be
Note this does work on Postgres!

-- models/staging/a.sql
{{ config(materialized="table") }}
select 1 as id, 'a' as col1
union all
select 2, 'b'
union all
select 3, 'c'

-- models/staging/b.sql
{{ config(materialized="table") }}
select * from {{ ref('a') }}

unit_tests:
  - name: test_simple
    model: b
    given:
      - input: ref('a')
        rows:
          - { "id": , "col1": "d"}
          - { "id": , "col1": "e"}
          - { "id": 6, "col1": "f"}
    expect:
      rows:
        - {id: , "col1": "d"}
        - {id: , "col1": "e"}
        - {id: 6, "col1": "f"}
|
Can't be merged without a unit test, so blocked until then. The work itself is expected to take 1 extra dev day and to be taken care of on Monday.
edit: moved forward by digging more deeply into the unit test framework for Core. Code coverage concerns (plus a holiday week and a support diversion) backed this up further, but we're now unblocked and the issue is in final review. |
As we wrap up this issue -- in final review -- I want to document that there is one case we are leaving "unsolved": no single row has all non-null values.
In this scenario, one can determine proper type for
Acknowledging this limitation, we have integrated a
One objection you may voice
Bite the bullet. To do something more sophisticated is a lift we cannot prioritize right now.
The foibles of Redshift
Note, I did learn that this is a seemingly Redshift-specific behavior. Other databases like Postgres and Snowflake seem to have some more sophisticated type interpolation behaviors. |
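To make the case left unsolved concrete, here is a hypothetical variation of the `a`/`b` reprex from earlier in this thread. The test name and row values are made up for illustration. Every column has a non-null value in some row, but no single row is fully populated, so no one row can supply a type for every column:

```yaml
unit_tests:
  - name: test_no_fully_populated_row  # hypothetical test name
    model: b
    given:
      - input: ref('a')
        rows:
          - { id: 1, col1: null }    # col1 is missing in this row
          - { id: null, col1: "e" }  # id is missing in this row
    expect:
      rows:
        - { id: 1, col1: null }
        - { id: null, col1: "e" }
```

Going by the description above, adding one fully populated row to both `given` and `expect` should keep a test out of this scenario, assuming the model logic tolerates the extra row. That is an inference from the thread rather than confirmed behavior.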
Is this a new bug in dbt-core?
Current Behavior
When you run the test you will see an error like this.
And here are the logs:
Input build
Expected result build
Interestingly, the incorrect casting happens in the dbt_internal_unit_test_expected CTE but not in the input CTE.
Our current workaround is to use SQL to cast the correct type, e.g.
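A minimal sketch of what such a cast-based workaround could look like, assuming dbt's `format: sql` option for unit test rows and reusing the model and column names from this thread. The single row and its values are illustrative only. Explicit casts give every NULL a declared type, so nothing has to be guessed from the first record:

```yaml
unit_tests:
  - name: test_user_events_combined
    model: int_mycfg__user_events_combined
    given:
      - input: ref('stg_mycfg__user_events')
        format: sql
        rows: |
          select
            cast('2024-01-01 10:01:00' as timestamp) as created_at,
            1 as user_id,
            cast(null as boolean) as is_closed,     -- typed NULL
            cast(null as boolean) as is_validated,  -- typed NULL
            'created' as user_action
      - input: ref('stg_mycfg__user_snapshot')
        rows: []
    expect:
      format: sql
      rows: |
        select
          cast('2024-01-01 10:01:00' as timestamp) as created_at,
          1 as user_id,
          cast(null as boolean) as is_closed,
          cast(null as boolean) as is_validated,
          'created' as user_action
```

The expected rows list every column of the model here as a cautious choice. With SQL-formatted rows, the mock input needs to cover all columns of the relation it replaces.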
Expected Behavior
The test runner should be able to guess the data type correctly.
Steps To Reproduce
Relevant log output
Input build
Expected result build
Interestingly, the incorrect casting happens in the dbt_internal_unit_test_expected CTE but not in the input CTE.
Environment
Which database adapter are you using with dbt?
redshift
Additional Context
No response