Lift + shift for cross-db macros #597
Update 2022-08-29: See comment here for a contrarian analysis. Original post continues below.

Added a functional test to compare the competing definitions for `current_timestamp()`. See below for the actual SQL utilized in each test.

**Postgres**

```sql
select
    now() as actual,
    current_timestamp::timestamp without time zone as expected
```

**Redshift**

```sql
select
    getdate() as actual,
    getdate() as expected
```

**Snowflake**

```sql
select
    convert_timezone('UTC', current_timestamp()) as actual,
    current_timestamp::timestamp_ntz as expected
```

**BigQuery**

```sql
select
    CURRENT_TIMESTAMP() as actual,
    current_timestamp as expected
```
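One way to mechanize the comparison is to assert that each pair of expressions agrees. A minimal sketch using the Postgres pair above (the comparison approach is assumed, not the test's actual code):

```sql
-- Postgres: returns rows only if the two definitions disagree
select actual, expected
from (
    select
        now() as actual,
        current_timestamp::timestamp without time zone as expected
) as comparison
where actual is distinct from expected
```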
@dbeatty10 Amazing work thus far! As discussed yesterday, let's split out the `current_timestamp()` changes.

One realization I had with regard to backwards compatibility, which I don't think I was clear on in #594: this approach will break, I believe, anyone who is currently using a "shim" package (e.g. spark_utils) via a dispatch config like:

```yaml
# dbt_project.yml
dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']
```

Because dispatching straight to the `dbt` namespace bypasses the user's `dbt_utils` search order, the shim package's implementations would never be found. We could preserve backwards compatibility by replacing this:

```jinja
{% macro any_value(expression) -%}
    {{ return(adapter.dispatch('any_value', 'dbt')(expression)) }}
{% endmacro %}
```

With this:

```jinja
{% macro any_value(expression) -%}
    {{ return(adapter.dispatch('any_value', 'dbt_utils')(expression)) }}
{% endmacro %}

{% macro default__any_value(expression) -%}
    {{ return(adapter.dispatch('any_value', 'dbt')(expression)) }}
{% endmacro %}
```

And maybe a note at the top of every file, "This is here for backwards compatibility only"?
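Purely for illustration, a hypothetical shim macro that this search order would pick up first (the real spark_utils implementation may differ):

```jinja
{#-- hypothetical spark_utils macro; found via search_order before dbt_utils --#}
{% macro spark__any_value(expression) -%}
    first({{ expression }}, true)
{%- endmacro %}
```

With the config above, a call to `dbt_utils.any_value()` dispatched against `macro_namespace: dbt_utils` would resolve to `spark__any_value` before falling through to `default__any_value`.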
Sounds great @jtcohen6 -- I will split out the `current_timestamp()` changes.

In terms of backwards compatibility, I'm most attracted to a solution that preserves full backwards compatibility without users or shim maintainers needing to do anything. So I'll take a swing at implementing the approach you described in the 3rd option for amelioration. If we want, we can also add a deprecation warning so we have the option to fully remove this boilerplate in a future major release.
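A minimal sketch of what such a deprecation warning could look like inside the boilerplate macro (the message text is assumed, not final):

```jinja
{% macro any_value(expression) -%}
    {#-- hypothetical warning; log() is dbt's built-in Jinja logger --#}
    {% do log("dbt_utils.any_value() is deprecated in favor of dbt.any_value().", info=True) %}
    {{ return(adapter.dispatch('any_value', 'dbt_utils')(expression)) }}
{% endmacro %}
```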
@jtcohen6 could I get your help brainstorming what could be going wrong here? All functional tests are working, but the dbt-utils OG integration tests are not (for Redshift specifically).

**Overview of failure on CircleCI**

After setting up the dispatch train, all adapters are passing the OG integration tests... except for Redshift. Here's the error:

**Steps to reproduce**
**Working hypothesis**

Maybe a combination of nested multiple dispatch and non-static parsing, where resource replacement is overwriting or misplacing dbt-core's default implementation.

Possibly unrelated context:
**What is different about dbt-redshift**

The Redshift adapter explicitly references the default implementation of `dateadd`:

```jinja
{#-- redshift should use default instead of postgres --#}
{% macro redshift__dateadd(datepart, interval, from_date_or_timestamp) %}
    {{ return(dbt.default__dateadd(datepart, interval, from_date_or_timestamp)) }}
{% endmacro %}
```
**Troubleshooting assets**

**Standard out**

**Jinja state at error time**

It is looking for a key that the namespace doesn't contain. The error is thrown from here. Here are the values:
**Logs**

To gain better visibility within the call stack (namely source repo and dispatched macro name), I added logging to the `dateadd` macros:

```jinja
{% macro dateadd(datepart, interval, from_date_or_timestamp) %}
    {{ log("dbt-utils dateadd", True) }}
    {{ return(adapter.dispatch('dateadd', 'dbt_utils')(datepart, interval, from_date_or_timestamp)) }}
{% endmacro %}
```
**Stack trace**

I commented out this section to capture the full stack trace.
@dbeatty10 Oy, this gets deep quickly. I stuck a few debuggers in a few places, and came up with strong evidence that this is indeed a circular dependency. Some printing from the end of this method:
That first bit doesn't look quite right, but everything else does. I'd expect the dispatch to resolve to dbt-core's `default__dateadd`.

Well... I think this is because we have special macro resolution rules for "internal" macros (i.e. macros defined in dbt-core and adapter plugins). Ultimately, instead of returning `default__dateadd`, the resolver hands back `redshift__dateadd` itself -- which is exactly the circular dependency.

We could go very deep on this logic... or we could just duplicate the default implementation's logic inside the Redshift adapter's macro.
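A minimal sketch of that duplication, assuming dbt-core's default is the plain `dateadd()` call shown earlier:

```jinja
{#-- redshift: inline the default logic instead of dispatching back to dbt --#}
{% macro redshift__dateadd(datepart, interval, from_date_or_timestamp) %}
    dateadd(
        {{ datepart }},
        {{ interval }},
        {{ from_date_or_timestamp }}
        )
{% endmacro %}
```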
That "just works." |
Thank you for the explanation and examples, @jtcohen6!

**As an aside**

Do you interpret this as an issue? i.e., is it worth creating a ticket in dbt-core that describes this (undesirable?) behavior when mixing dispatch to an internal and a non-internal package?

**My next actions**

As you suggested, I will copy-paste all the relevant logic into the Redshift adapter. Then all the tests should pass, and we'll be safe to merge this batch of pull requests.
It's a very niche edge case, but it all adds up to some unexpected behavior. I think an issue for it is totally justified. That could be me or you; let's take it as a TODO.
🚢
**Comparison, revisited**

In this comment, I showed the results of a functional test comparing the competing definitions for `current_timestamp()`. The method used to obtain this result always made me feel uneasy -- just because it didn't find any differences doesn't guarantee that there weren't any differences! 😅

Since @colin-rogers-dbt is digging into an implementation for dbt-labs/dbt-core#5521, I revisited this comparison and took a different approach.

**Approach**
**Semantic data types**

Snowflake has 3 different timestamp data types that are (generally*) representative of the observed behaviors across different databases. At risk of further confusion (and introducing a 15th competing standard), I will use the following nomenclature instead of the Snowflake-specific counterpart:

- offset timestamp (Snowflake `timestamp_tz`): an absolute point in time that also stores a UTC offset
- session timestamp (Snowflake `timestamp_ltz`): an absolute point in time, rendered in the session time zone, with no stored offset
- civil timestamp (Snowflake `timestamp_ntz`): a "wall clock" date and time with no time zone information
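To make the three semantics concrete, a small illustrative Snowflake query casting one instant to each type:

```sql
-- Snowflake: the same instant expressed in each semantic type
select
    current_timestamp::timestamp_tz  as offset_ts,   -- stores a UTC offset
    current_timestamp::timestamp_ltz as session_ts,  -- absolute instant, shown in the session time zone
    current_timestamp::timestamp_ntz as civil_ts     -- wall-clock value, no time zone semantics
```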
**Results**

**Takeaways**
- The best way to verify these results intra-database would be to materialize tables with the results and then query the data types.
- Being able to automatically discover the semantics of different data types across databases (based on behaviors) will require a bit more ingenuity.

\* From the best I can tell, there are 3 different alternatives for how to think about the BigQuery `timestamp` type:
Would love to be shown the light by anyone who knows for certain. Regardless of which of the three is most accurate, it doesn't affect the takeaways above.

Edit: session timestamp seems the most likely for BigQuery, since it "represents an absolute point in time, independent of any time zone or convention such as Daylight Savings Time". That implies it doesn't store a UTC offset the way offset timestamps do. Civil timestamps don't represent absolute points in time, so it isn't that either. Of the 3 types outlined above, that leaves only session timestamp, which is the non-offset option for an absolute point in time in the ontology above. Updated the categorization of BigQuery in the "Results" section above accordingly.
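A hedged sketch of that intra-database verification idea, in Snowflake syntax (the table name `ts_probe` is made up):

```sql
-- Materialize the two competing definitions side by side...
create or replace table ts_probe as
select
    convert_timezone('UTC', current_timestamp()) as actual,
    current_timestamp::timestamp_ntz             as expected;

-- ...then ask the database what types it actually inferred.
select column_name, data_type
from information_schema.columns
where table_name = 'TS_PROBE';
```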
Continues #594
This is a:
Description & motivation
See:
**To do before merge**

- [ ] `dev-requirements.txt` back to original
- [ ] `dbt_utils.current_timestamp()` before and after (for each database)
- [ ] `@pytest.mark`

**`current_timestamp()`**

Database adapters already had a previous definition of `current_timestamp()`, and it is often expressed differently than within dbt-utils. I worry that this could lead to a breaking change, so we should determine whether they are logically equivalent or not.

We have some options for how to handle this:

- ... `macros/cross_db_utils` folder.

**Comparisons**
See below for a comparison between the existing adapter definition vs. the dbt-utils definition.
**Default**

- Adapter: `core/dbt/include/global_project/macros/adapters/freshness.sql`
- dbt-utils: `cross_db_utils/current_timestamp.sql`

**Postgres**

- Adapter: `plugins/postgres/dbt/include/postgres/macros/adapters.sql`
- dbt-utils: `cross_db_utils/current_timestamp.sql`

**Redshift**

- Adapter: `dbt/include/redshift/macros/adapters.sql`
- dbt-utils: `cross_db_utils/current_timestamp.sql`

**Snowflake**

- Adapter: `dbt/include/snowflake/macros/adapters.sql`
- dbt-utils: `cross_db_utils/current_timestamp.sql`

**BigQuery**

- Adapter: `dbt/include/bigquery/macros/adapters.sql`
- dbt-utils: `cross_db_utils/current_timestamp.sql`

```jinja
{% macro bigquery__current_timestamp() %} current_timestamp {% endmacro %}
```
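For completeness, a model reaches whichever definition wins through dispatch; a minimal usage sketch (model context assumed):

```sql
-- any model: dispatch resolves to the adapter-specific implementation
select {{ dbt_utils.current_timestamp() }} as captured_at
```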