
Difference between behaviour of already_exists and get_relation #1683

Closed
sumit0k opened this issue Aug 14, 2019 · 3 comments · Fixed by #1770
Labels
bug Something isn't working

Comments

sumit0k commented Aug 14, 2019

Describe the bug

I am upgrading dbt from 0.12.2 to 0.14. Based on this link, already_exists is deprecated and will be removed in favour of get_relation.

But I am seeing different behaviour from get_relation. I am using the macro below, which runs at the start and end of table creation.

{% if not adapter.get_relation(target.dbname, this.schema, this.table)%}
    -- currently writing -1 to denote, macro has not run
    {{return(-1)}}
  {% else %}
    -- run count rows query
    {% call statement('rows_count_action', fetch_result=True) %}
        SELECT
            COUNT(*) AS rows_count
        FROM
            {{this}}
        {% if where_filter %}
        -- where filter only if where clause is present
        WHERE
            {{where_filter}}
        {% endif %}
    {% endcall %}
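
For context, the snippet above is cut off. A complete version of the macro might look roughly like the following (a sketch only: the macro name get_count and the optional where_filter argument are inferred from the post-hook further down, and the count is pulled out with load_result):

    {% macro get_count(where_filter=None) %}
      {% if not adapter.get_relation(target.dbname, this.schema, this.table) %}
        -- the relation is not visible to dbt: return -1 to mark "macro has not run"
        {{ return(-1) }}
      {% else %}
        {% call statement('rows_count_action', fetch_result=True) %}
            SELECT
                COUNT(*) AS rows_count
            FROM
                {{ this }}
            {% if where_filter %}
            WHERE
                {{ where_filter }}
            {% endif %}
        {% endcall %}
        -- fetch the single COUNT(*) value from the statement result and return it
        {% set rows_count = load_result('rows_count_action')['data'][0][0] %}
        {{ return(rows_count) }}
      {% endif %}
    {% endmacro %}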

After the table is created successfully during the run, when get_relation runs in a post-hook via the macro above, it returns a None value.

And when I run the same macro in a pre-hook, dbt executes the else branch even if the table does not exist, and the SELECT query throws a "relation does not exist" error.

Steps To Reproduce

The models where I am seeing this use a custom materialization, which is based on the table materialization. The problem does not happen with the plain table materialization.

Here is the code for the custom materialization:

{% materialization pm_transient, default %}
    {# table identifiers #}
    {%- set identifier = model['alias'] -%}
    {%- set tmp_source_suffix = '__dbt_source_tmp' -%}
    {%- set tmp_processed_suffix = '__dbt_incremental_tmp_' -%}

    {%- set target_relation = api.Relation.create(database=database, identifier=identifier, schema=schema, type='table') -%}
    {%- set tmp_source_relation = make_temp_relation(target_relation, tmp_source_suffix) -%}
    {%- set tmp_processed_relation = make_temp_relation(target_relation, tmp_processed_suffix) -%}

    -- drop the relations if they exist for some reason
    {{ adapter.drop_relation(target_relation) }}
    {{ adapter.drop_relation(tmp_source_relation) }}
    {{ adapter.drop_relation(tmp_processed_relation) }}

    {{ run_hooks(pre_hooks, inside_transaction=False) }}

    -- `BEGIN` happens here:
    {{ run_hooks(pre_hooks, inside_transaction=True) }}

    {%- call statement('source_table_creation') -%}
        {# Directly creating table from SQL to support Spectrum #}
        {{ dbt.create_table_as(False, tmp_source_relation, sql) }}
    {%- endcall -%}

    {{ adapter.drop_relation(tmp_processed_relation) }}
    {{ adapter.rename_relation(tmp_source_relation, tmp_processed_relation) }}

    {% call statement('main') %}
        {% set processed_table_sql -%}
            SELECT
                *
            FROM
                {{ tmp_processed_relation }}
        {%- endset %}

        -- Prepare temp table from scd logic
        {{ dbt.create_table_as(False, target_relation, processed_table_sql) }}
    {% endcall %}

    {{ run_hooks(post_hooks, inside_transaction=True) }}

    -- `COMMIT` happens here
    {{ adapter.commit() }}

    {{ drop_relation_if_exists(tmp_processed_relation) }}

    {{ run_hooks(post_hooks, inside_transaction=False) }}
{%- endmaterialization %}

Create a model using the above materialization and run the macro mentioned above as a post-hook. If it writes -1 to the audit table or to the log, that means get_relation is not working.
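
For example, a minimal model using this materialization might look like the following (a sketch: the file name matches the dim_api_version model in the log below, and the SELECT body is purely illustrative):

    -- models/dim_api_version.sql
    {{ config(materialized='pm_transient') }}

    SELECT 1 AS api_version_id

Running it with something like dbt run --models dim_api_version --vars '{run_type: not_defined}' (the var is needed by the post-hook below) should reproduce the -1 row in the audit table.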

Expected behavior

This used to work with 0.12.2, correctly reporting the presence of the table in the database. get_relation is expected to behave the same way as already_exists, since it is the replacement.

Screenshots and log output

This is my post-hook in the dbt_project.yml file:

    post-hook:
      - "{{ logging.log_model_end_event() }}"
      - sql: >-
          INSERT INTO
              {{target.schema}}.dbt_audit
              VALUES (
                  'completed model deployment'
                  , GETDATE()
                  , '{{this.schema}}'
                  , '{{this.table}}'
                  , '{{run_started_at}}'
                  , '{{invocation_id}}'
                  , {{get_count()}}
                  , '{{var('run_type')}}'
              )
        transaction: False

You can see in the first log line below that 10 rows were inserted, but the INSERT INTO query still receives -1 from the macro above, which means get_relation reported that the table does not exist.

2019-08-14 16:02:35,451 (Thread-1): SQL status: INSERT 0 10 in 0.32 seconds
...
2019-08-14 16:02:38,548 (Thread-1): Using redshift connection "dim_api_version".
2019-08-14 16:02:38,548 (Thread-1): On dim_api_version: 
        commit;
      
2019-08-14 16:02:39,060 (Thread-1): SQL status: COMMIT in 0.51 seconds
2019-08-14 16:02:39,072 (Thread-1): Using redshift connection "dim_api_version".
2019-08-14 16:02:39,072 (Thread-1): On dim_api_version: 
      INSERT INTO
    analytics_dbt_test.dbt_audit
    VALUES (
        'completed model deployment'
        , GETDATE()
        , 'analytics_dbt_test'
        , 'dim_api_version'
        , '2019-08-14 10:32:00.717778+00:00'
        , '1234f41f-2a11-44d6-b265-9942bfc8b102'
        , -1
        , 'not_defined'
    )

System information

Which database are you using dbt with?

  • redshift

The output of dbt --version:

installed version: 0.14.0
   latest version: 0.14.0

Up to date!

The operating system you're using:
macOS Mojave 10.14.5
The output of python --version:
Python 3.7.0

sumit0k added the bug and triage labels Aug 14, 2019
drewbanin (Contributor) commented

Hey @sumitkumar1209 - I can see in your materialization that you're running a setup step like:

    {{ adapter.drop_relation(target_relation) }}

When this happens, dbt does two things:

  1. it drops the physical table in the database (before the transaction begins! This is dangerous!)
  2. it updates dbt's cache to understand that the target_relation no longer exists in the database

dbt uses a cache to help reduce the number of information_schema queries it needs to run. We recommend switching from methods like already_exists to alternatives like is_incremental() or get_relation() because they make use of the cache, and so should be significantly faster.
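
For reference, a minimal sketch of the two calls side by side (the variable names are illustrative; the arguments mirror the macro at the top of this issue):

    {# deprecated call; per the note above it does not go through the relation cache #}
    {% set exists_via_already_exists = adapter.already_exists(this.schema, this.table) %}

    {# recommended call; it resolves against dbt's relation cache #}
    {% set relation = adapter.get_relation(target.dbname, this.schema, this.table) %}
    {% set exists_via_get_relation = relation is not none %}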

In this case, I think your call to drop the target_relation is leading to a cache inconsistency. dbt thinks the relation has been dropped, but this materialization is re-creating it.

In dbt, we account for this by re-caching the new relation after the materialization finishes running. You can see that happen here: https://github.com/fishtown-analytics/dbt/blob/dev/0.14.1/core/dbt/node_runners.py#L351

This happens after the materialization completes, so introspective queries in post-hooks will consult the cache while it is in an incorrect state.

One option here may be to change the calls from adapter.drop_relation to plain SQL that runs drop table if exists .... This will avoid mutating the cache.
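
In the materialization above, that change might look like the following (a sketch; the statement name drop_target_if_exists is arbitrary, and the same pattern would apply to the two temp relations):

    {# run the DDL directly instead of {{ adapter.drop_relation(target_relation) }}
       so dbt's relation cache is left untouched #}
    {%- call statement('drop_target_if_exists') -%}
        drop table if exists {{ target_relation }} cascade
    {%- endcall -%}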

For our part, we should probably expose some ability to modify the cache from within a materialization. That might look like a call to

    {% call statement('main') %}
        {% set processed_table_sql -%}
            SELECT
                *
            FROM
                {{ tmp_processed_relation }}
        {%- endset %}

        -- Prepare temp table from scd logic
        {{ dbt.create_table_as(False, target_relation, processed_table_sql) }}
    {% endcall %}

    -- this is the thing we may want to add to dbt
    {{ adapter.cache.add(target_relation) }}

By doing this, you'll inform dbt's cache that you've created a new relation. Conceivably, we could bake this right into the create_table_as logic instead.

Thanks for writing this up! It's an interesting issue for sure. Let me know what you think.

sumit0k (Author) commented Aug 15, 2019

Hi @drewbanin,

Keeping this code in context https://github.com/fishtown-analytics/dbt/blob/dev/0.14.1/core/dbt/node_runners.py#L351,

Is the adapter mentioned in that line the same one we use in macros? In other words, can I directly call adapter.cache_new_relation(relation)?

Your suggestions are great. I would prefer the cache to be updated in create_table_as if possible, as that would be one less thing to keep in mind while writing custom materializations.

drewbanin (Contributor) commented

Yep, totally agree. Unfortunately, the cache property isn't currently exposed on the adapter object. This is something we'd need to address in a future release of dbt.

drewbanin removed the triage label Aug 15, 2019
drewbanin modified the milestones: 0.14.1, Louisa May Alcott Aug 15, 2019
beckjake added a commit that referenced this issue Sep 23, 2019: Expose the cache in dbt's rendering contexts (#1683)