
Add snapshot support #66

Closed · wants to merge 4 commits

Conversation

tongqqiu

  • cast as string (the base implementation uses varchar, which does not work on Spark; see the sketch after this list)
  • snapshots only work on the Delta format
  • slightly changed the base implementation to fit Delta merge needs
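
Per the first bullet, the varchar cast from the base implementation does not work on Spark, so the PR casts to string instead. A minimal sketch (the column and table names here are illustrative, not the PR's exact code):

-- base implementation's cast, which the PR reports failing on Spark SQL:
select cast(dbt_scd_id as varchar(50)) as dbt_scd_id from snapshotted_data

-- Spark-friendly equivalent:
select cast(dbt_scd_id as string) as dbt_scd_id from snapshotted_data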

Tony Qiu added 4 commits March 2, 2020 16:06
- making snapshot working with delta lake
…pr/support_snapshot

# Conflicts:
#	dbt/adapters/spark/impl.py
#	dbt/include/spark/macros/materializations/seed.sql
@tongqqiu
Author

@beckjake @jtcohen6 Trying to make snapshots work on the Delta format.

@beckjake beckjake force-pushed the pr/0.15.3_upgrade branch from a24e4e3 to 9b13d12 Compare March 23, 2020 14:51
@jtcohen6
Contributor

@tongqqiu This is really cool! I'm going to take your code for a spin over the next few days. It would be awesome to ship this as part of a 0.16.0 release.

@jtcohen6 jtcohen6 changed the base branch from pr/0.15.3_upgrade to master March 23, 2020 23:36
@jtcohen6
Contributor

jtcohen6 commented Mar 26, 2020

There are two categories of code addition going on here:

  1. Obvious adapter-specific reimplementation. dbt-core has created space for these by use of adapter macros.
    • spark__snapshot_hash_arguments: all set
    • spark__snapshot_string_as_time: needs to be added to the PR; here's my version (rendering illustrated after this list):
{% macro spark__snapshot_string_as_time(timestamp) -%}
    {%- set result = "to_timestamp('" ~ timestamp ~ "')" -%}
    {{ return(result) }}
{%- endmacro %}
  2. Areas where dbt's core implementation does not work on Spark by default, and has not enabled easy override of targeted functions by way of adapter macros.
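
For illustration, the spark__snapshot_string_as_time macro above just returns a string wrapping its input in Spark's to_timestamp (the timestamp value here is hypothetical):

{{ spark__snapshot_string_as_time('2020-03-26 14:30:00') }}
{# renders as: to_timestamp('2020-03-26 14:30:00') #}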

Specific problem

There are two issues with the insert that's happening in dbt's core implementation here:

  • On Spark, insert into does not take a list of columns for insertion
  • Spark doesn't have temp tables, only temp views, and temp views cannot be inserted into or described (both limitations are sketched below)
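
Both limitations in miniature (table names illustrative; behavior as described above for the Spark versions current for this PR):

-- fails to parse on Spark: insert into with an explicit column list
insert into snapshot_table (id, dbt_valid_from)
select id, dbt_valid_from from staging_data

-- works: positional insert, so the select must already match the target's columns
insert into snapshot_table
select * from staging_data

-- Spark offers temporary views rather than temp tables; per the above,
-- a temp view cannot be the target of an insert and cannot be described
create temporary view staging_data as select * from source_data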

The solution in this PR is to create two separate temp tables, one for updates and one for inserts, and then perform two merge statements (one for each).

While I can't pin down the exact implementation right now, since we're already overriding a bunch of code anyway, I wonder whether we shouldn't instead create a unioned table of all updates + inserts, and then perform a single (atomic) merge statement that looks a lot like the default implementation:

merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source.include(schema=false) }} as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id
    
    when matched
     and DBT_INTERNAL_DEST.dbt_valid_to is null
     and DBT_INTERNAL_SOURCE.dbt_change_type = 'update'
        then update
        set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

    when not matched
     and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
        then insert *
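
One way to build that unioned source, eliding the column logic that the PR's existing insert and update selects already handle (the view and subquery names here are illustrative, not a final implementation):

create or replace temporary view snapshot_staging as

    select *, 'insert' as dbt_change_type
    from snapshot_inserts

    union all

    select *, 'update' as dbt_change_type
    from snapshot_updates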

General problem

How can we override these macros: build_snapshot_table, snapshot_staging_table_updates, snapshot_staging_table_inserts? Simply creating a macro in dbt-spark with the same name as a macro in dbt-core results in:

Running with dbt=0.16.0
Encountered an error:
'ParsedMacro' object has no attribute 'namespace'

Right now, the only sane way to do this is by adding a prefix (e.g. spark__) to every one of these macros, such that they have a different name from a macro defined in core.

I'd be interested to talk through a better solution for overriding parts (but not all) of the snapshot macro-stack, working toward something that could be more easily generalized to other plugins.

Working version

I was able to get the code in this PR working on 0.16.0 by:

  • merging the changes from master
  • adding spark__snapshot_string_as_time as above
  • renaming all the macros that have namespace collisions with dbt-core

I think, no matter how we implement this, we should ensure that the user knows early and often this is Delta-only functionality. Part of that looks like updating the README, part of it looks like raising a compilation error if the snapshot's config.file_format != 'delta'.
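
For the compilation error, a minimal sketch in the snapshot materialization (message wording illustrative; parquet assumed as the default file_format):

{% set file_format = config.get('file_format', default='parquet') %}
{% if file_format != 'delta' %}
    {% do exceptions.raise_compiler_error(
        'Invalid file_format for snapshot: ' ~ file_format ~ '. Snapshots on Spark require file_format: delta'
    ) %}
{% endif %}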

@beckjake
Contributor

Suggestion: name spark-unique macros that aren't intentionally overriding a core adapter_macro with a spark_ prefix (one underscore) instead of spark__ (two underscores), to avoid their being picked up accidentally by core if it ever changes.
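
To make the convention concrete (the first body mirrors the shape of the hash macro this thread calls all set; the second is shown only for the naming, with its body elided):

{# two underscores: intentionally resolved by core's adapter_macro dispatch #}
{% macro spark__snapshot_hash_arguments(args) -%}
    md5({%- for arg in args -%}
        coalesce(cast({{ arg }} as string), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%})
{%- endmacro %}

{# one underscore: spark-only helper that core dispatch should never resolve #}
{% macro spark_build_snapshot_staging_table(strategy, sql, target_relation) %}
    {# body unchanged from this PR's version; only the name gains the prefix #}
{% endmacro %}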

@jtcohen6
Contributor

jtcohen6 commented Apr 7, 2020

@tongqqiu Are you up for making the few small changes to this PR required to get it working? I believe it would just involve:

  • Merging changes from master
  • Implementing spark__snapshot_string_as_time
  • Prepending macro names with spark_ if they have dbt-core namespace collisions

@beckjake As far as how to add a test for Spark snapshots, I'm thinking that for now it's something we could add to the spark-support branch of dbt-integration-tests.

@jtcohen6 jtcohen6 mentioned this pull request Apr 8, 2020
@tongqqiu
Author

tongqqiu commented Apr 8, 2020

@jtcohen6 could you help with those changes?

@jtcohen6
Contributor

> @jtcohen6 could you help with those changes?

I'm happy to take it from here, if that's okay with you!

@tongqqiu
Author

@jtcohen6 yes. That would be great!

@jtcohen6 jtcohen6 mentioned this pull request Apr 21, 2020
@jtcohen6 jtcohen6 closed this in #76 May 22, 2020