Feature/archive blocks #1361

beckjake · 2019-03-21T14:51:11Z

Fixes #1175
Fixes #251
Fixes #1333
Fixes #706
Fixes #1167
Fixes #1066

Changes in this branch:

implements an archive block parser
slightly changes how the GraphLoader works for sql nodes
passes flake8
passes at all integration tests
supports both old-style archive blocks in dbt_project.yml and a new archive-paths that defaults to ["archives"]. The contents of archive-paths should be .sql files that contain archive blocks
it seems to me like the target database/schema should be optional, and default to the connection settings. But maybe that's wrong. For now it's required, though see above bullet.
I guess I technically have not tested macros but since archives use macros and aren't special anymore, I am pretty confident that calling them will work!
Added ref'ing support for archives + tests
Added explicit tests of the 'archives as a query' behavior
Added support for model selection syntax on archives + tests
Added a more helpful error when an old-style existing archive table is found (valid_to instead of dbt_valid_to, etc)

On missing config parameters to a dbt archive named {% archive foo %}...:

$ cat archives/test.sql
{% archive foo %}
    {{ config(target_schema=schema, unique_key='order_id', strategy='timestamp', updated_at='order_date') }}
    select * from {{ ref('fct_orders') }} where status = 'completed'
{% endarchive %}
$
$ dbt compile
Running with dbt=0.13.0
Encountered an error:
Compilation Error in archive foo (archives/test.sql)
  Runtime Error
    Invalid arguments passed to "ParsedArchiveNode" instance: config.'target_database' is a required property

On an invalid (old archive style) target for the archive:

$ dbt archive
Running with dbt=0.13.0
Found 8 models, 20 tests, 1 archives, 0 analyses, 98 macros, 0 operations, 3 seed files, 0 sources

11:41:54 | Concurrency: 2 threads (target='default')
11:41:54 |
11:41:54 | 1 of 1 START archive foo --> postgres.dbt_postgres.foo............... [RUN]
11:41:54 | 1 of 1 ERROR archiving foo --> postgres.dbt_postgres.foo............. [ERROR in 0.14s]
11:41:55 |
11:41:55 | Finished running 1 archives in 0.53s.

Completed with 1 errors:

Compilation Error in archive foo (archives/test.sql)
  Archive target has ("scd_id", "valid_from", "valid_to") but not ("dbt_scd_id", "dbt_valid_from", "dbt_valid_to") - is it an unmigrated previous version archive?

  > in macro materialization_archive_default (macros/materializations/archive/archive.sql)
  > called by archive foo (archives/test.sql)

Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1

Currently, the target_table field is not supported - instead you can use the name of your archive block or set an alias in your config() call. I think that's what we should do in general but I'm more than happy to discuss/change it.

beckjake · 2019-04-09T16:01:30Z

/azp run

drewbanin

Some initial comments! I plan on coming back here specifically to look at the Archive materialization implementation. Really nice stuff and glad that you snuck flake8 in here

drewbanin · 2019-04-12T20:39:42Z

core/dbt/adapters/base/relation.py

@@ -303,3 +303,91 @@ def is_cte(self):
    @property
    def is_view(self):
        return self.type == self.View
+
+
+class Column(object):


Do you think there's any merit to putting this in its own file?

I don't think it really matters either way.

drewbanin · 2019-04-12T20:40:25Z

core/dbt/api/object.py

@@ -1,6 +1,6 @@
 import copy
 from collections import Mapping
-from jsonschema import Draft4Validator
+from jsonschema import Draft7Validator


What's this all about?

I changed this because I wanted to use some features that changed between draft 4 and draft 7 (jsonschema has its own ref mechanism involving schema IDs), but then I reverted those because it got out of hand on the contracts side. I can change it back but I figured it wouldn't hurt to leave it.

Also: It would be cool to annotate our contracts the "proper" jsonschema way. It will be a big PR, but then we could use refs properly. The reason I reverted it was because our current way of combining contracts (deep_merge, etc) doesn't play well with the schema id/ref model jsonschema provides, and changing all that would massively increase the size and risk of this PR, which is already kind of out of hand for my taste.

drewbanin · 2019-04-12T20:49:45Z

core/dbt/include/global_project/macros/materializations/archive/archive.sql

-{% macro default__archive_scd_hash() %}
-    md5("dbt_pk" || '|' || "dbt_updated_at")
+{% macro default__archive_hash_arguments(args) %}
+    md5({% for arg in args %}{{ arg }}{% if not loop.last %} || '|' || {% endif %}{% endfor %})


this is cool :)

Two things:

|| isn't implemented in BQ (and conceivably other dbs)

You need to be careful when concatenating columns, as nulls will propagate. A single null column will make the whole expression null!

We'll basically want to implement the logic shown here: https://github.com/fishtown-analytics/dbt-utils/blob/master/macros/sql/surrogate_key.sql

What do you think would be a good mechanism for sharing this code between dbt-core and dbt-utils?

Well, yeah, that's why it's an adapter macro and bigquery overrides it to use concat... it's basically the same as the old archive_scd_hash behavior but a bit more generic.

Ok, that is a problem. We should probably just pull that dbt-utils code into dbt-core and let dbt-utils call it for backwards compatibility.

Update: actually, the only thing missing from this code is the coalesce call, and it slots in nicely to the bq/default divide. That seems a lot easier than pulling that macro and its dependencies into dbt, so I'm just going to add the coalesce call.

drewbanin · 2019-04-12T20:51:06Z

core/dbt/include/global_project/macros/materializations/archive/archive.sql

 {#
    Cross-db compatible archival implementation
 #}
-{% macro archive_select(source_relation, target_relation, source_columns, unique_key, updated_at) %}
-
+{% macro archive_select_timestamp(source_sql, target_relation, source_columns, unique_key, updated_at) -%}


I need to come back and take a deeper look at this

drewbanin · 2019-04-12T20:51:55Z

core/dbt/include/global_project/macros/materializations/archive/archive.sql

+    {{ archive_select_generic(source_sql, target_relation, transforms, scd_hash) }}
+{%- endmacro %}
+
+{# this is gross #}


Agree, let me have a think about how we can do this better....

We create the empty table if it doesn't exist, then we insert into it later in the invocation. Can we just change this to use a single create table as ( .... ) statement for the first run as we do with incremental models?

Maybe? I assumed archives worked the way they do for a good reason, but I don't know what it is, so maybe they don't!

drewbanin · 2019-04-12T20:54:25Z

core/dbt/include/global_project/macros/materializations/archive/archive.sql

-  {%- set source_database = config.get('source_database') -%}
-  {%- set source_schema = config.get('source_schema') -%}
-  {%- set source_table = config.get('source_table') -%}
+  {%- set target_table = model.get('alias', model.get('name')) -%}


this is clever, but i hesitate to make archivals work too much like models. We don't want them to participate in generate_schema_name, for instance. I think it might make more sense to specify the target table name here

I have to think about this some more, but I think we really do want archives to look a lot more like models when possible, because that makes it easier to both implement and reason about their behavior. I do agree on using target_schema/target_database, but I think the table name matching the block name makes a lot of sense... models will do that in the future when we have model blocks.

drewbanin · 2019-04-15T12:48:35Z

core/dbt/node_runners.py

@@ -526,7 +524,7 @@ def __init__(self, config, adapter, node, node_index, num_nodes):

    def handle_exception(self, e, ctx):
        if isinstance(e, dbt.exceptions.Exception):
-            if hasattr(e, 'node'):
+            if isinstance(e, dbt.exceptions.RuntimeException):


is this a flake8 thing or something else?

I don't remember if it was flake8 or just hygiene but I saw an opportunity to replace a hasattr with an isinstance check, and I think it's useful to always take those if you can.

drewbanin · 2019-04-15T12:51:04Z

core/dbt/parser/base_sql.py


        for n in nodes:
            node_path, node_parsed = self.parse_sql_node(n, tags)

            # Ignore disabled nodes
            if not node_parsed.config['enabled']:
-                disabled.append(node_parsed)
+                results.disable(node_parsed)


this is slick

raise on non-archive during parsing break archive materialization

tests fix event tracking test Fix print statements make archives not inherit configs from models archive now uses the name/alias properly for everything instead of target_table skip non-archive blocks in archive parsing instead of raising make archives ref-able - test for archive ref, test for archive selects raise a more useful message on incorrect archive targets add "--models" and "--exclude" arguments to archives - pass them through to selection - change get_fqn to take a full node object, have archives use that so selection behaves well - added tests Improve error handling on invalid archive configs Added a special archive-only node that has extra config restrictions add tests for invalid archive config

Contracts: some anyOf shenanigans to add support for check_cols Macros: split apart archive selection, probably too much copy+paste Legacy: Archive configs now include a "timestamp" strategy when parsed from dbt_project.yml Add integration tests fix aliases test Unquote columns in archives handle null columns attr -> use_profile

drewbanin

This LGTM - ship it when the tests pass!

beckjake force-pushed the feature/archive-blocks-as-regex branch 2 times, most recently from 30466b2 to 78a260a Compare March 25, 2019 15:52

beckjake force-pushed the feature/archive-blocks-as-regex branch 5 times, most recently from a54f6e9 to 1030c9a Compare April 9, 2019 15:47

beckjake force-pushed the feature/archive-blocks-as-regex branch 3 times, most recently from f1a5bce to 84b785c Compare April 9, 2019 19:42

beckjake marked this pull request as ready for review April 9, 2019 19:47

beckjake force-pushed the feature/archive-blocks-as-regex branch from 43bfb3b to cdcd654 Compare April 10, 2019 17:36

beckjake mentioned this pull request Apr 10, 2019

Add the check archive strategy (#706) #1394

Merged

beckjake requested a review from drewbanin April 12, 2019 14:36

beckjake force-pushed the feature/archive-blocks-as-regex branch 2 times, most recently from f3759b4 to ff94d87 Compare April 12, 2019 16:23

beckjake mentioned this pull request Apr 12, 2019

archives: do not create existing schema (#758) #1398

Merged

drewbanin reviewed Apr 15, 2019

View reviewed changes

Jacob Beck added 7 commits April 23, 2019 14:30

fix this for real in a way that will make me not break it again

53d083e

flake8, pep8, unit tests

af8622e

archive-paths support, wire up the block parser

ab63042

raise on non-archive during parsing break archive materialization

move column-related things into adapters where they belong

2b80d7a

Update jsonschema and go from Draft 4 to Draft 7

d66584f

beckjake force-pushed the feature/archive-blocks-as-regex branch from b00c824 to 416cc72 Compare April 23, 2019 20:49

Merge branch 'dev/wilt-chamberlain' into feature/archive-blocks-as-regex

dd23259

drewbanin approved these changes Apr 25, 2019

View reviewed changes

fix some merged-in flake8 failures

1042f1a

beckjake merged commit 96cb056 into dev/wilt-chamberlain Apr 26, 2019

beckjake deleted the feature/archive-blocks-as-regex branch May 10, 2019 20:22

graciegoheen mentioned this pull request May 30, 2024

[Feature] No more jinja block for snapshots - new snapshot design ideas #10246

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/archive blocks #1361

Feature/archive blocks #1361

beckjake commented Mar 21, 2019 •

edited

Loading

beckjake commented Apr 9, 2019

drewbanin left a comment

drewbanin Apr 12, 2019

beckjake Apr 15, 2019

drewbanin Apr 12, 2019

beckjake Apr 15, 2019

beckjake Apr 16, 2019

drewbanin Apr 12, 2019

beckjake Apr 15, 2019

beckjake Apr 18, 2019

drewbanin Apr 12, 2019

drewbanin Apr 12, 2019

beckjake Apr 15, 2019

drewbanin Apr 12, 2019

beckjake Apr 15, 2019 •

edited

Loading

drewbanin Apr 15, 2019

beckjake Apr 15, 2019

drewbanin Apr 15, 2019

drewbanin left a comment

Feature/archive blocks #1361

Feature/archive blocks #1361

Conversation

beckjake commented Mar 21, 2019 • edited Loading

beckjake commented Apr 9, 2019

drewbanin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

beckjake Apr 15, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drewbanin left a comment

Choose a reason for hiding this comment

beckjake commented Mar 21, 2019 •

edited

Loading

beckjake Apr 15, 2019 •

edited

Loading