New command: `dbt clone` #7258

jtcohen6 · 2023-04-01T17:37:26Z

resolves #7256

$ dbt clone --state state/ --full-refresh

Description

Introduce clone command, which also makes use of a clone materialization.

On data platforms that support create table <dev> clone <prod>, let's use it
Otherwise, create views that are simple pointers (create view <dev> as select * from <prod>)
Don't recreate objects that already exist in the target schemas, unless --full-refresh is passed
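
The three rules above can be sketched as a small decision function. This is only an illustration, not the PR's implementation: `plan_clone_sql` and all of its parameters are hypothetical names, and the SQL strings simply mirror the statements quoted in the description.

```python
from typing import Optional


def plan_clone_sql(
    dev: str,
    prod: str,
    prod_is_table: bool,
    platform_can_clone: bool,
    dev_exists: bool,
    full_refresh: bool,
) -> Optional[str]:
    """Return the SQL to run for one node, or None if nothing should be done."""
    # Leave existing objects alone unless --full-refresh is passed.
    if dev_exists and not full_refresh:
        return None
    # Use a true clone when the platform supports it and prod is a table.
    if platform_can_clone and prod_is_table:
        return f"create or replace table {dev} clone {prod}"
    # Otherwise fall back to a view that selects from prod.
    return f"create or replace view {dev} as select * from {prod}"
```

Returning `None` for the "already exists" case keeps the skip decision separate from the choice between clone and view.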

Some interesting behaviors around:

accessing the --state manifest in a way that's similar to, but not exactly the same as, --defer
caching: we want to cache the other/prod schemas, in addition to the dev/target ones
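
To make the caching point concrete, here is a rough sketch of collecting the schemas to cache from both sides. The `Relation` and `Node` classes and the `schemas_to_cache` helper are hypothetical stand-ins, not dbt's actual classes or cache API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Relation:
    database: str
    schema: str
    identifier: str


@dataclass(frozen=True)
class Node:
    relation: Relation                         # dev/target location
    state_relation: Optional[Relation] = None  # prod location from --state, if any


def schemas_to_cache(nodes: Iterable[Node]) -> Set[Tuple[str, str]]:
    """Collect (database, schema) pairs for both the target and the state relations."""
    schemas: Set[Tuple[str, str]] = set()
    for node in nodes:
        schemas.add((node.relation.database, node.relation.schema))
        if node.state_relation is not None:
            schemas.add((node.state_relation.database, node.state_relation.schema))
    return schemas
```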

TODOs

Drop some comments below where I've left TODOs, or made some more inspired choices
Timing comparison of dbt clone --threads 50 --full-refresh (~3.5 minutes) versus create schema <dev> clone <prod> (~10 min) using our real Snowflake project with ~1k models in it. That's ~65% faster!
Add functional tests :) — in progress!
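
For reference, the "~65%" figure in the timing comparison is the reduction in wall-clock time. A quick check of the arithmetic, using the approximate timings quoted above:

```python
clone_minutes = 3.5           # dbt clone --threads 50 --full-refresh (approximate)
schema_clone_minutes = 10.0   # create schema <dev> clone <prod> (approximate)

# Fraction of wall-clock time saved relative to the schema-level clone.
time_saved = (schema_clone_minutes - clone_minutes) / schema_clone_minutes
# Equivalent speedup factor.
speedup = schema_clone_minutes / clone_minutes

print(f"{time_saved:.0%} less wall-clock time, {speedup:.1f}x speedup")
```

In other words, roughly 65% less time, or a bit under a 3x speedup.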

Example

I have a seed, a view model, and a table model.

From logs/dbt.log:

19:07:08.285835 [debug] [Thread-1  ]: Using snowflake connection "model.test.another_model"
19:07:08.286490 [debug] [Thread-1  ]: On model.test.another_model: /* {"app": "dbt", "dbt_version": "1.5.0b5", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.test.another_model"} */
create or replace
      transient
      table analytics.dbt_jcohen.another_model
      clone analytics.dbt_jcohen_prod.another_model
19:07:08.287087 [debug] [Thread-2  ]: Using snowflake connection "model.test.my_model"
19:07:08.287513 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
19:07:08.287933 [debug] [Thread-2  ]: On model.test.my_model: /* {"app": "dbt", "dbt_version": "1.5.0b5", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.test.my_model"} */
create or replace   view analytics.dbt_jcohen.my_model
  
   as (
    
        select * from analytics.dbt_jcohen_prod.my_model
    
  );
19:07:08.288501 [debug] [Thread-3  ]: Using snowflake connection "seed.test.my_seed"
19:07:08.289204 [debug] [Thread-2  ]: Opening a new connection, currently in state closed
19:07:08.289592 [debug] [Thread-3  ]: On seed.test.my_seed: /* {"app": "dbt", "dbt_version": "1.5.0b5", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "seed.test.my_seed"} */
create or replace
      transient
      table analytics.dbt_jcohen.my_seed
      clone analytics.dbt_jcohen_prod.my_seed
19:07:08.290207 [debug] [Thread-3  ]: Opening a new connection, currently in state init
19:07:09.648042 [debug] [Thread-2  ]: SQL status: SUCCESS 1 in 1.0 seconds
19:07:09.669806 [debug] [Thread-2  ]: Timing info for model.test.my_model (execute): 19:07:08.250276 => 19:07:09.669642
19:07:09.670302 [debug] [Thread-2  ]: On model.test.my_model: Close
19:07:10.163990 [debug] [Thread-2  ]: Finished running node model.test.my_model
19:07:10.267498 [debug] [Thread-1  ]: SQL status: SUCCESS 1 in 2.0 seconds
19:07:10.273490 [debug] [Thread-1  ]: Timing info for model.test.another_model (execute): 19:07:08.231088 => 19:07:10.273041
19:07:10.274599 [debug] [Thread-1  ]: On model.test.another_model: Close
19:07:10.411055 [debug] [Thread-3  ]: SQL status: SUCCESS 1 in 2.0 seconds
19:07:10.415530 [debug] [Thread-3  ]: Timing info for seed.test.my_seed (execute): 19:07:08.275505 => 19:07:10.415249

Checklist

I have read the contributing guide and understand what's expected of me
I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR — in progress!
I have opened an issue to add/update docs, or docs changes are not required/relevant for this PR
I have run changie new to create a changelog entry

jtcohen6 · 2023-04-01T17:39:39Z

core/dbt/include/global_project/macros/materializations/models/clone.sql

+
+{% macro snowflake__get_clone_table_sql(this_relation, state_relation) %}
+    create or replace
+      {{ "transient" if config.get("transient", true) }}


Snowflake requires that table clones match the transience/permanence of the table they're cloning.

We determine whether the relation is a table based on the cache result from the other/prod schema, but here we're just using the current (dev) configuration for transient. There's the possibility of a mismatch if a user has updated the transient config in development.
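
One way to picture the mismatch is sketched below. Both helpers are hypothetical, and they assume the prod table's transience is known at clone time, which this PR does not establish.

```python
def transient_mismatch(prod_is_transient: bool, dev_config_transient: bool) -> bool:
    """True when the dev config disagrees with the prod table being cloned."""
    return prod_is_transient != dev_config_transient


def clone_keyword(prod_is_transient: bool) -> str:
    """Pick the keyword from the prod table's transience rather than the dev config."""
    # Assumption: prod transience is available, e.g. from relation metadata.
    return "transient" if prod_is_transient else ""
```

With something like this, a mismatch could be surfaced as a warning, or the clone could follow the prod table's setting instead of the dev config.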

jtcohen6 · 2023-04-02T14:06:10Z

core/dbt/context/providers.py

@@ -1362,6 +1362,20 @@ def this(self) -> Optional[RelationProxy]:
            return None
        return self.db_wrapper.Relation.create_from(self.config, self.model)

+    @contextproperty
+    def state_relation(self) -> Optional[RelationProxy]:


Open to naming suggestions! This will only be available in the context for the clone command currently

Is "relation" being used as a byword for table/view? "Relation" conveys so many things, and I feel like we want an additional term, or a better term, that can be more specific about what you're trying to achieve. Perhaps "stateful_db_relation". I really don't know the subtleties here though, so you might have picked the best one.

jtcohen6 · 2023-04-02T14:07:25Z

core/dbt/contracts/graph/manifest.py

+                state_relation = RelationalNode(
+                    other_node.database, other_node.schema, other_node.alias
+                )
+                self.nodes[unique_id] = current.replace(state_relation=state_relation)


We're storing information about each node's production-state counterpart, right on the node entry in the manifest. I'm open to discussing whether this is the right approach. It feels better than passing the entire other manifest around into other methods/runners.
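
A simplified sketch of this approach, outside of dbt, might look like the following. The `Node` and `StateRelation` dataclasses and the `add_state_relations` helper are hypothetical stand-ins for the manifest classes, and they use `dataclasses.replace` where the diff calls `current.replace(...)`.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class StateRelation:
    database: str
    schema: str
    alias: str


@dataclass(frozen=True)
class Node:
    unique_id: str
    database: str
    schema: str
    alias: str
    state_relation: Optional[StateRelation] = None


def add_state_relations(current: Dict[str, Node], other: Dict[str, Node]) -> Dict[str, Node]:
    """Return a copy of `current` where nodes also present in `other` record their prod location."""
    merged = dict(current)
    for unique_id, node in current.items():
        other_node = other.get(unique_id)
        if other_node is not None:
            state_relation = StateRelation(other_node.database, other_node.schema, other_node.alias)
            merged[unique_id] = replace(node, state_relation=state_relation)
    return merged
```

Downstream code then only needs the node itself to find its prod counterpart, rather than a reference to the whole other manifest.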

jtcohen6 · 2023-04-02T14:08:55Z

core/dbt/contracts/graph/nodes.py

@@ -567,6 +571,7 @@ class HookNode(CompiledNode):
 class ModelNode(CompiledNode):
    resource_type: NodeType = field(metadata={"restrict": [NodeType.Model]})
    access: AccessType = AccessType.Protected
+    state_relation: Optional[RelationalNode] = None


Given the approach of storing stateful information on each node about its prod counterpart, we do need a place on the node object to do that. For now, I'm adding a new attribute (default None) to models, seeds, and snapshots — the three refable nodes that are eligible for deferral & cloning.

jtcohen6 · 2023-04-02T14:10:34Z

core/dbt/include/global_project/macros/materializations/models/clone.sql

+{% macro can_clone_tables() %}
+    {{ return(adapter.dispatch('can_clone_tables', 'dbt')()) }}
+{% endmacro %}
+
+
+{% macro default__can_clone_tables() %}
+    {{ return(False) }}
+{% endmacro %}
+
+
+{% macro snowflake__can_clone_tables() %}
+    {{ return(True) }}
+{% endmacro %}


This kind of True/False conditional behavior might be better as an adapter property/method (Python), since it's really a property of the adapter / data platform, rather than something a specific user wants to reimplement. Comparable to the "boolean macros" we defined for logic around grants. Open to thoughts!

Just to state the obvious: any snowflake__ macros would want to move to dbt-snowflake as part of implementing & testing this on our adapters! I'm just defining them here for now for the sake of comparison & convenience (= laziness)
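
As a sketch of the adapter-property alternative mentioned above, the capability could live on the adapter class, with a default that platform-specific adapters override. The class names below are illustrative only and are not dbt's real adapter classes.

```python
class BaseAdapter:
    """Stand-in for an adapter base class; not dbt's real class."""

    @classmethod
    def can_clone_tables(cls) -> bool:
        # Default: assume the platform cannot clone tables.
        return False


class SnowflakeLikeAdapter(BaseAdapter):
    @classmethod
    def can_clone_tables(cls) -> bool:
        # Platforms with table cloning opt in by overriding the default.
        return True


class PostgresLikeAdapter(BaseAdapter):
    pass  # inherits the False default
```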

I concur with both comments

jtcohen6 · 2023-04-02T14:28:15Z

core/dbt/task/runnable.py

+            if get_flags().CACHE_SELECTED_ONLY is True:
+                required_schemas = self.get_model_schemas(adapter, selected_uids)
+                self.populate_adapter_cache(adapter, required_schemas)
+            else:
+                self.populate_adapter_cache(adapter)


As in RunTask.before_run (comment above)

jtcohen6 · 2023-04-02T14:29:26Z

tests/functional/defer_state/test_defer_state.py

+        # TODO: need an "adapter zone" version of this test that checks to see
+        # how many of the cloned objects are "pointers" (views) versus "true clones" (tables)
+        # e.g. on Postgres we expect to see 4 views
+        # whereas on Snowflake we'd expect to see 3 cloned tables + 1 view


Either this test, or some extension of it, should be in the "adapter zone" so that we can verify this functionality:

On data platforms that support create table <dev> clone <prod>, let's use it

Otherwise, create views that are simple pointers (create view <dev> as select * from <prod>)
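
The expectation in the TODO could be expressed along these lines. `expected_clone_types` and the node names in `prod` are hypothetical; the counts match the ones in the code comment (4 views on Postgres, 3 cloned tables + 1 view on Snowflake).

```python
from collections import Counter
from typing import Dict


def expected_clone_types(prod_types: Dict[str, str], can_clone_tables: bool) -> Counter:
    """Count how many dev objects should be tables versus views after a clone."""
    counts: Counter = Counter()
    for relation_type in prod_types.values():
        if can_clone_tables and relation_type == "table":
            counts["table"] += 1
        else:
            counts["view"] += 1
    return counts


# Hypothetical project with three tables and one view in prod.
prod = {"seed": "table", "table_model": "table", "snapshot": "table", "view_model": "view"}
```

An adapter-level test could then compare these counts against the relation types actually found in the target schema after running the clone.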

jtcohen6 · 2023-04-02T14:30:27Z

core/dbt/include/global_project/macros/materializations/models/clone.sql

+  -- If this is a database that can do zero-copy cloning of tables, and the other relation is a table, then this will be a table
+  -- Otherwise, this will be a view


Data platforms that support table cloning:

Snowflake (docs)

BigQuery (docs)

Databricks (docs), with two modes: "shallow" (zero-copy) and "deep" (full copy)

Data platforms that don't:

Postgres

Redshift

Trino

jtcohen6 · 2023-04-02T14:30:37Z

core/dbt/include/global_project/macros/materializations/models/clone.sql

+      {% set should_revoke = should_revoke(existing_relation, full_refresh_mode=True) %}
+      {% do apply_grants(target_relation, grant_config, should_revoke=should_revoke) %}
+      {% do persist_docs(target_relation, model) %}


I'm thinking we should still apply grants & table/column-level comments. I have a suspicion that whether these things are copied over, during cloning, varies by data platform; I should really look to confirm/reject that suspicion. It's also possible that the user has defined conditional logic for these that differs between dev & prod, especially grants.

jtcohen6 · 2023-04-02T14:37:46Z

core/dbt/contracts/graph/manifest.py

+            if "state_relation" in node:
+                del node["state_relation"]


It doesn't feel like state_relation is a thing we should need/want to include in serialized manifest.json. It's really just for our internal use.
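
In isolation, the idea is just to drop the key from each node's dictionary before writing the artifact. A minimal sketch, assuming a manifest dictionary with a top-level "nodes" mapping (`strip_state_relation` is a hypothetical helper):

```python
from typing import Any, Dict


def strip_state_relation(manifest_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Remove the internal-only `state_relation` key from every serialized node, in place."""
    for node in manifest_dict.get("nodes", {}).values():
        # pop with a default is a no-op for nodes that never had the key.
        node.pop("state_relation", None)
    return manifest_dict
```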

sungchun12 · 2023-04-05T16:21:19Z

Holy crap. Jerco and team, how are you shipping soooo much value right now?!?!

jtcohen6 · 2023-04-10T09:09:57Z

A decent amount of time is spent checking/updating the cache, which locks across all the concurrent threads. This could be even faster (2 min for 1k models) with the change proposed in #6844.

I think we'd want to modify the behavior of run_queue for this command, such that it:

Does not skip on the first failure
Does not require running nodes in graph order, or skipping nodes downstream of any failed nodes
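
A rough sketch of that queue behavior, using a plain thread pool rather than dbt's run_queue: every node is submitted up front with no ordering constraint, and failures are recorded instead of stopping the run. `run_all` and `clone_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def run_all(
    node_ids: Iterable[str],
    clone_one: Callable[[str], None],
    threads: int = 4,
) -> Dict[str, Optional[Exception]]:
    """Run clone_one for every node, ignoring graph order, and record failures instead of stopping."""
    results: Dict[str, Optional[Exception]] = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(clone_one, node_id): node_id for node_id in node_ids}
        for future in as_completed(futures):
            # future.exception() is None on success, so one failure never cancels the others.
            results[futures[future]] = future.exception()
    return results
```

This works for cloning because each node only depends on its prod counterpart, not on other nodes in the dev schema.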

epapineau · 2023-04-19T19:25:40Z

Candidate for favorite PR of the year 👏🏻

VersusFacit · 2023-05-24T03:08:27Z

core/dbt/task/run.py

@@ -444,7 +445,10 @@ def before_run(self, adapter, selected_uids: AbstractSet[str]):
        with adapter.connection_named("master"):


probably should not be master if we're trying to avoid that language elsewhere

VersusFacit · 2023-05-24T03:11:14Z

core/dbt/context/providers.py

+        """
+        For commands which add information about this node's corresponding
+        production version (via a --state artifact), access the Relation
+        object for that stateful other


other feels vague (well, really, just Latinate/French in phrasing :) ). "source node", etc.?

VersusFacit · 2023-05-24T03:13:18Z

core/dbt/include/global_project/macros/materializations/models/clone.sql

+
+
+{% macro default__get_pointer_sql(to_relation) %}
+    {% set pointer_sql %}


I feel uncomfortable with this use of pointer. I'm not exactly sure why we're drawing that comparison, when perhaps we should use something closer to reference, or allude to the fact that there's a shadow copy. In my mind, pointer is just a very specific thing and somewhat anachronistic in a SQL context.

jtcohen6 · 2023-06-16T17:12:45Z

closing in favor of #7881 :)

jtcohen6 added 2 commits April 1, 2023 19:35

Propose new 'clone' command

ce1dc81

Add changelog entry

7e94302

cla-bot bot added the cla:yes label Apr 1, 2023

jtcohen6 added 2 commits April 2, 2023 16:01

Add test, fix failures

df4ff4c

More test fixups

2543d5d

jtcohen6 commented Apr 2, 2023

Add logging, don't raise on first error

5f08d70

jtcohen6 mentioned this pull request Apr 10, 2023

[CT-2348] [Feature] dbt clone command #7256

Closed

3 tasks

jtcohen6 mentioned this pull request Apr 21, 2023

[CT-2460] [Feature] Infer schema from prod, to enforce contract & detect breaking changes in dev #7432

Open

stu-k mentioned this pull request May 16, 2023

Add "other" relation to reffable node classes #7645

Merged

6 tasks

gshank added 2 commits May 16, 2023 16:00

Merge branch 'main' into jerco/clone-command

a08eea3

use StateRelation instead of RelationalNode

0b25b9c

VersusFacit reviewed May 24, 2023

jtcohen6 mentioned this pull request May 26, 2023

[CT-466] [Feature] Make run-operation accept selectors to be able to use the selected_resources Jinja variable #5005

Open

1 task

jtcohen6 mentioned this pull request Jun 6, 2023

[SPIKE] [CT-2650] Materialization macros should support dispatch #7799

Closed

aranke force-pushed the jerco/clone-command branch from 737c7c4 to 0b25b9c Compare June 15, 2023 03:05

aranke mentioned this pull request Jun 16, 2023

dbt clone #7881

Merged

6 tasks

jtcohen6 closed this Jun 16, 2023

McKnight-42 mentioned this pull request Jun 30, 2023

dbt_clone macros, materialization and tests to dbt-spark dbt-labs/dbt-spark#816

Merged

6 tasks

aranke deleted the jerco/clone-command branch July 20, 2023 14:45

jtcohen6 mentioned this pull request Aug 23, 2023

[CT-2723] [spike+] Maximally parallelize dbt clone operations, a different mechanism for processing a queue #7914

Closed

nszoni mentioned this pull request Feb 20, 2024

v1.6.0rc1 microsoft/dbt-synapse#202

Merged
