Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New command: dbt clone #7258

Closed
wants to merge 7 commits into from
Closed

New command: dbt clone #7258

wants to merge 7 commits into from

Conversation

jtcohen6
Copy link
Contributor

@jtcohen6 jtcohen6 commented Apr 1, 2023

resolves #7256

$ dbt clone --state state/ --full-refresh

Description

Introduce clone command, which also makes use of a clone materialization.

  • On data platforms that support create table <dev> clone <prod>, let's use it
  • Otherwise, create views that are simple pointers (create view <dev> as select * from <prod>)
  • Don't recreate objects that already exist in the target schemas, unless --full-refresh is passed

Some interesting behaviors around:

  • accessing the --state manifest in a way that's similar to, but not exactly the same as, --defer
  • caching: we want to cache the other/prod schemas, in addition to the dev/target ones

TODOs

  • Drop some comments below where I've left TODOs, or made some more inspired choices
  • Timing comparison of dbt clone --threads 50 --full-refresh (~3.5 minutes) versus create schema <dev> clone <prod> (~10 min) using our real Snowflake project with ~1k models in it. That's ~65% faster!
  • Add functional tests :) — in progress!

Example

I have a seed, a view model, and a table model.

From logs/dbt.log:

19:07:08.285835 [debug] [Thread-1  ]: Using snowflake connection "model.test.another_model"
19:07:08.286490 [debug] [Thread-1  ]: On model.test.another_model: /* {"app": "dbt", "dbt_version": "1.5.0b5", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.test.another_model"} */
create or replace
      transient
      table analytics.dbt_jcohen.another_model
      clone analytics.dbt_jcohen_prod.another_model
19:07:08.287087 [debug] [Thread-2  ]: Using snowflake connection "model.test.my_model"
19:07:08.287513 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
19:07:08.287933 [debug] [Thread-2  ]: On model.test.my_model: /* {"app": "dbt", "dbt_version": "1.5.0b5", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "model.test.my_model"} */
create or replace   view analytics.dbt_jcohen.my_model
  
   as (
    
        select * from analytics.dbt_jcohen_prod.my_model
    
  );
19:07:08.288501 [debug] [Thread-3  ]: Using snowflake connection "seed.test.my_seed"
19:07:08.289204 [debug] [Thread-2  ]: Opening a new connection, currently in state closed
19:07:08.289592 [debug] [Thread-3  ]: On seed.test.my_seed: /* {"app": "dbt", "dbt_version": "1.5.0b5", "profile_name": "sandbox-snowflake", "target_name": "dev", "node_id": "seed.test.my_seed"} */
create or replace
      transient
      table analytics.dbt_jcohen.my_seed
      clone analytics.dbt_jcohen_prod.my_seed
19:07:08.290207 [debug] [Thread-3  ]: Opening a new connection, currently in state init
19:07:09.648042 [debug] [Thread-2  ]: SQL status: SUCCESS 1 in 1.0 seconds
19:07:09.669806 [debug] [Thread-2  ]: Timing info for model.test.my_model (execute): 19:07:08.250276 => 19:07:09.669642
19:07:09.670302 [debug] [Thread-2  ]: On model.test.my_model: Close
19:07:10.163990 [debug] [Thread-2  ]: Finished running node model.test.my_model
19:07:10.267498 [debug] [Thread-1  ]: SQL status: SUCCESS 1 in 2.0 seconds
19:07:10.273490 [debug] [Thread-1  ]: Timing info for model.test.another_model (execute): 19:07:08.231088 => 19:07:10.273041
19:07:10.274599 [debug] [Thread-1  ]: On model.test.another_model: Close
19:07:10.411055 [debug] [Thread-3  ]: SQL status: SUCCESS 1 in 2.0 seconds
19:07:10.415530 [debug] [Thread-3  ]: Timing info for seed.test.my_seed (execute): 19:07:08.275505 => 19:07:10.415249

Checklist

@cla-bot cla-bot bot added the cla:yes label Apr 1, 2023

{% macro snowflake__get_clone_table_sql(this_relation, state_relation) %}
create or replace
{{ "transient" if config.get("transient", true) }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snowflake requires that table clones match the transience/permanence of the table they're cloning.

We determine if the table is a table based on the cache result from the other/prod schema, but here we're just using the current (dev) configuration for transient. There's the possibility of a mismatch if a user has updated the transient config in development.

@@ -1362,6 +1362,20 @@ def this(self) -> Optional[RelationProxy]:
return None
return self.db_wrapper.Relation.create_from(self.config, self.model)

@contextproperty
def state_relation(self) -> Optional[RelationProxy]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to naming suggestions! This will only be available in the context for the clone command currently

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "relation" being used as a byword for table/view. Relation convene, so many things, and I feel like we want an additional term or a better term that can be more specific about what you're trying to achieve. Perhaps "stateful_db_relation". I really don't know the subtleties here though, so you might have picked the best one.

Comment on lines 1051 to 1054
state_relation = RelationalNode(
other_node.database, other_node.schema, other_node.alias
)
self.nodes[unique_id] = current.replace(state_relation=state_relation)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're storing information about each node's production-state counterpart, right on the node entry in the manifest. I'm open to discussing whether this is the right approach. It feels better than passing the entire other manifest around into other methods/runners.

@@ -567,6 +571,7 @@ class HookNode(CompiledNode):
class ModelNode(CompiledNode):
resource_type: NodeType = field(metadata={"restrict": [NodeType.Model]})
access: AccessType = AccessType.Protected
state_relation: Optional[RelationalNode] = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the approach of storing stateful information on each node about its prod counterpart, we do need a place on the node object to do that. For now, I'm adding a new attribute (default None) to models, seeds, and snapshots — the three refable nodes that are eligible for deferral & cloning.

Comment on lines +1 to +13
{% macro can_clone_tables() %}
{{ return(adapter.dispatch('can_clone_tables', 'dbt')()) }}
{% endmacro %}


{% macro default__can_clone_tables() %}
{{ return(False) }}
{% endmacro %}


{% macro snowflake__can_clone_tables() %}
{{ return(True) }}
{% endmacro %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of True/False conditional behavior might be better as an adapter property/method (Python), since it's really a property of the adapter / data platform, rather than something a specific user wants to reimplement. Comparable to the "boolean macros" we defined for logic around grants. Open to thoughts!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to state the obvious: any snowflake__ macros would want to move to dbt-snowflake as part of implementing & testing this on our adapters! I'm just defining them here for now for the sake of comparison & convenience (= laziness)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur with both comments

Comment on lines +384 to +388
if get_flags().CACHE_SELECTED_ONLY is True:
required_schemas = self.get_model_schemas(adapter, selected_uids)
self.populate_adapter_cache(adapter, required_schemas)
else:
self.populate_adapter_cache(adapter)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in RunTask.before_run (comment above)

Comment on lines +319 to +322
# TODO: need an "adapter zone" version of this test that checks to see
# how many of the cloned objects are "pointers" (views) versus "true clones" (tables)
# e.g. on Postgres we expect to see 4 views
# whereas on Snowflake we'd expect to see 3 cloned tables + 1 view
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either this test, or some extension of it, should be in the "adapter zone" so that we can verify this functionality:

  • On data platforms that support create table <dev> clone <prod>, let's use it
  • Otherwise, create views that are simple pointers (create view <dev> as select * from <prod>)

Comment on lines +60 to +61
-- If this is a database that can do zero-copy cloning of tables, and the other relation is a table, then this will be a table
-- Otherwise, this will be a view
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data platforms that support table cloning:

  • Snowflake (docs)
  • BigQuery (docs)
  • Databricks (docs, with two modes: "shallow" (zero-copy) and "deep" (full copy)

Data platforms that don't:

  • Postgres
  • Redshift
  • Trino

Comment on lines +78 to +80
{% set should_revoke = should_revoke(existing_relation, full_refresh_mode=True) %}
{% do apply_grants(target_relation, grant_config, should_revoke=should_revoke) %}
{% do persist_docs(target_relation, model) %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we should still apply grants & table/column-level comments. I have a suspicion that whether these things are copied over, during cloning, varies by data platform; I should really look to confirm/reject that suspicion. It's also possible that the user has defined conditional logic for these that differs between dev & prod, especially grants.

Comment on lines +1260 to +1261
if "state_relation" in node:
del node["state_relation"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't feel like state_relation is a thing we should need/want to include in serialized manifest.json. It's really just for our internal use.

@sungchun12
Copy link
Contributor

Holy crap. Jerco and team, how are you shipping soooo much value right now?!?!

@jtcohen6
Copy link
Contributor Author

A decent amount of time is spent checking/updating the cache, which locks across all the concurrent threads. This could be even faster (2 min for 1k models) with the change proposed in #6844.

I think we'd want to modify the behavior of run_queue for this command, such that it:

  • Does not skip on the first failure
  • Does not require running nodes in graph order, or skipping nodes downstream of any failed nodes

@epapineau
Copy link
Contributor

Candidate for favorite PR of the year 👏🏻

@@ -444,7 +445,10 @@ def before_run(self, adapter, selected_uids: AbstractSet[str]):
with adapter.connection_named("master"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should not be master if we're trying to avoid that language elsewhere

"""
For commands which add information about this node's corresponding
production version (via a --state artifact), access the Relation
object for that stateful other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other feels vague (well, really, just Latinate/French in phrasing :) ). "source node", etc.?



{% macro default__get_pointer_sql(to_relation) %}
{% set pointer_sql %}
Copy link
Contributor

@VersusFacit VersusFacit May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel uncomfortable with this use of pointer. I'm not exactly sure why were drawing that comparison, when perhaps we should use something closer to reference or allude to the fact that there's a shadow copy. In my mind pointer is just a very specific thing and somewhat anachronistic in a SQL context.

@jtcohen6
Copy link
Contributor Author

closing in favor of #7881 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-2348] [Feature] dbt clone command
5 participants