Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3162] [Feature] Ability to dbt clone sources #8722

Open
3 tasks done
leoebfolsom opened this issue Sep 26, 2023 · 4 comments
Open
3 tasks done

[CT-3162] [Feature] Ability to dbt clone sources #8722

leoebfolsom opened this issue Sep 26, 2023 · 4 comments
Labels
clone related to the dbt clone command enhancement New feature or request

Comments

@leoebfolsom
Copy link
Contributor

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I wish to dbt clone sources that are defined in the yml.

Why?

I want to analyze dbt code changes and their potential impact on the data that aren't easily found by dbt tests, such as (non-exhaustively) ...

  • value differences that still pass dbt tests, such as 60.0 changing to 60.1, or Leo changing to Leon
  • fan-out that doesn't violate uniqueness tests
  • changes in row counts caused by missing/added PKs

... before merging to main.

Can't you just compare a PR build to a prod build?
Not quite. Sources change over time, and for many teams, data loads to the DWH (leading to updated sources) more frequently than prod dbt jobs run. So, when building data in CI, unless I point upstream to sources that were cloned at the time of the last dbt run, the DAG is non-idempotent 😭 .

In other words: dbt runs at different times using the same code have different results.

In more other words: differences in the data are explained not by code changes in the PR, but rather by shifting upstream source data.

This makes it hard to test code changes for regressions because there's another variable (shifting source data).

How does dbt cloneing sources help with this problem?

If I could rebuild part or all of the DAG later using clones of sources that were created at the time of the last production run, I could ensure downstream models were built using identical, non-updated source data. Then, I could do a better job of validating that dbt code changes in the PR are not going to break stuff before I hit merge.

Why not use deferral / Slim CI?

Deferral and Slim CI only solves this problem if the model does not depend on sources.

While it is "best practice" to have a base layer, which will significantly reduce the described pain; in practice / in the field, many orgs will not do this, or are not ready to do this yet, for whatever reason.

Furthermore, even if you have a base layer, that base layer might be modified in a PR, which would again lead to shifting source data, because they would be refetched from the DWH; so having a base layer is really just a 90%-of-the-way-there workaround.

Describe alternatives you've considered

  • Taking snapshots somehow, or leveraging DWH time travel (also somehow), and orchestrating manually
  • Trying to get everyone who uses dbt to build a base layer, so that deferral to prod pretty much always addresses this issue
    • The issue would still exist because sources would be referenced when base models are updated, leading to shifting-over-time source data
  • Creating two versions of the DAG (subsets, presumably) using two difference source code bases in CI
    • This could mostly solve the problem, but:
      • There could still be slight differences in source data because of timing (dbt run on prod-branch at 10:00, dbt run on pr-branch at 10:05)
      • This approach is a waste of build time and DWH resources, as compared to leveraging sources and comparing to existing prod data.

Who will this benefit?

dbt users who want to implement rigorous regression testing.

Are you interested in contributing this feature?

I probably can't help much with the dbt-core update, but happy to review, or kick around ideas, or make beginner contribs, if helpful

Anything else?

Slack discussion, where @dave-connors-3 suggested I open an issue. 🙇

@leoebfolsom leoebfolsom added enhancement New feature or request triage labels Sep 26, 2023
@github-actions github-actions bot changed the title [Feature] Ability to dbt clone sources during a build [CT-3162] [Feature] Ability to dbt clone sources during a build Sep 26, 2023
@leoebfolsom leoebfolsom changed the title [CT-3162] [Feature] Ability to dbt clone sources during a build [CT-3162] [Feature] Ability to dbt clone sources Sep 26, 2023
@dataders
Copy link
Contributor

I appreciate you taking time to write up both here and in the Slack thread you mentioned!

My initial reaction, is that this request is a no-brainer. This should be possible.

However, the biggest barrier to shipping this is architecture of dbt-core, in that today, sources are distinct from what we call nodes (models, seeds, snapshots, tests). This is apparent in the schema of our manifest, in that sources and nodes are distinct top-level keys.

The way clone works today is purely within the realm of nodes, which is why we cannot do this today.

It's no stretch for me to imagine a dbt of the future were sources are nodes. However, this is uncharted territory and would certainly require some non-trivial refactoring to facilitate what is really a conceptual/architectural shift.

We're thinking a lot of sources right now. #8617 is not about clone, but presents an opportunity to further develop sources in a direction conducive to your ask here.

Below I offer a way to unblock you today, but I won't rule out that perhaps there is short- or medium-term work that might enable you more elegantly without a fundamental rework of dbt key concepts. @leoebfolsom happy to brainstorm more with you on this.

kludgy workaround

cleverly generating wrapper view models of sources then cloning those.

@dataders dataders removed the triage label Sep 27, 2023
@leoebfolsom
Copy link
Contributor Author

leoebfolsom commented Sep 28, 2023

Thanks @dataders! Got it. It doesn't surprise me that this can't be a quick win. I'm glad to have your eyes on it, and hopefully as other issues are addressed, this one will easily follow, when possible!

Re: the kludgy workaround (TKW), "cleverly generating wrapper view models of sources then cloning those", that sounds a lot like a base layer to me, unless I'm missing something? If this was only for my org's dbt project, I could potentially advocate for it. But ... I'm looking for a solution that could be implemented across many orgs widely different cultures and reqs; which, to be fair, is how dbt-core should work. So I'm saying: good idea, could work for some, will definitely keep it up my sleeve, but not a resolution of the issue. We agree.

What I'm currently exploring and recommending is: the creation of a "representation of prod" that is built in CI. Imagine PR_NUM_123 and PR_NUM_123_PROD existing side-by-side in your DWH. They are built at nearly the same time, and so CI data drift is basically, but not necessarily completely, eliminated.

I do see that if all nodes were treated the same--let nodes be nodes!--as is brilliantly and hilariously articulated here by @alison985, a lot of this would be sorted and unblocked.

Exciting! Forward!

@dataders
Copy link
Contributor

dataders commented Sep 28, 2023

while we're on the subject

@aranke flagged #7442 as a stopgap solution that we'll ship in support of contracted models. It explains more of the chewy nuance that's less about sources, and more about delivering something here that is simultaneously intuitive, non-magical (muggle-y?) and without footguns. Is anything about the approach feeling robust enough to be more widely used pattern than a kludgy workaround for a single data team?

another way in

another workaround he surfaced was the possibility of invoking the create_or_replace_clone() macro directly. This is effectively the pith of the clone materialization, but would unblock you to clone sources. Even if it's not sturdy enough to serve in "production" for many folks today, a proof of concept of a clone_source() macro could serve at the reference we use when we are ready to do this the right way.

if you hack it, it they will clone

If I were to do this, I'd be leaning heavily on the following, and the result might look like the below

⚠️ caveat clonitur ⚠️
some psuedocode I wrote didn't even try running

{% macro folsom_cloning_blues() %}
{% set source_nodes = graph.sources.values() if graph.sources else [] %}
{% for source in source_nodes %}
    {% for node in source %}
        {# the juicy bit -- stdout logging here is helpful #}
        {% set indigo_schema = node.schema ~'__clone' %}
        {# i think this is what incorporate does? #}
        {% set indigo_node = node.incorporate(schema=indigo_schema) %} 
        {%- call statement('doug_coupland_is_right', fetch_result=False) -%}
            create_or_replace_clone(node, indigo_node)
        {%- endcall -%}
    {% endfor %}
{% endfor %}
{% endmacro %}

p.s. calling out that this sounds similar to blue-green deployments. can we brand this blue-green-indigo deployment (orange-blue-green)?

"representation of prod" that is side-by-side in your DWH

@dataders
Copy link
Contributor

dataders commented Nov 1, 2023

One month later, and I myself find myself wanting to clone a source. I tried to make a Dynamic table off of a source, and got this message.

003001 (42501): SQL access control error:
Insufficient privileges to operate on base table to automatically enable CHANGE_TRACKING for dynamic table 'dataders_test'

naturally, I'm thinking

oh I'll just clone this table somewhere so I can control it.

Which, I cannot do with dbt clone today

@dbeatty10 dbeatty10 added the clone related to the dbt clone command label Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clone related to the dbt clone command enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants