Feature/generate_model_import_ctes #74

graciegoheen · 2022-08-11T22:11:20Z

This is a:

bug fix PR with no breaking changes — please ensure the base branch is main
new functionality — please ensure the base branch is the latest dev/ branch
a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

I created a macro to generate the SQL for a given model with all references pulled up into import CTEs, which you can then paste back into the model.

Checklist

I have verified that these changes work locally
I have updated the README.md (if applicable)
I have added tests & descriptions to my models (and macros if applicable)
I have added an entry to CHANGELOG.md

README.md

christineberger · 2022-08-12T15:03:43Z

This is an awesome idea! I'm going to document my findings in this PR as I'm testing.

I wanted to test two things:

What does the macro do if there are more than one explicit reference to a table (for example, using orders outside of a subquery and using the same table inside of a subquery)
What does the macro do with explicit references when they are defined in different ways? For example, you could have an explicit reference like development.tpch_sf1.orders or tpch_sf1.orders.

I ran into my first issue, which is that the macro doesn't get the correct references and output the right import CTEs when the explicit reference has the database name attached.

Here is my model code:

WITH

-- Only orders which tie back to a real customer record
ro as (
    select
        o.o_orderkey,
        c.c_custkey
    from raw_tpch.tpch_sf1.orders o
    inner join raw_tpch.tpch_sf1.customer c
        on o.o_custkey = c.c_custkey
),

rl as (
    SELECT c.c_custkey
    ,c.c_name
    ,sum(revenue) AS revenue_lost
    ,c_acctbal
    ,n.n_name 
    ,c_address 
    ,c_phone 
    ,c_comment
    FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
        FROM raw_tpch.tpch_sf1.lineitem l LEFT JOIN raw_tpch.tpch_sf1.orders o ON l.l_orderkey = o.o_orderkey
        WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
    ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
    LEFT JOIN raw_tpch.tpch_sf1.customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch.tpch_sf1.nation n ON c.c_nationkey = n.n_nationkey
    WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)

select * from rl

and here is the compiled code from the macro:

with tpch_sf1 as (
    select * from raw_tpch.tpch_sf1
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
WITH

-- Only orders which tie back to a real customer record
ro as (
    select
        o.o_orderkey,
        c.c_custkey
    from tpch_sf1.orders o
    inner join tpch_sf1.customer c
        on o.o_custkey = c.c_custkey
),

rl as (
    SELECT c.c_custkey
    ,c.c_name
    ,sum(revenue) AS revenue_lost
    ,c_acctbal
    ,n.n_name 
    ,c_address 
    ,c_phone 
    ,c_comment
    FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
        FROM tpch_sf1.lineitem l LEFT JOIN tpch_sf1.orders o ON l.l_orderkey = o.o_orderkey
        WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
    ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
    LEFT JOIN tpch_sf1.customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN tpch_sf1.nation n ON c.c_nationkey = n.n_nationkey
    WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)

select * from rl

Update
I managed to get some correct-ish output adding these lines (and I'm sure you can do this more concisely, but just wanted to get it to work):

An addition to the matches: 'from_table_3':'(?i)(from|join)\s+([\["]?\w+[]\"]?\.)([\["]?\w+[]\"]?\.)([\["]?\w+[]\"]?)',

An addition to the CTE naming logic:

            {%- if regex_name == 'from_table_3' -%}
            {%- set cte_name = match[3]|replace("'","")|trim|lower -%}
            {%- else -%}
            {%- set cte_name = match[2]|replace("'","")|trim|lower -%}
            {%- endif -%}

An addition to the SQL generation: {%- elif from_obj[2] == 'from_table_1' or from_obj[2] == 'from_table_2' or from_obj[2] == 'from_table_3' %}

Output:

with raw_tpch_tpch_sf1_customer as (
    select * from raw_tpch.tpch_sf1.customer
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
raw_tpch_tpch_sf1_lineitem as (
    select * from raw_tpch.tpch_sf1.lineitem
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
raw_tpch_tpch_sf1_nation as (
    select * from raw_tpch.tpch_sf1.nation
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
raw_tpch_tpch_sf1_orders as (
    select * from raw_tpch.tpch_sf1.orders
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
tpch_sf1 as (
    select * from raw_tpch.tpch_sf1
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
WITH

-- Only orders which tie back to a real customer record
ro as (
    select
        o.o_orderkey,
        c.c_custkey
    from raw_tpch_tpch_sf1_orders o
    inner join raw_tpch_tpch_sf1_customer c
        on o.o_custkey = c.c_custkey
),

rl as (
    SELECT c.c_custkey
    ,c.c_name
    ,sum(revenue) AS revenue_lost
    ,c_acctbal
    ,n.n_name 
    ,c_address 
    ,c_phone 
    ,c_comment
    FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
        FROM raw_tpch_tpch_sf1_lineitem l LEFT JOIN raw_tpch_tpch_sf1_orders o ON l.l_orderkey = o.o_orderkey
        WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
    ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
    LEFT JOIN raw_tpch_tpch_sf1_customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch_tpch_sf1_nation n ON c.c_nationkey = n.n_nationkey
    WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)

select * from rl

graciegoheen · 2022-08-12T15:12:24Z

@christineberger Can I ask what you think should be done in this situation? Can we make the assumption that development.tpch_sf1.orders and tpch_sf1.orders are the same? Or is it just a matter of giving them unique names? I'm imaging a scenario in which there is development.tpch_sf1.orders, tpch_sf1.orders, and {{ ref('orders') }} - should these be treated as separate import CTEs and if so, what should the CTE naming convention be?

christineberger · 2022-08-12T15:22:20Z

@graciegoheen That's a really good question that I'm not sure there's a straight answer to, but we could always just make a v1 decision - I think in order to avoid conflicts, maybe we just treat them as separate import CTEs and the user will need to resolve that themselves.

(I think) we can assume that if there's an explicit reference without a database name / schema name, it's referring to the target location. If not, then we know the explicit location - so we could name a references like this:

raw_data.salesforce.orders -> raw_data_salesforce_orders
salesforce.orders -> <target_database>_salesforce_orders
orders -> <target_database>_<target_schema>_orders

I haven't thought this all the way through, but what I'm trying to say is yes to separate import CTES!

christineberger · 2022-08-12T15:23:46Z

Also, I'm just documenting what is working and what's not working, and those can either be limitations of the v1 or fixed in this PR - There's absolutely no issue in getting it working "for the most part" as long as we have the limitations documented!

macros/generate_model_import_ctes.sql

christineberger · 2022-08-12T17:01:45Z

macros/generate_model_import_ctes.sql

+    {%- set from_regexes = {
+        'from_ref':'(?i)(from|join)\s+({{\s*ref\s*\()([^)]+)(\)\s*}})',
+        'from_source':'(?i)(from|join)\s+({{\s*source\s*\([^)]+,)([^)]+)(\)\s*}})',
+        'from_table_1':'(?i)(from|join)\s+([\[`\"]?\w+[\]`\"]?)\.([\[`\"]?\w+[\]`\"]?)',


This may or may not be an issue depending on how this is restructured to capture 3-part vs. 2-part vs 1-part explicit references, but in the output while accounting for a 3-part reference, I'm seeing an additional cte for raw_tpch.sf_1 with no table. This is because one of the regex matches are capturing the pattern for a 2-part reference. To fix this temporarily I added a $ to the end of from_table_1 in order to tell it "just match this and nothing after it", but there are probably many ways you could be more concise or offer more flexibility without limiting the regex match in that way. Something to consider as part of the explicit references piece!

Output:

with raw_tpch_tpch_sf1_customer as ( select * from raw_tpch.tpch_sf1.customer -- CAUTION: It's best practice to use the ref or source function instead of a direct reference ), raw_tpch_tpch_sf1_lineitem as ( select * from raw_tpch.tpch_sf1.lineitem -- CAUTION: It's best practice to use the ref or source function instead of a direct reference ), raw_tpch_tpch_sf1_nation as ( select * from raw_tpch.tpch_sf1.nation -- CAUTION: It's best practice to use the ref or source function instead of a direct reference ), raw_tpch_tpch_sf1_orders as ( select * from raw_tpch.tpch_sf1.orders -- CAUTION: It's best practice to use the ref or source function instead of a direct reference ), -- Only orders which tie back to a real customer record ro as ( select o.o_orderkey, c.c_custkey from raw_tpch_tpch_sf1_orders o inner join raw_tpch_tpch_sf1_customer c on o.o_custkey = c.c_custkey ), rl as ( SELECT c.c_custkey ,c.c_name ,sum(revenue) AS revenue_lost ,c_acctbal ,n.n_name ,c_address ,c_phone ,c_comment FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE FROM raw_tpch_tpch_sf1_lineitem l LEFT JOIN raw_tpch_tpch_sf1_orders o ON l.l_orderkey = o.o_orderkey WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey LEFT JOIN raw_tpch_tpch_sf1_customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch_tpch_sf1_nation n ON c.c_nationkey = n.n_nationkey WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC ) select * from rl

I wonder if in another version we can work on pulling out subqueries into CTEs 🤔 If they have any joins anywhere after "from", we can assume that they have aliased the subquery with or without using AS. We can take the alias and create a CTE for it using the subquery text. If there's no joins within the entire from statement, then we could just temporarily name them. I imagine this will warrant an entire macro to itself!

Did a re-work on the replacing logic to be more precise - using regex sub instead of replace. Let me know if you still have this issue!

Also love the subquery idea - that would be a really cool addition!

macros/generate_model_import_ctes.sql

jeremyholtzman · 2022-08-12T18:43:17Z

@graciegoheen I don't think this has been mentioned yet, but I tested it out with the formatting of joins below, and it only caught the first table.

-- weird_joins.sql
select *
from 
    {{ ref('stg_tpch_customers') }},
    {{ ref('stg_tpch_nations') }}

where stg_tpch_customers.nation_id = stg_tpch_nations.nation_id

Results:

{{ generate_model_import_ctes('weird_joins') }}

with stg_tpch_customers as (
    select * from {{ ref('stg_tpch_customers') }}
  
),
select *
from 
    stg_tpch_customers,
    {{ ref('stg_tpch_nations') }}

where stg_tpch_customers.nation_id = stg_tpch_nations.nation_id

Co-authored-by: Benoit Perigaud <8754100+b-per@users.noreply.github.com>

graciegoheen · 2022-08-12T21:22:34Z

@jeremyholtzman TBH I didn't even know you could do joins like that - need to think of a way to actually capture that with regex...

…labs/dbt-codegen into feature/generate_model_ctes

dbeatty10 · 2022-08-13T15:00:53Z

@graciegoheen an idea for you to handle cases like the one that @jeremyholtzman surfaced:

import a SQL parser to do the dirty work for you

Here's a non-exhaustive list of parsers (your mileage will vary):

Example

SQLGlot is pretty cool -- it has a parser, transpiler, and optimizer, and (at the time of this writing) it has some level of dialect support for bigquery, clickhouse, duckdb, hive, mysql, oracle, postgres, presto, snowflake, spark, sqlite, starrocks, tableau (?!), and trino. (Kinda off-topic, but transpiling means it can read from one dialect and write out to another.)

Install

If you are using zsh or bash as your shell, you can copy-paste the commands below. Otherwise, follow the shell-specific instructions here.

python3 -m venv venv
source venv/bin/activate 
python3 -m pip install sqlglot

Run

Launch Python in interactive mode:

python

Create a SQL query using the Postgres dialect:

from sqlglot import parse_one, exp

# find all tables (x, y, z)
for table in parse_one("SELECT * FROM x JOIN y JOIN z", read="postgres").find_all(exp.Table):
    print(table.name)

This and other examples in their documentation.

dbeatty10 · 2022-08-13T18:28:03Z

Crazy idea # 1

look at the optimizer portion of SQLGlot for inspiration how to re-write a SQL query (using the parsed AST) with all references pulled up into import CTEs

Crazy idea # 2

(Which is not actually related to this pull request!)

create a custom dbt dialect for SQLGlot.

This dialect could replace database-specific keywords, functions, etc with their cross-database macro equivalents.

Idea # 2 would be most relevant for code that needs/wants to be cross-database compatible, like packages on the dbt Hub.

b-per · 2022-08-15T09:47:28Z

I personally feel that we might tweak the regexes to make them work in most of the cases (even if they won't work 100% of the time) but avoid using third-party libraries. (I'd be keen to have as many tests as possible on the regexes though because backward compatibility can be difficult to identify with regexes whenever we do a change)

Without third-party libraries, we can run the macro in dbt Cloud (even though we may not catch 100% of the use cases)
I think that SQLFluff already handles getting it 100% correct (see Rule 42 that supports the fix command) if someone can use Python locally

christineberger · 2022-08-31T16:37:30Z

@graciegoheen there are commas missing after the import CTEs in my output:

Original Query

select customer_id,
        customer_name,
        coalesce(purchase_total,0) as purchase_total,
        coalesce(lr.revenue_lost, 0) as return_total,
        coalesce(purchase_total - return_total, 0) as lifetime_value,
        (coalesce(return_total, 0)/coalesce(purchase_total,0))*100 as return_pct,
        case when coalesce(purchase_total, 0) = 0 then 'red' when return_pct <= 25 then 'green' when return_pct <= 50 then 'yellow' when return_pct <= 75 then 'orange' when return_pct <= 100 then 'red' end as customer_status
    from 
        (select c.c_custkey as customer_id,
            c.c_name as customer_name,
            round(sum(customer_cost), 2) as purchase_total
        from {{ source('TPCH_SF1', 'CUSTOMER') }} c
            left join {{ ref('order_line_items') }} oli
                on c.c_custkey = oli.customer_id
            where item_status != 'R'
            group by c_custkey, c_name
        order by c_custkey) co
        left join 
            {{ ref('lost_revenue') }} lr
                on co.customer_id = lr.c_custkey

Output

with lost_revenue as (

    select * from {{ ref('lost_revenue') }}
  
)
order_line_items as (

    select * from {{ ref('order_line_items') }}
  
)
source_customer as (

    select * from {{ source('TPCH_SF1', 'CUSTOMER') }} 
    -- CAUTION: It's best practice to create staging layer for raw sources
  
)
select customer_id,
        customer_name,
        coalesce(purchase_total,0) as purchase_total,
        coalesce(lr.revenue_lost, 0) as return_total,
        coalesce(purchase_total - return_total, 0) as lifetime_value,
        (coalesce(return_total, 0)/coalesce(purchase_total,0))*100 as return_pct,
        case when coalesce(purchase_total, 0) = 0 then 'red' when return_pct <= 25 then 'green' when return_pct <= 50 then 'yellow' when return_pct <= 75 then 'orange' when return_pct <= 100 then 'red' end as customer_status
    from 
        (select c.c_custkey as customer_id,
            c.c_name as customer_name,
            round(sum(customer_cost), 2) as purchase_total
        from source_CUSTOMER c
            left join order_line_items oli
                on c.c_custkey = oli.customer_id
            where item_status != 'R'
            group by c_custkey, c_name
        order by c_custkey) co
        left join lost_revenue lr
                on co.customer_id = lr.c_custkey

dave-connors-3 · 2022-09-01T14:52:38Z

@graciegoheen -- I am also missing commas! I ran this against my coalesce demo project from last year

Input

-- if it's not abundantly clear this is a completely fabricated query on data that exists nowhere

select 
    c.id as claim_id,
    p.name_f || ' ' || p.name_l as pat_name,
    p.dob,
    d.name as doctor_name,
    d.npi,
    (select max(specialty) from development.dbt_dconnors_ehr_data.doc_specialties ds where ds.doc_id = d.id) as spec_name,
    prim_diag.icd_10_code as primary_diag,
    prim_diag.icd_10_code_descrip as primary_diag_desc,
    sec_diag.icd_10_code as sec_diag,
    sec_diag.icd_10_code_descrip as sec_diag_desc,
    c.ClaimNumber,
    ca.total_claim_amnt,
    ca.paid_amnt
from 
    development.dbt_dconnors_ehr_data.patients p
    join development.dbt_dconnors_billing_data.claims c
        on p.id = c.pat_id
    left join (
        select 
            cl.claim_id, 
            sum(chrgamnt) total_claim_amnt,
            sum(paid_amnt) paid_amnt
        from development.dbt_dconnors_billing_data.claim_line cl
        where status in ('billed', 'adjudicated', 'closed')
        group by 1
    ) ca 
        on c.id = ca.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development.dbt_dconnors_billing_data.claim_diagnosis
        where pos = 1
    ) as prim_diag
    on c.id = prim_diag.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development.dbt_dconnors_billing_data.claim_diagnosis
        where pos = 2
    ) as sec_diag
    on c.id = sec_diag.claim_id
    left join development.dbt_dconnors_ehr_data.doctors d
        on c.doc_id = d.id 

where c.test = false 
and p.email not like '%@test-patient.com'
and c.bill_attmps > 0

Output

with development_dbt_dconnors_billing_data_claim_diagnosis as (

    select * from development.dbt_dconnors_billing_data.claim_diagnosis
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_billing_data_claim_line as (

    select * from development.dbt_dconnors_billing_data.claim_line
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_billing_data_claims as (

    select * from development.dbt_dconnors_billing_data.claims
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_ehr_data_doc_specialties as (

    select * from development.dbt_dconnors_ehr_data.doc_specialties
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_ehr_data_doctors as (

    select * from development.dbt_dconnors_ehr_data.doctors
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_ehr_data_patients as (

    select * from development.dbt_dconnors_ehr_data.patients
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
-- if it's not abundantly clear this is a completely fabricated query on data that exists nowhere

select 
    c.id as claim_id,
    p.name_f || ' ' || p.name_l as pat_name,
    p.dob,
    d.name as doctor_name,
    d.npi,
    (select max(specialty) from development_dbt_dconnors_ehr_data_doc_specialties ds where ds.doc_id = d.id) as spec_name,
    prim_diag.icd_10_code as primary_diag,
    prim_diag.icd_10_code_descrip as primary_diag_desc,
    sec_diag.icd_10_code as sec_diag,
    sec_diag.icd_10_code_descrip as sec_diag_desc,
    c.ClaimNumber,
    ca.total_claim_amnt,
    ca.paid_amnt
from development_dbt_dconnors_ehr_data_patients p
    join development_dbt_dconnors_billing_data_claims c
        on p.id = c.pat_id
    left join (
        select 
            cl.claim_id, 
            sum(chrgamnt) total_claim_amnt,
            sum(paid_amnt) paid_amnt
        from development_dbt_dconnors_billing_data_claim_line cl
        where status in ('billed', 'adjudicated', 'closed')
        group by 1
    ) ca 
        on c.id = ca.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development_dbt_dconnors_billing_data_claim_diagnosis
        where pos = 1
    ) as prim_diag
    on c.id = prim_diag.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development_dbt_dconnors_billing_data_claim_diagnosis
        where pos = 2
    ) as sec_diag
    on c.id = sec_diag.claim_id
    left join development_dbt_dconnors_ehr_data_doctors d
        on c.doc_id = d.id 

where c.test = false 
and p.email not like '%@test-patient.com'
and c.bill_attmps > 0

christineberger · 2022-09-01T15:48:03Z

Leaving some additional comments here, but I'm not exactly sure if you wanted to account for these scenarios or not! I personally don't think they are worth holding up this PR for, so happy to approve after the missing commas after multiple ctes issue is fixed. This macro can be iterated on for these if we think they're worth it, and people should always be checking what's generated and tweak the output if they need to anyway.

Scenario 1 | Joins by Comma

Input

{{
    config(
        materialized='table'
    )
}}
SELECT MIN(ps_supplycost) AS min_supply_cost, PS_PARTKEY AS PARTKEY FROM 
    {{ source('TPCH_SF1', 'PARTSUPP') }},
    {{ source('TPCH_SF1', 'SUPPLIER') }},
    {{ source('TPCH_SF1', 'NATION') }},
    {{ source('TPCH_SF1', 'REGION') }}
WHERE s_suppkey = ps_suppkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'EUROPE'
GROUP BY PS_PARTKEY

Output

{{
    config(
        materialized='table'
    )
}}

with source_partsupp as (

    select * from {{ source('TPCH_SF1', 'PARTSUPP') }}
  
)
SELECT MIN(ps_supplycost) AS min_supply_cost, PS_PARTKEY AS PARTKEY FROM source_PARTSUPP,
    {{ source('TPCH_SF1', 'SUPPLIER') }},
    {{ source('TPCH_SF1', 'NATION') }},
    {{ source('TPCH_SF1', 'REGION') }}
WHERE s_suppkey = ps_suppkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'EUROPE'
GROUP BY PS_PARTKEY

Scenario 2 | Non-explicit table references

I do concede this is a bad practice and probably an edge case since it would require the person's target to have all objects that they could be referring to, but if someone is migrating a script this very well could be the case.

Input

with 

select 
    e.p_name            AS Part_Name,
    e.p_retailprice     AS RetailPrice,
    e.s_name            AS Supplier_Name,
    e.p_mfgr            AS Part_Manufacturer,
    e.s_address         AS SuppAddr,
    e.s_phone           AS Supp_Phone,
    ps.PS_AVAILQTY      AS Num_Available 

from {{ ref('EUR_LOWCOST_BRASS_SUPPLIERS') }} e
LEFT JOIN {{ source('TPCH_SF1', 'SUPPLIER') }} s on e.S_NAME = s.S_NAME
LEFT JOIN
    partsupp ps
on e.P_PARTKEY = ps.PS_PARTKEY
    and s.S_SUPPKEY = ps.PS_SUPPKEY

where n_name = 'UNITED KINGDOM'

Output

with eur_lowcost_brass_suppliers as (

    select * from {{ ref('EUR_LOWCOST_BRASS_SUPPLIERS') }}
  
),
source_supplier as (

    select * from {{ source('TPCH_SF1', 'SUPPLIER') }} 
    -- CAUTION: It's best practice to create staging layer for raw sources
  
),
select 
    e.p_name            AS Part_Name,
    e.p_retailprice     AS RetailPrice,
    e.s_name            AS Supplier_Name,
    e.p_mfgr            AS Part_Manufacturer,
    e.s_address         AS SuppAddr,
    e.s_phone           AS Supp_Phone,
    ps.PS_AVAILQTY      AS Num_Available 

from EUR_LOWCOST_BRASS_SUPPLIERS e
LEFT JOIN source_SUPPLIER s on e.S_NAME = s.S_NAME
LEFT JOIN
    partsupp ps
on e.P_PARTKEY = ps.PS_PARTKEY
    and s.S_SUPPKEY = ps.PS_SUPPKEY

where n_name = 'UNITED KINGDOM'

graciegoheen · 2022-09-09T18:12:06Z

@christineberger and @dave-connors-3 the missing comma issue should now be resolved!

Thank you Christine for identifying those outstanding issues. I think I'm okay with saying "Joins by Comma" is out of scope for this PR (but curious if other people disagree). "Non-explicit table references" would be tricky to capture, because the syntax for that is identical to selecting from a cte. How often does that come up? We could also call these out as two known issues (either in the README or actually opening up github issues) that won't currently be covered by this macro. That way, at least no one is surprised that it didn't work for their use case.

macros/generate_model_import_ctes.sql

README.md

cohms

Tested this out on BigQuery and it worked! Only change I think should be made is to the README, and be explicit in the instructions to call codegen.generate_model_import_ctes to prevent errors and to match other documentation.

Not worth changing but a heads up: the only other thing I noticed was that if you have some import CTEs but not all, it will create semi-duplicate CTEs at the top (I'd already created the dependencies import CTE)

BUT not something I think you should change or worry about. The use case is clearly for the code that doesn't have any import CTEs rather than some. And extra CTEs won't hurt even if they do already have some import CTEs.

cohms

Looks great!

b-per

I tried a few queries and it worked fine. Awesome work Grace, this will be super valuable for refactoring code!

My only comment would be to slightly change the code to generate SQL that follows our new conventions for CTEs, i.e.

cte1 as (
    ...
),

cte2 as (
    ...
),

with 1 blank row between CTEs

dave-connors-3

🦞

joellabes · 2022-10-04T04:39:45Z

@graciegoheen Are you ready for this to be merged and released?

I haven't done any sort of review of this, but I did have a peek at the readme. My one piece of feedback is that I assumed this was taking a string of code and reformatting it; turns out it takes a model and reformats raw_sql or raw_code. Curious as to why that was the approach (as opposed to audit_helper's pattern where you create your sql strings and pass them into the macro), but not against it at all!

However, I think it would be useful to clarify that in your README - as a foolish optimist who has never worked on proserv, I assume that models are already clean and follow dbt best practices, so I couldn't work out what import CTEs you were going to create.

Maybe you say "make a model with the janky code, and then point this macro at it" or something?

Or consider doing it as

{% set old_query %}
-- bad query here
{% endset %}

{{ codegen.generate_model_import_ctes(old_query) }}

graciegoheen · 2022-10-04T15:44:58Z

@joellabes Yes, I'm feeling ready!

Great feedback. Since all of the other macros in this package take either a source or model name as the input, I was following the same pattern for this one. If this belongs in audit_helper instead (or a new package entirely), passing in a string of code could make more sense. Even so, typically our pattern for migrating code is:

make a model in dbt
copy/paste the legacy code
refactor! refactor! refactor!

So, I was imagining the workflow where you've created your model in dbt, copy/pasted in your old code, then want to generate the same sql with references to tables pulled out into import CTEs.

To me, opening up a statement tab, calling this macro, compiling the code, and copy/pasting in the compiled output feels more intuitive (in the context of the other codegen macros) than nesting the legacy code in a set / endset block and then calling the macro. But could totally be the minority here!

joellabes · 2022-10-04T18:46:24Z

To me, opening up a statement tab, calling this macro, compiling the code, and copy/pasting in the compiled output feels more intuitive

Yeah I still thought this would happen in a scratchpad - I would have imagined this workflow:

Copy the legacy code out of whatever tool currently contains it
Paste it into a scratchpad file in dbt Cloud between set/endset tags
- The macro is called at the bottom of the scratchpad file
Compile it
Copy the resulting compiled code into a new dbt model
refactor! refactor! refactor!

By doing that, you're not pasting the same file over itself (in your current approach, you make the model containing the legacy code, then paste the compiled code over the top of the same file).

You're going to use it though so if you're happy with the current workflow then I defer to you 🙇

graciegoheen · 2022-10-05T14:18:51Z

@joellabes I'd like to stick with the current workflow - but I'll update the README to add some additional clarity!

joellabes

Let's rumble 🎉

Feature/generate_model_import_ctes

graciegoheen added 2 commits August 11, 2022 18:00

add generate_model_import_ctes macro and integration test

571b3ed

update readme

3a78210

graciegoheen requested a review from joellabes August 11, 2022 22:11

graciegoheen commented Aug 11, 2022

View reviewed changes

README.md Show resolved Hide resolved

graciegoheen and others added 2 commits August 11, 2022 18:12

Update README.md

538a46d

adjusted test example for postgres

e9c54f4

graciegoheen requested a review from b-per August 12, 2022 12:34

graciegoheen added 4 commits August 12, 2022 09:27

added support for config blocks

dd4dfc6

added support for comments

c07edba

added comma handling for sql with and without any CTEs

c563e23

Use unique instead of set to be compatible with more versions of dbt

8cbd058

graciegoheen requested a review from christineberger August 12, 2022 15:05

christineberger reviewed Aug 12, 2022

View reviewed changes

macros/generate_model_import_ctes.sql Show resolved Hide resolved

change CTE name to avoid duplicates

0963596

christineberger reviewed Aug 12, 2022

View reviewed changes

macros/generate_model_import_ctes.sql Show resolved Hide resolved

christineberger reviewed Aug 12, 2022

View reviewed changes

b-per reviewed Aug 12, 2022

View reviewed changes

macros/generate_model_import_ctes.sql Outdated Show resolved Hide resolved

graciegoheen and others added 2 commits August 12, 2022 17:11

fix for replacing without reliance on abc

45a844b

option to use raw_sql or raw_code depending on dbt version

df7c512

Co-authored-by: Benoit Perigaud <8754100+b-per@users.noreply.github.com>

graciegoheen added 2 commits August 12, 2022 17:24

commented out database example for integration tests

01dac05

Merge branch 'feature/generate_model_ctes' of https://github.com/dbt-…

66bddbb

…labs/dbt-codegen into feature/generate_model_ctes

graciegoheen added 2 commits September 9, 2022 13:54

Fixed missing commas issue

9b7a33e

comma fix

58ad058

graciegoheen requested review from b-per, christineberger and dave-connors-3 September 15, 2022 20:47

Added integration test to check for commas on sql without import ctes

ca67cea

cohms reviewed Sep 28, 2022

View reviewed changes

macros/generate_model_import_ctes.sql Show resolved Hide resolved

cohms reviewed Sep 28, 2022

View reviewed changes

README.md Outdated Show resolved Hide resolved

cohms requested changes Sep 28, 2022

View reviewed changes

update readme

f536083

graciegoheen requested a review from cohms September 30, 2022 14:36

cohms approved these changes Sep 30, 2022

View reviewed changes

b-per requested changes Sep 30, 2022

View reviewed changes

Added space between import CTEs

2ff64be

graciegoheen requested a review from b-per September 30, 2022 17:39

b-per approved these changes Oct 3, 2022

View reviewed changes

dave-connors-3 approved these changes Oct 3, 2022

View reviewed changes

update README

df54f84

joellabes approved these changes Oct 6, 2022

View reviewed changes

joellabes merged commit 2f13a51 into main Oct 6, 2022

joellabes deleted the feature/generate_model_ctes branch October 6, 2022 08:44

jeremyholtzman pushed a commit that referenced this pull request Apr 10, 2023

Merge pull request #74 from dbt-labs/feature/generate_model_ctes

0d7ce15

Feature/generate_model_import_ctes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/generate_model_import_ctes #74

Feature/generate_model_import_ctes #74

graciegoheen commented Aug 11, 2022

christineberger commented Aug 12, 2022 •

edited

Loading

graciegoheen commented Aug 12, 2022 •

edited

Loading

christineberger commented Aug 12, 2022 •

edited

Loading

christineberger commented Aug 12, 2022 •

edited

Loading

christineberger Aug 12, 2022 •

edited

Loading

christineberger Aug 12, 2022

graciegoheen Aug 12, 2022

graciegoheen Aug 12, 2022

jeremyholtzman commented Aug 12, 2022

graciegoheen commented Aug 12, 2022

dbeatty10 commented Aug 13, 2022

dbeatty10 commented Aug 13, 2022

b-per commented Aug 15, 2022

christineberger commented Aug 31, 2022 •

edited

Loading

dave-connors-3 commented Sep 1, 2022

christineberger commented Sep 1, 2022

graciegoheen commented Sep 9, 2022

cohms left a comment

cohms left a comment

b-per left a comment

dave-connors-3 left a comment

joellabes commented Oct 4, 2022

graciegoheen commented Oct 4, 2022

joellabes commented Oct 4, 2022

graciegoheen commented Oct 5, 2022

joellabes left a comment

Feature/generate_model_import_ctes #74

Feature/generate_model_import_ctes #74

Conversation

graciegoheen commented Aug 11, 2022

Description & motivation

Checklist

christineberger commented Aug 12, 2022 • edited Loading

graciegoheen commented Aug 12, 2022 • edited Loading

christineberger commented Aug 12, 2022 • edited Loading

christineberger commented Aug 12, 2022 • edited Loading

christineberger Aug 12, 2022 • edited Loading

Choose a reason for hiding this comment

christineberger Aug 12, 2022

Choose a reason for hiding this comment

graciegoheen Aug 12, 2022

Choose a reason for hiding this comment

graciegoheen Aug 12, 2022

Choose a reason for hiding this comment

jeremyholtzman commented Aug 12, 2022

graciegoheen commented Aug 12, 2022

dbeatty10 commented Aug 13, 2022

Example

Install

Run

dbeatty10 commented Aug 13, 2022

Crazy idea # 1

Crazy idea # 2

b-per commented Aug 15, 2022

christineberger commented Aug 31, 2022 • edited Loading

Original Query

Output

dave-connors-3 commented Sep 1, 2022

christineberger commented Sep 1, 2022

Scenario 1 | Joins by Comma

Scenario 2 | Non-explicit table references

graciegoheen commented Sep 9, 2022

cohms left a comment

Choose a reason for hiding this comment

cohms left a comment

Choose a reason for hiding this comment

b-per left a comment

Choose a reason for hiding this comment

dave-connors-3 left a comment

Choose a reason for hiding this comment

joellabes commented Oct 4, 2022

graciegoheen commented Oct 4, 2022

joellabes commented Oct 4, 2022

graciegoheen commented Oct 5, 2022

joellabes left a comment

Choose a reason for hiding this comment

christineberger commented Aug 12, 2022 •

edited

Loading

graciegoheen commented Aug 12, 2022 •

edited

Loading

christineberger commented Aug 12, 2022 •

edited

Loading

christineberger commented Aug 12, 2022 •

edited

Loading

christineberger Aug 12, 2022 •

edited

Loading

christineberger commented Aug 31, 2022 •

edited

Loading