Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/generate_model_import_ctes #74

Merged
merged 21 commits into from
Oct 6, 2022
Merged

Conversation

graciegoheen
Copy link
Contributor

This is a:

  • bug fix PR with no breaking changes — please ensure the base branch is main
  • new functionality — please ensure the base branch is the latest dev/ branch
  • a breaking change — please ensure the base branch is the latest dev/ branch

Description & motivation

I created a macro to generate the SQL for a given model with all references pulled up into import CTEs, which you can then paste back into the model.

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@graciegoheen graciegoheen requested a review from joellabes August 11, 2022 22:11
README.md Show resolved Hide resolved
@graciegoheen graciegoheen requested a review from b-per August 12, 2022 12:34
@christineberger
Copy link

christineberger commented Aug 12, 2022

This is an awesome idea! I'm going to document my findings in this PR as I'm testing.

I wanted to test two things:

  • What does the macro do if there are more than one explicit reference to a table (for example, using orders outside of a subquery and using the same table inside of a subquery)
  • What does the macro do with explicit references when they are defined in different ways? For example, you could have an explicit reference like development.tpch_sf1.orders or tpch_sf1.orders.

I ran into my first issue, which is that the macro doesn't get the correct references and output the right import CTEs when the explicit reference has the database name attached.

Here is my model code:

WITH

-- Only orders which tie back to a real customer record
ro as (
    select
        o.o_orderkey,
        c.c_custkey
    from raw_tpch.tpch_sf1.orders o
    inner join raw_tpch.tpch_sf1.customer c
        on o.o_custkey = c.c_custkey
),

rl as (
    SELECT c.c_custkey
    ,c.c_name
    ,sum(revenue) AS revenue_lost
    ,c_acctbal
    ,n.n_name 
    ,c_address 
    ,c_phone 
    ,c_comment
    FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
        FROM raw_tpch.tpch_sf1.lineitem l LEFT JOIN raw_tpch.tpch_sf1.orders o ON l.l_orderkey = o.o_orderkey
        WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
    ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
    LEFT JOIN raw_tpch.tpch_sf1.customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch.tpch_sf1.nation n ON c.c_nationkey = n.n_nationkey
    WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)

select * from rl

and here is the compiled code from the macro:

with tpch_sf1 as (
    select * from raw_tpch.tpch_sf1
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
WITH

-- Only orders which tie back to a real customer record
ro as (
    select
        o.o_orderkey,
        c.c_custkey
    from tpch_sf1.orders o
    inner join tpch_sf1.customer c
        on o.o_custkey = c.c_custkey
),

rl as (
    SELECT c.c_custkey
    ,c.c_name
    ,sum(revenue) AS revenue_lost
    ,c_acctbal
    ,n.n_name 
    ,c_address 
    ,c_phone 
    ,c_comment
    FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
        FROM tpch_sf1.lineitem l LEFT JOIN tpch_sf1.orders o ON l.l_orderkey = o.o_orderkey
        WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
    ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
    LEFT JOIN tpch_sf1.customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN tpch_sf1.nation n ON c.c_nationkey = n.n_nationkey
    WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)

select * from rl

Update
I managed to get some correct-ish output adding these lines (and I'm sure you can do this more concisely, but just wanted to get it to work):

  1. An addition to the matches: 'from_table_3':'(?i)(from|join)\s+([\["]?\w+[]\"]?\.)([\["]?\w+[]\"]?\.)([\["]?\w+[]\"]?)',
  2. An addition to the CTE naming logic:
                {%- if regex_name == 'from_table_3' -%}
                {%- set cte_name = match[3]|replace("'","")|trim|lower -%}
                {%- else -%}
                {%- set cte_name = match[2]|replace("'","")|trim|lower -%}
                {%- endif -%}
    
  3. An addition to the SQL generation: {%- elif from_obj[2] == 'from_table_1' or from_obj[2] == 'from_table_2' or from_obj[2] == 'from_table_3' %}

Output:

with raw_tpch_tpch_sf1_customer as (
    select * from raw_tpch.tpch_sf1.customer
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
raw_tpch_tpch_sf1_lineitem as (
    select * from raw_tpch.tpch_sf1.lineitem
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
raw_tpch_tpch_sf1_nation as (
    select * from raw_tpch.tpch_sf1.nation
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
raw_tpch_tpch_sf1_orders as (
    select * from raw_tpch.tpch_sf1.orders
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
tpch_sf1 as (
    select * from raw_tpch.tpch_sf1
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
WITH

-- Only orders which tie back to a real customer record
ro as (
    select
        o.o_orderkey,
        c.c_custkey
    from raw_tpch_tpch_sf1_orders o
    inner join raw_tpch_tpch_sf1_customer c
        on o.o_custkey = c.c_custkey
),

rl as (
    SELECT c.c_custkey
    ,c.c_name
    ,sum(revenue) AS revenue_lost
    ,c_acctbal
    ,n.n_name 
    ,c_address 
    ,c_phone 
    ,c_comment
    FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
        FROM raw_tpch_tpch_sf1_lineitem l LEFT JOIN raw_tpch_tpch_sf1_orders o ON l.l_orderkey = o.o_orderkey
        WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
    ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
    LEFT JOIN raw_tpch_tpch_sf1_customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch_tpch_sf1_nation n ON c.c_nationkey = n.n_nationkey
    WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)

select * from rl

@graciegoheen
Copy link
Contributor Author

graciegoheen commented Aug 12, 2022

@christineberger Can I ask what you think should be done in this situation? Can we make the assumption that development.tpch_sf1.orders and tpch_sf1.orders are the same? Or is it just a matter of giving them unique names? I'm imaging a scenario in which there is development.tpch_sf1.orders, tpch_sf1.orders, and {{ ref('orders') }} - should these be treated as separate import CTEs and if so, what should the CTE naming convention be?

@christineberger
Copy link

christineberger commented Aug 12, 2022

@graciegoheen That's a really good question that I'm not sure there's a straight answer to, but we could always just make a v1 decision - I think in order to avoid conflicts, maybe we just treat them as separate import CTEs and the user will need to resolve that themselves.

(I think) we can assume that if there's an explicit reference without a database name / schema name, it's referring to the target location. If not, then we know the explicit location - so we could name a references like this:

  • raw_data.salesforce.orders -> raw_data_salesforce_orders
  • salesforce.orders -> <target_database>_salesforce_orders
  • orders -> <target_database>_<target_schema>_orders

I haven't thought this all the way through, but what I'm trying to say is yes to separate import CTES!

@christineberger
Copy link

christineberger commented Aug 12, 2022

Also, I'm just documenting what is working and what's not working, and those can either be limitations of the v1 or fixed in this PR - There's absolutely no issue in getting it working "for the most part" as long as we have the limitations documented!

{%- set from_regexes = {
'from_ref':'(?i)(from|join)\s+({{\s*ref\s*\()([^)]+)(\)\s*}})',
'from_source':'(?i)(from|join)\s+({{\s*source\s*\([^)]+,)([^)]+)(\)\s*}})',
'from_table_1':'(?i)(from|join)\s+([\[`\"]?\w+[\]`\"]?)\.([\[`\"]?\w+[\]`\"]?)',
Copy link

@christineberger christineberger Aug 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may or may not be an issue depending on how this is restructured to capture 3-part vs. 2-part vs 1-part explicit references, but in the output while accounting for a 3-part reference, I'm seeing an additional cte for raw_tpch.sf_1 with no table. This is because one of the regex matches are capturing the pattern for a 2-part reference. To fix this temporarily I added a $ to the end of from_table_1 in order to tell it "just match this and nothing after it", but there are probably many ways you could be more concise or offer more flexibility without limiting the regex match in that way. Something to consider as part of the explicit references piece!

Output:

with

raw_tpch_tpch_sf1_customer as (
    select * from raw_tpch.tpch_sf1.customer
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),

raw_tpch_tpch_sf1_lineitem as (
    select * from raw_tpch.tpch_sf1.lineitem
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),

raw_tpch_tpch_sf1_nation as (
    select * from raw_tpch.tpch_sf1.nation
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),

raw_tpch_tpch_sf1_orders as (
    select * from raw_tpch.tpch_sf1.orders
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),

-- Only orders which tie back to a real customer record
ro as (
    select
        o.o_orderkey,
        c.c_custkey
    from raw_tpch_tpch_sf1_orders o
    inner join raw_tpch_tpch_sf1_customer c
        on o.o_custkey = c.c_custkey
),

rl as (
    SELECT c.c_custkey
    ,c.c_name
    ,sum(revenue) AS revenue_lost
    ,c_acctbal
    ,n.n_name 
    ,c_address 
    ,c_phone 
    ,c_comment
    FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
        FROM raw_tpch_tpch_sf1_lineitem l LEFT JOIN raw_tpch_tpch_sf1_orders o ON l.l_orderkey = o.o_orderkey
        WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
    ) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
    LEFT JOIN raw_tpch_tpch_sf1_customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch_tpch_sf1_nation n ON c.c_nationkey = n.n_nationkey
    WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)

select * from rl

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if in another version we can work on pulling out subqueries into CTEs 🤔 If they have any joins anywhere after "from", we can assume that they have aliased the subquery with or without using AS. We can take the alias and create a CTE for it using the subquery text. If there's no joins within the entire from statement, then we could just temporarily name them. I imagine this will warrant an entire macro to itself!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a re-work on the replacing logic to be more precise - using regex sub instead of replace. Let me know if you still have this issue!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also love the subquery idea - that would be a really cool addition!

@jeremyholtzman
Copy link
Contributor

@graciegoheen I don't think this has been mentioned yet, but I tested it out with the formatting of joins below, and it only caught the first table.

-- weird_joins.sql
select *
from 
    {{ ref('stg_tpch_customers') }},
    {{ ref('stg_tpch_nations') }}

where stg_tpch_customers.nation_id = stg_tpch_nations.nation_id 

Results:

{{ generate_model_import_ctes('weird_joins') }}
with stg_tpch_customers as (
    select * from {{ ref('stg_tpch_customers') }}
  
),
select *
from 
    stg_tpch_customers,
    {{ ref('stg_tpch_nations') }}

where stg_tpch_customers.nation_id = stg_tpch_nations.nation_id

graciegoheen and others added 2 commits August 12, 2022 17:11
Co-authored-by: Benoit Perigaud <8754100+b-per@users.noreply.github.com>
@graciegoheen
Copy link
Contributor Author

@jeremyholtzman TBH I didn't even know you could do joins like that - need to think of a way to actually capture that with regex...

@dbeatty10
Copy link
Contributor

@graciegoheen an idea for you to handle cases like the one that @jeremyholtzman surfaced:

  • import a SQL parser to do the dirty work for you

Here's a non-exhaustive list of parsers (your mileage will vary):

Example

SQLGlot is pretty cool -- it has a parser, transpiler, and optimizer, and (at the time of this writing) it has some level of dialect support for bigquery, clickhouse, duckdb, hive, mysql, oracle, postgres, presto, snowflake, spark, sqlite, starrocks, tableau (?!), and trino. (Kinda off-topic, but transpiling means it can read from one dialect and write out to another.)

Install

If you are using zsh or bash as your shell, you can copy-paste the commands below. Otherwise, follow the shell-specific instructions here.

python3 -m venv venv
source venv/bin/activate 
python3 -m pip install sqlglot

Run

Launch Python in interactive mode:

python

Create a SQL query using the Postgres dialect:

from sqlglot import parse_one, exp

# find all tables (x, y, z)
for table in parse_one("SELECT * FROM x JOIN y JOIN z", read="postgres").find_all(exp.Table):
    print(table.name)

This and other examples in their documentation.

@dbeatty10
Copy link
Contributor

Crazy idea # 1

  • look at the optimizer portion of SQLGlot for inspiration how to re-write a SQL query (using the parsed AST) with all references pulled up into import CTEs

Crazy idea # 2

(Which is not actually related to this pull request!)

This dialect could replace database-specific keywords, functions, etc with their cross-database macro equivalents.

Idea # 2 would be most relevant for code that needs/wants to be cross-database compatible, like packages on the dbt Hub.

@b-per
Copy link
Contributor

b-per commented Aug 15, 2022

I personally feel that we might tweak the regexes to make them work in most of the cases (even if they won't work 100% of the time) but avoid using third-party libraries. (I'd be keen to have as many tests as possible on the regexes though because backward compatibility can be difficult to identify with regexes whenever we do a change)

  1. Without third-party libraries, we can run the macro in dbt Cloud (even though we may not catch 100% of the use cases)
  2. I think that SQLFluff already handles getting it 100% correct (see Rule 42 that supports the fix command) if someone can use Python locally

@christineberger
Copy link

christineberger commented Aug 31, 2022

@graciegoheen there are commas missing after the import CTEs in my output:

Original Query

select customer_id,
        customer_name,
        coalesce(purchase_total,0) as purchase_total,
        coalesce(lr.revenue_lost, 0) as return_total,
        coalesce(purchase_total - return_total, 0) as lifetime_value,
        (coalesce(return_total, 0)/coalesce(purchase_total,0))*100 as return_pct,
        case when coalesce(purchase_total, 0) = 0 then 'red' when return_pct <= 25 then 'green' when return_pct <= 50 then 'yellow' when return_pct <= 75 then 'orange' when return_pct <= 100 then 'red' end as customer_status
    from 
        (select c.c_custkey as customer_id,
            c.c_name as customer_name,
            round(sum(customer_cost), 2) as purchase_total
        from {{ source('TPCH_SF1', 'CUSTOMER') }} c
            left join {{ ref('order_line_items') }} oli
                on c.c_custkey = oli.customer_id
            where item_status != 'R'
            group by c_custkey, c_name
        order by c_custkey) co
        left join 
            {{ ref('lost_revenue') }} lr
                on co.customer_id = lr.c_custkey

Output

with lost_revenue as (

    select * from {{ ref('lost_revenue') }}
  
)
order_line_items as (

    select * from {{ ref('order_line_items') }}
  
)
source_customer as (

    select * from {{ source('TPCH_SF1', 'CUSTOMER') }} 
    -- CAUTION: It's best practice to create staging layer for raw sources
  
)
select customer_id,
        customer_name,
        coalesce(purchase_total,0) as purchase_total,
        coalesce(lr.revenue_lost, 0) as return_total,
        coalesce(purchase_total - return_total, 0) as lifetime_value,
        (coalesce(return_total, 0)/coalesce(purchase_total,0))*100 as return_pct,
        case when coalesce(purchase_total, 0) = 0 then 'red' when return_pct <= 25 then 'green' when return_pct <= 50 then 'yellow' when return_pct <= 75 then 'orange' when return_pct <= 100 then 'red' end as customer_status
    from 
        (select c.c_custkey as customer_id,
            c.c_name as customer_name,
            round(sum(customer_cost), 2) as purchase_total
        from source_CUSTOMER c
            left join order_line_items oli
                on c.c_custkey = oli.customer_id
            where item_status != 'R'
            group by c_custkey, c_name
        order by c_custkey) co
        left join lost_revenue lr
                on co.customer_id = lr.c_custkey

@dave-connors-3
Copy link
Contributor

@graciegoheen -- I am also missing commas! I ran this against my coalesce demo project from last year

Input
-- if it's not abundantly clear this is a completely fabricated query on data that exists nowhere

select 
    c.id as claim_id,
    p.name_f || ' ' || p.name_l as pat_name,
    p.dob,
    d.name as doctor_name,
    d.npi,
    (select max(specialty) from development.dbt_dconnors_ehr_data.doc_specialties ds where ds.doc_id = d.id) as spec_name,
    prim_diag.icd_10_code as primary_diag,
    prim_diag.icd_10_code_descrip as primary_diag_desc,
    sec_diag.icd_10_code as sec_diag,
    sec_diag.icd_10_code_descrip as sec_diag_desc,
    c.ClaimNumber,
    ca.total_claim_amnt,
    ca.paid_amnt
from 
    development.dbt_dconnors_ehr_data.patients p
    join development.dbt_dconnors_billing_data.claims c
        on p.id = c.pat_id
    left join (
        select 
            cl.claim_id, 
            sum(chrgamnt) total_claim_amnt,
            sum(paid_amnt) paid_amnt
        from development.dbt_dconnors_billing_data.claim_line cl
        where status in ('billed', 'adjudicated', 'closed')
        group by 1
    ) ca 
        on c.id = ca.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development.dbt_dconnors_billing_data.claim_diagnosis
        where pos = 1
    ) as prim_diag
    on c.id = prim_diag.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development.dbt_dconnors_billing_data.claim_diagnosis
        where pos = 2
    ) as sec_diag
    on c.id = sec_diag.claim_id
    left join development.dbt_dconnors_ehr_data.doctors d
        on c.doc_id = d.id 

where c.test = false 
and p.email not like '%@test-patient.com'
and c.bill_attmps > 0
Output
with development_dbt_dconnors_billing_data_claim_diagnosis as (

    select * from development.dbt_dconnors_billing_data.claim_diagnosis
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_billing_data_claim_line as (

    select * from development.dbt_dconnors_billing_data.claim_line
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_billing_data_claims as (

    select * from development.dbt_dconnors_billing_data.claims
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_ehr_data_doc_specialties as (

    select * from development.dbt_dconnors_ehr_data.doc_specialties
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_ehr_data_doctors as (

    select * from development.dbt_dconnors_ehr_data.doctors
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
development_dbt_dconnors_ehr_data_patients as (

    select * from development.dbt_dconnors_ehr_data.patients
    -- CAUTION: It's best practice to use the ref or source function instead of a direct reference
  
)
-- if it's not abundantly clear this is a completely fabricated query on data that exists nowhere

select 
    c.id as claim_id,
    p.name_f || ' ' || p.name_l as pat_name,
    p.dob,
    d.name as doctor_name,
    d.npi,
    (select max(specialty) from development_dbt_dconnors_ehr_data_doc_specialties ds where ds.doc_id = d.id) as spec_name,
    prim_diag.icd_10_code as primary_diag,
    prim_diag.icd_10_code_descrip as primary_diag_desc,
    sec_diag.icd_10_code as sec_diag,
    sec_diag.icd_10_code_descrip as sec_diag_desc,
    c.ClaimNumber,
    ca.total_claim_amnt,
    ca.paid_amnt
from development_dbt_dconnors_ehr_data_patients p
    join development_dbt_dconnors_billing_data_claims c
        on p.id = c.pat_id
    left join (
        select 
            cl.claim_id, 
            sum(chrgamnt) total_claim_amnt,
            sum(paid_amnt) paid_amnt
        from development_dbt_dconnors_billing_data_claim_line cl
        where status in ('billed', 'adjudicated', 'closed')
        group by 1
    ) ca 
        on c.id = ca.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development_dbt_dconnors_billing_data_claim_diagnosis
        where pos = 1
    ) as prim_diag
    on c.id = prim_diag.claim_id
    left join (
        select 
            claim_id,
            icd10 as icd_10_code,
            icd10desc as icd_10_code_descrip,
            split_part(icd10, '.', 1) as parent_icd10
        from development_dbt_dconnors_billing_data_claim_diagnosis
        where pos = 2
    ) as sec_diag
    on c.id = sec_diag.claim_id
    left join development_dbt_dconnors_ehr_data_doctors d
        on c.doc_id = d.id 

where c.test = false 
and p.email not like '%@test-patient.com'
and c.bill_attmps > 0

@christineberger
Copy link

Leaving some additional comments here, but I'm not exactly sure if you wanted to account for these scenarios or not! I personally don't think they are worth holding up this PR for, so happy to approve after the missing commas after multiple ctes issue is fixed. This macro can be iterated on for these if we think they're worth it, and people should always be checking what's generated and tweak the output if they need to anyway.

Scenario 1 | Joins by Comma

Input
{{
    config(
        materialized='table'
    )
}}
SELECT MIN(ps_supplycost) AS min_supply_cost, PS_PARTKEY AS PARTKEY FROM 
    {{ source('TPCH_SF1', 'PARTSUPP') }},
    {{ source('TPCH_SF1', 'SUPPLIER') }},
    {{ source('TPCH_SF1', 'NATION') }},
    {{ source('TPCH_SF1', 'REGION') }}
WHERE s_suppkey = ps_suppkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'EUROPE'
GROUP BY PS_PARTKEY
Output
{{
    config(
        materialized='table'
    )
}}

with source_partsupp as (

    select * from {{ source('TPCH_SF1', 'PARTSUPP') }}
  
)
SELECT MIN(ps_supplycost) AS min_supply_cost, PS_PARTKEY AS PARTKEY FROM source_PARTSUPP,
    {{ source('TPCH_SF1', 'SUPPLIER') }},
    {{ source('TPCH_SF1', 'NATION') }},
    {{ source('TPCH_SF1', 'REGION') }}
WHERE s_suppkey = ps_suppkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'EUROPE'
GROUP BY PS_PARTKEY

Scenario 2 | Non-explicit table references

I do concede this is a bad practice and probably an edge case since it would require the person's target to have all objects that they could be referring to, but if someone is migrating a script this very well could be the case.

Input
with 

select 
    e.p_name            AS Part_Name,
    e.p_retailprice     AS RetailPrice,
    e.s_name            AS Supplier_Name,
    e.p_mfgr            AS Part_Manufacturer,
    e.s_address         AS SuppAddr,
    e.s_phone           AS Supp_Phone,
    ps.PS_AVAILQTY      AS Num_Available 

from {{ ref('EUR_LOWCOST_BRASS_SUPPLIERS') }} e
LEFT JOIN {{ source('TPCH_SF1', 'SUPPLIER') }} s on e.S_NAME = s.S_NAME
LEFT JOIN
    partsupp ps
on e.P_PARTKEY = ps.PS_PARTKEY
    and s.S_SUPPKEY = ps.PS_SUPPKEY

where n_name = 'UNITED KINGDOM'
Output
with eur_lowcost_brass_suppliers as (

    select * from {{ ref('EUR_LOWCOST_BRASS_SUPPLIERS') }}
  
),
source_supplier as (

    select * from {{ source('TPCH_SF1', 'SUPPLIER') }} 
    -- CAUTION: It's best practice to create staging layer for raw sources
  
),
select 
    e.p_name            AS Part_Name,
    e.p_retailprice     AS RetailPrice,
    e.s_name            AS Supplier_Name,
    e.p_mfgr            AS Part_Manufacturer,
    e.s_address         AS SuppAddr,
    e.s_phone           AS Supp_Phone,
    ps.PS_AVAILQTY      AS Num_Available 

from EUR_LOWCOST_BRASS_SUPPLIERS e
LEFT JOIN source_SUPPLIER s on e.S_NAME = s.S_NAME
LEFT JOIN
    partsupp ps
on e.P_PARTKEY = ps.PS_PARTKEY
    and s.S_SUPPKEY = ps.PS_SUPPKEY

where n_name = 'UNITED KINGDOM'

@graciegoheen
Copy link
Contributor Author

@christineberger and @dave-connors-3 the missing comma issue should now be resolved!

Thank you Christine for identifying those outstanding issues. I think I'm okay with saying "Joins by Comma" is out of scope for this PR (but curious if other people disagree). "Non-explicit table references" would be tricky to capture, because the syntax for that is identical to selecting from a cte. How often does that come up? We could also call these out as two known issues (either in the README or actually opening up github issues) that won't currently be covered by this macro. That way, at least no one is surprised that it didn't work for their use case.

README.md Outdated Show resolved Hide resolved
Copy link
Contributor

@cohms cohms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this out on BigQuery and it worked! Only change I think should be made is to the README, and be explicit in the instructions to call codegen.generate_model_import_ctes to prevent errors and to match other documentation.

Not worth changing but a heads up: the only other thing I noticed was that if you have some import CTEs but not all, it will create semi-duplicate CTEs at the top (I'd already created the dependencies import CTE)
image

BUT not something I think you should change or worry about. The use case is clearly for the code that doesn't have any import CTEs rather than some. And extra CTEs won't hurt even if they do already have some import CTEs.

@graciegoheen graciegoheen requested a review from cohms September 30, 2022 14:36
Copy link
Contributor

@cohms cohms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Contributor

@b-per b-per left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a few queries and it worked fine. Awesome work Grace, this will be super valuable for refactoring code!

My only comment would be to slightly change the code to generate SQL that follows our new conventions for CTEs, i.e.

cte1 as (
    ...
),

cte2 as (
    ...
),

with 1 blank row between CTEs

@graciegoheen graciegoheen requested a review from b-per September 30, 2022 17:39
Copy link
Contributor

@dave-connors-3 dave-connors-3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🦞

@joellabes
Copy link
Contributor

@graciegoheen Are you ready for this to be merged and released?

I haven't done any sort of review of this, but I did have a peek at the readme. My one piece of feedback is that I assumed this was taking a string of code and reformatting it; turns out it takes a model and reformats raw_sql or raw_code. Curious as to why that was the approach (as opposed to audit_helper's pattern where you create your sql strings and pass them into the macro), but not against it at all!

However, I think it would be useful to clarify that in your README - as a foolish optimist who has never worked on proserv, I assume that models are already clean and follow dbt best practices, so I couldn't work out what import CTEs you were going to create.

Maybe you say "make a model with the janky code, and then point this macro at it" or something?

Or consider doing it as

{% set old_query %}
-- bad query here
{% endset %}

{{ codegen.generate_model_import_ctes(old_query) }}

@graciegoheen
Copy link
Contributor Author

@joellabes Yes, I'm feeling ready!

Great feedback. Since all of the other macros in this package take either a source or model name as the input, I was following the same pattern for this one. If this belongs in audit_helper instead (or a new package entirely), passing in a string of code could make more sense. Even so, typically our pattern for migrating code is:

  • make a model in dbt
  • copy/paste the legacy code
  • refactor! refactor! refactor!

So, I was imagining the workflow where you've created your model in dbt, copy/pasted in your old code, then want to generate the same sql with references to tables pulled out into import CTEs.

To me, opening up a statement tab, calling this macro, compiling the code, and copy/pasting in the compiled output feels more intuitive (in the context of the other codegen macros) than nesting the legacy code in a set / endset block and then calling the macro. But could totally be the minority here!

@joellabes
Copy link
Contributor

To me, opening up a statement tab, calling this macro, compiling the code, and copy/pasting in the compiled output feels more intuitive

Yeah I still thought this would happen in a scratchpad - I would have imagined this workflow:

  • Copy the legacy code out of whatever tool currently contains it
  • Paste it into a scratchpad file in dbt Cloud between set/endset tags
    • The macro is called at the bottom of the scratchpad file
  • Compile it
  • Copy the resulting compiled code into a new dbt model
  • refactor! refactor! refactor!

By doing that, you're not pasting the same file over itself (in your current approach, you make the model containing the legacy code, then paste the compiled code over the top of the same file).

You're going to use it though so if you're happy with the current workflow then I defer to you 🙇

@graciegoheen
Copy link
Contributor Author

@joellabes I'd like to stick with the current workflow - but I'll update the README to add some additional clarity!

Copy link
Contributor

@joellabes joellabes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rumble 🎉

@joellabes joellabes merged commit 2f13a51 into main Oct 6, 2022
@joellabes joellabes deleted the feature/generate_model_ctes branch October 6, 2022 08:44
jeremyholtzman pushed a commit that referenced this pull request Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants