
Cartesian Joins Create ALL Combinations, Not All Possible Combinations #9

Closed
sisu-callum opened this issue Feb 11, 2022 · 5 comments · Fixed by #31
Labels
performance Improvements to performance and speed

Comments

@sisu-callum
Contributor

While testing this package I noticed that the resulting table contains every combination of the values in the dimension list, as opposed to only the combinations that actually occur. This creates rows where the combination of values could never occur in the base dataset.

For example, the testing case I was going through contained two dimensions called user_name and organization_name. There's a relationship between those two fields: an organization can have multiple users, but a user is only going to be part of one organization.

The dataset produced by the macro has rows for a particular user in every potential organization.

In the actual metric calculation, that's not a big deal because the values provided are equal to 0, but it could be an issue when working with BI tools. Dropdowns/filters would contain combinations that aren't possible and could lead data consumers to the wrong conclusion about the relationship between those two fields.

Not sure if this is intended behavior but I figured I'd flag just in case!
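To make the distinction concrete, here is a minimal SQL sketch of the two behaviours (table and column names are assumed for illustration, following the example later in this thread, and are not taken from the package itself):

```sql
-- Assumed illustrative source table: source_query(person, organization)

-- What the macro currently builds: every person paired with every
-- organization, including pairings that never occur in the data.
select p.person, o.organization
from (select distinct person from source_query) as p
cross join (select distinct organization from source_query) as o;

-- What the reporter expected: only the person/organization pairings
-- that actually appear in the data.
select distinct person, organization
from source_query;
```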

@joellabes
Contributor

Hmm that's an interesting one! We've taken the cross-join approach to have a consistent spine across all dimensions, but you're right that your use case isn't compatible with that.

Can you share a small worked example of what you're trying to calculate, with how it's currently behaving and how you'd rather it work? I can imagine doing something like only date-spining a combination of columns if it already exists somewhere else in the dataset, but it's still pretty nebulous.

@sisu-callum
Contributor Author

Happy to! Right now, I'm working with a dataset where each row represents a user-session. A very simplistic version of the dataset would look like this:

| Person | Organization | Session Time | Session Date |
|--------|--------------|--------------|--------------|
| Joel   | dbt-labs     | 2            | 2022-02-13   |
| Callum | Sisu         | 4            | 2022-02-13   |

When I use the metric package, the result produced looks like:

| Date_Day   | Person | Organization | Metric_Name |
|------------|--------|--------------|-------------|
| 2022-02-13 | Joel   | dbt-labs     | 2           |
| 2022-02-13 | Joel   | Sisu         | 0           |
| 2022-02-13 | Callum | dbt-labs     | 0           |
| 2022-02-13 | Callum | Sisu         | 4           |

At small scale with low-cardinality columns, it isn't really a big deal, but with high-cardinality columns it expands out massively. So I think date-spining only the combinations of columns that already exist in the dataset would be my ideal solution. I also think this would massively reduce the compute cost of adding additional fields by lowering the size of the output: I experimented on 49 rows and 8 dimensions, and the resulting output was 81m rows (this is where I stumbled onto the exact user/organization example expressed above). Hope that helps clarify!
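A rough sketch of why the blow-up happens (column names assumed from the example above, not from the package): the cross-join spine's row count is the product of each dimension's distinct-value count, one factor per dimension, regardless of how many rows the source has, while the observed-combination count is bounded by the source row count.

```sql
-- Rough illustration, assuming a source_query with person and organization
-- columns. The cross-join count grows multiplicatively with each added
-- dimension; the observed count can never exceed the source row count.
select
    (select count(distinct person)       from source_query)
  * (select count(distinct organization) from source_query)
  -- * ... one additional factor per additional dimension
      as cross_join_combinations,
    (select count(*)
     from (select distinct person, organization from source_query) as observed)
      as observed_combinations
```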

@sisu-callum
Contributor Author

@joellabes I know this issue is more of a strategic question and less of a 'how should we do it', but I decided to test it out on my repo this afternoon and was able to make it work with the following changes to the spine__values and spine sections:

```sql
),
spine__values as (

    -- only spine the dimension combinations that actually occur in the source data
    select distinct
    {%- for dim in dimensions -%}
        {%- if metrics.is_dim_from_model(metric, dim) %}
            {{ dim }}
            {% if not loop.last %},{% endif %}
        {% endif -%}
    {%- endfor %}
    from source_query

),
spine as (

    select *
    from spine__time
    cross join spine__values

),
```
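For the two-dimension example above, that Jinja would render to roughly the following SQL (assuming both dimensions pass the is_dim_from_model check): distinct observed combinations, still cross-joined against the time spine so every combination gets a row per date.

```sql
),
spine__values as (

    select distinct
        person,
        organization
    from source_query

),
spine as (

    select *
    from spine__time
    cross join spine__values

),
```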

Now, I'll caveat all that follows by saying I only really tested it on a single metric with two dimensions. But the results that I saw were pretty significant.

  • The overall row count of the output table dropped from 400k to 11k.
  • The runtime of the model dropped from 14.3s to 2.91s (though Snowflake caching might have been helping out here).

@joellabes
Contributor

Yay! I haven’t forgotten you - I just need to get some more tests in before starting to change the macros' behaviours, and before I could do that I had to wrap up dbt-labs/dbt-core#4813 😅

This sounds extremely reasonable! Once I get my act together on testing (hoping to be next week) I'd happily merge a PR to that effect, assuming it still returned the same results.

@joellabes
Contributor

I thought about this more over the weekend. I would still like to support this use case, but I think both behaviours have a place, so we might have to (*shudder*) add an option!

@drewbanin what do you think? In Callum's case, a cross join of every dimension is nonsensical, but for low-cardinality use cases I can still see it being valuable.
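Purely as a sketch of what such an option might look like (this is hypothetical: neither the parameter name nor this exact call shape is from the package):

```sql
-- Hypothetical only: `all_combinations` is an imagined parameter, not an
-- existing dbt_metrics option. true  = full cross-join spine (fine for
-- low-cardinality dimensions); false = spine only the dimension
-- combinations observed in the source data.
select *
from {{ metrics.metric(
    metric_name='session_time',
    grain='day',
    dimensions=['person', 'organization'],
    all_combinations=false
) }}
```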

@callum-mcdata callum-mcdata added the performance Improvements to performance and speed label Jun 1, 2022
@callum-mcdata callum-mcdata self-assigned this Jun 1, 2022
@callum-mcdata callum-mcdata linked a pull request Jun 1, 2022 that will close this issue