
Cartesian Joins Create ALL Combinations, Not All Possible Combinations #9

Closed
sisu-callum opened this issue Feb 11, 2022 · 5 comments · Fixed by #31
Labels
performance Improvements to performance and speed

Comments

@sisu-callum
Contributor

While testing this package I noticed that the resulting table contains every combination of the values in the dimension list, as opposed to only the combinations that actually occur. This creates rows where the combination of values could never occur in the base dataset.

For example, the testing case I was going through contained two dimensions called user_name and organization_name. There's a relationship between those two fields: an organization can have multiple users, but a user is only going to be part of one organization.

The dataset produced by the macro has rows for a particular user in every potential organization.

In the actual metric calculation, that's not a big deal because the values provided are equal to 0, but it could be an issue when working with BI tools. Dropdowns/filters would contain combinations that aren't possible and could lead data consumers to the wrong conclusion about the relationship between those two fields.

Not sure if this is intended behavior but I figured I'd flag just in case!
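To make the distinction concrete, here is a minimal SQL sketch of the two behaviours (table and column names are assumed for illustration, following the example later in this thread, and are not taken from the package itself):

```sql
-- Assumed illustrative source table: source_query(person, organization)

-- What the macro currently builds: every person paired with every
-- organization, including pairings that never occur in the data.
select p.person, o.organization
from (select distinct person from source_query) as p
cross join (select distinct organization from source_query) as o;

-- What the reporter expected: only the person/organization pairings
-- that actually appear in the data.
select distinct person, organization
from source_query;
```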

@joellabes
Contributor

Hmm that's an interesting one! We've taken the cross-join approach to have a consistent spine across all dimensions, but you're right that your use case isn't compatible with that.

Can you share a small worked example of what you're trying to calculate, with how it's currently behaving and how you'd rather it work? I can imagine doing something like only date-spining a combination of columns if it already exists somewhere else in the dataset, but it's still pretty nebulous.

@sisu-callum
Contributor Author

Happy to! Right now, I'm working with a dataset where each row represents a user-session. A very simplistic version of the dataset would look like this:

| Person | Organization | Session Time | Session Date |
|--------|--------------|--------------|--------------|
| Joel   | dbt-labs     | 2            | 2022-02-13   |
| Callum | Sisu         | 4            | 2022-02-13   |

When I use the metric package, the result produced looks like:

| Date_Day   | Person | Organization | Metric_Name |
|------------|--------|--------------|-------------|
| 2022-02-13 | Joel   | dbt-labs     | 2           |
| 2022-02-13 | Joel   | Sisu         | 0           |
| 2022-02-13 | Callum | dbt-labs     | 0           |
| 2022-02-13 | Callum | Sisu         | 4           |

At small scale with low-cardinality columns, it isn't really a big deal, but with high-cardinality columns it expands out massively. So I think date-spining only the combinations of columns that already exist in the dataset would be my ideal solution. I also think this would massively reduce the compute cost of adding additional fields by lowering the size of the output: I experimented on 49 rows and 8 dimensions, and the resulting output was 81m rows (this is where I stumbled onto the exact user/organization example expressed above). Hope that helps clarify!
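A rough sketch of why the blow-up happens (column names assumed from the example above, not from the package): the cross-join spine's row count is the product of each dimension's distinct-value count, one factor per dimension, regardless of how many rows the source has, while the observed-combination count is bounded by the source row count.

```sql
-- Rough illustration, assuming a source_query with person and organization
-- columns. The cross-join count grows multiplicatively with each added
-- dimension; the observed count can never exceed the source row count.
select
    (select count(distinct person)       from source_query)
  * (select count(distinct organization) from source_query)
  -- * ... one additional factor per additional dimension
      as cross_join_combinations,
    (select count(*)
     from (select distinct person, organization from source_query) as observed)
      as observed_combinations
```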

@sisu-callum
Contributor Author

@joellabes I know this issue is more of a strategic question and less of a 'how should we do it', but I decided to test it out on my repo this afternoon and was able to make it work with the following changes to the spine__values and spine sections:

```sql
),
spine__values as (

    -- only spine the dimension combinations that actually occur in the source data
    select distinct
    {%- for dim in dimensions -%}
        {%- if metrics.is_dim_from_model(metric, dim) %}
            {{ dim }}
            {% if not loop.last %},{% endif %}
        {% endif -%}
    {%- endfor %}
    from source_query

),
spine as (

    select *
    from spine__time
    cross join spine__values

),
```
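For the two-dimension example above, that Jinja would render to roughly the following SQL (assuming both dimensions pass the is_dim_from_model check): distinct observed combinations, still cross-joined against the time spine so every combination gets a row per date.

```sql
),
spine__values as (

    select distinct
        person,
        organization
    from source_query

),
spine as (

    select *
    from spine__time
    cross join spine__values

),
```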

Now, I'll caveat all that follows by saying I only really tested it on a single metric with two dimensions. But the results that I saw were pretty significant.

  • The overall row count of the output table dropped from 400k to 11k.
  • The runtime of the model dropped from 14.3s to 2.91s (though Snowflake caching might have been helping out here).

@joellabes
Contributor

Yay! I haven’t forgotten you - I just need to get some more tests in before starting to change the macros' behaviours, and before I could do that I had to wrap up dbt-labs/dbt-core#4813 😅

This sounds extremely reasonable! Once I get my act together on testing (hoping to be next week) I'd happily merge a PR to that effect, assuming it still returned the same results.

@joellabes
Contributor

I thought about this more over the weekend. I would still like to support this use case, but I think both behaviours have a place, so we might have to (*shudder*) add an option!

@drewbanin what do you think? In Callum's case, a cross join of every dimension is nonsensical, but for low-cardinality use cases I can still see it being valuable.
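Purely as a sketch of what such an option might look like (this is hypothetical: neither the parameter name nor this exact call shape is from the package):

```sql
-- Hypothetical only: `all_combinations` is an imagined parameter, not an
-- existing dbt_metrics option. true  = full cross-join spine (fine for
-- low-cardinality dimensions); false = spine only the dimension
-- combinations observed in the source data.
select *
from {{ metrics.metric(
    metric_name='session_time',
    grain='day',
    dimensions=['person', 'organization'],
    all_combinations=false
) }}
```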

@callum-mcdata callum-mcdata added the performance Improvements to performance and speed label Jun 1, 2022
@callum-mcdata callum-mcdata self-assigned this Jun 1, 2022
@callum-mcdata callum-mcdata linked a pull request Jun 1, 2022 that will close this issue