Cartesian Joins Create ALL Combinations, Not All Possible Combinations #9
Comments
Hmm that's an interesting one! We've taken the cross-join approach to have a consistent spine across all dimensions, but you're right that your use case isn't compatible with that. Can you share a small worked example of what you're trying to calculate, with how it's currently behaving and how you'd rather it work? I can imagine doing something like only date spining a combination of columns if they already exist somewhere else in the dataset, but it's still pretty nebulous.
Happy to! Right now, I'm working with a dataset where each row represents a user-session. A very simplistic version of the dataset would look like this:
When I use the metric package, the result produced looks like:
At small scale with low-cardinality columns, it isn't really that big of a deal. But with high-cardinality columns it expands massively. So I think date spining only the combinations of columns that already exist in the dataset would be my ideal solution. I also think this would massively reduce the compute cost of adding additional fields by lowering the size of the output. I say this because I experimented on 49 rows and 8 dimensions, and the resulting output was 81 million rows (this is where I stumbled onto the exact user/organization example described above). Hope that helps clarify!
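To make the size difference concrete, here is a minimal Python sketch of the two spine strategies discussed above. It is not the package's implementation (which is SQL/Jinja); the session rows, names, and the `date_spine` helper are all hypothetical and only illustrate the idea of spining observed combinations instead of the full cross join.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical user-session rows: (session_date, user_name, organization_name).
sessions = [
    (date(2022, 1, 1), "alice", "acme"),
    (date(2022, 1, 1), "bob", "acme"),
    (date(2022, 1, 2), "carol", "globex"),
    (date(2022, 1, 3), "alice", "acme"),
]


def date_spine(start, end):
    """Return every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


days = date_spine(min(r[0] for r in sessions), max(r[0] for r in sessions))
users = sorted({r[1] for r in sessions})
orgs = sorted({r[2] for r in sessions})

# Cross-join spine: every date x every user x every organization.
cross_join_spine = list(product(days, users, orgs))

# Alternative spine: every date x only the (user, organization) pairs
# that actually appear together in the base data.
observed_pairs = sorted({(r[1], r[2]) for r in sessions})
observed_spine = [(d, u, o) for d, (u, o) in product(days, observed_pairs)]

print(len(cross_join_spine), len(observed_spine))
```

With two dimensions the gap is modest, but the cross-join spine grows with the product of each dimension's distinct values, while the observed spine grows only with the number of distinct combinations present in the data.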
@joellabes I know this issue is more of a strategic question and less of a 'how should we do it' but I decided to test it out on my repo this afternoon and was able to make it work with the following changes to the
Now, I'll caveat all that follows by saying I only really tested it on a single metric with two dimensions. But the results that I saw were pretty significant.
Yay! I haven’t forgotten you - I just need to get some more tests in before starting to change the macros' behaviours, and before I could do that I had to wrap up dbt-labs/dbt-core#4813 😅 This sounds extremely reasonable! Once I get my act together on testing (hoping to be next week) I'd happily merge a PR to that effect, assuming it still returned the same results.
I thought about this more over the weekend, and I would still like to support this use case, but I think both have a place so we might have to (shudder) add an option! @drewbanin what do you think? In Callum's case, a cross join of every dimension is nonsensical, but for low-cardinality use cases I can still see it being valuable.
While testing this package I noticed that the resulting table contains every combination of the values in the dimension list, as opposed to only the combinations that are actually possible. This creates rows where the combination of values could never occur in the base dataset.
For example, the testing case I was going through contained two dimensions called user_name and organization_name. There's a relationship between those 2 fields in that an organization can have multiple users but a user is only going to be part of 1 organization.
The dataset produced by the macro has rows for a particular user in every potential organization.
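As a rough illustration of which rows are affected, the Python sketch below lists the pairs a cross join produces that the base data can never contain. The users and organizations are made up for the example, and the mapping simply encodes the stated rule that each user belongs to exactly one organization.

```python
from itertools import product

# Hypothetical membership: each user belongs to exactly one organization.
user_org = {"alice": "acme", "bob": "acme", "carol": "globex"}

# Pairs that really occur in the base data.
observed = set(user_org.items())

# Pairs a cross join of the two dimensions would generate.
cross_join = set(product(user_org.keys(), set(user_org.values())))

# Pairs that appear in the cross join but can never occur in the base data.
impossible = sorted(cross_join - observed)
for user_name, organization_name in impossible:
    print(user_name, organization_name)
```

Here the cross join yields six pairs, three of which place a user in an organization they do not belong to.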
In the actual metric calculation, that's not that big of a deal because the values for those rows are equal to 0, but it could be a bit of an issue when working with BI tools. Dropdowns and filters would contain combinations that aren't possible and could give data consumers the wrong impression of the relationship between those two fields.
Not sure if this is intended behavior but I figured I'd flag just in case!