-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: cross join in merge() and join() #5401
Comments
Why did you close this? |
Oops my bad it's totally by accident, hit the button somehow :( |
related #3835 , there is a method: how is this different from an outer join? (which does this and just drops the na groups) |
@jreback could you provide an example how to do outer join if there's no column on which to join? May be it's possible to do with apply somehow? |
A bit hacky
|
@jreback yes, good one! But what if there're more than one column in either dataframe, or in both? |
not sure....can you create input and output frames? |
here's sql fiddle example of cross join |
FWIW, this is something I've often wanted too, and I think is fundamental enough an operation to deserve a better way. |
I use the same method as @RomanPekar regularly too (cartesian products are handy for basket analysis as they create a normalized version of a scarce matrix). Main thing I guess is that you could improve performance inside Pandas if you know you don't have to look at an index to reference back and forth but instead can multiply the dataframe for any possible combination? in regular python I would expect something like this:
When it comes notation I myself would like a "how='cartesian'" where you do not specify left_on and right_on... |
Is there any reason not to implement this? The only thing I'm not sure of is what we expect from
do we want? |
+1, I just hit this issue too while trying to represent factors of Bayesian networks as dataframes. When factors have no common variables, their products is a Cartesian product, i.e., cross join. My workaround was to add a dummy column and join on that, then remove it afterwards. I see other people do that as well. In this application it doesn't matter what happens when there are common columns since I need to treat one of them specially anyway, but I guess adding suffixes would be a sane default; that would be like joining |
Because concat doesn't give you the Cartesian product:
|
It's worth noting that attempting a
|
We need to differentiate same name columns because the cartesian product can potentially generate different values pcloud=# select * from orders, customer2;
number | order_date | cust_id | salesperson_id | amount | cust_id | name | city | industry_type
--------+------------+---------+----------------+--------+---------+----------+----------+---------------
10 | 1996-08-02 | 4 | 2 | 540 | 4 | Samsonic | pleasant | J
10 | 1996-08-02 | 4 | 2 | 540 | 6 | Panasung | oaktown | J
10 | 1996-08-02 | 4 | 2 | 540 | 7 | Samony | jackson | B
10 | 1996-08-02 | 4 | 2 | 540 | 9 | Orange | Jackson | B
20 | 1999-01-30 | 4 | 8 | 1800 | 4 | Samsonic | pleasant | J
20 | 1999-01-30 | 4 | 8 | 1800 | 6 | Panasung | oaktown | J
20 | 1999-01-30 | 4 | 8 | 1800 | 7 | Samony | jackson | B
20 | 1999-01-30 | 4 | 8 | 1800 | 9 | Orange | Jackson | B
30 | 1995-07-14 | 9 | 1 | 460 | 4 | Samsonic | pleasant | J
30 | 1995-07-14 | 9 | 1 | 460 | 6 | Panasung | oaktown | J
30 | 1995-07-14 | 9 | 1 | 460 | 7 | Samony | jackson | B
30 | 1995-07-14 | 9 | 1 | 460 | 9 | Orange | Jackson | B
40 | 1998-01-29 | 7 | 2 | 2400 | 4 | Samsonic | pleasant | J
40 | 1998-01-29 | 7 | 2 | 2400 | 6 | Panasung | oaktown | J
40 | 1998-01-29 | 7 | 2 | 2400 | 7 | Samony | jackson | B
40 | 1998-01-29 | 7 | 2 | 2400 | 9 | Orange | Jackson | B
50 | 1998-02-03 | 6 | 7 | 600 | 4 | Samsonic | pleasant | J
50 | 1998-02-03 | 6 | 7 | 600 | 6 | Panasung | oaktown | J
50 | 1998-02-03 | 6 | 7 | 600 | 7 | Samony | jackson | B
50 | 1998-02-03 | 6 | 7 | 600 | 9 | Orange | Jackson | B
60 | 1998-03-02 | 6 | 7 | 720 | 4 | Samsonic | pleasant | J
60 | 1998-03-02 | 6 | 7 | 720 | 6 | Panasung | oaktown | J
60 | 1998-03-02 | 6 | 7 | 720 | 7 | Samony | jackson | B
60 | 1998-03-02 | 6 | 7 | 720 | 9 | Orange | Jackson | B
70 | 1998-05-06 | 9 | 7 | 150 | 4 | Samsonic | pleasant | J
70 | 1998-05-06 | 9 | 7 | 150 | 6 | Panasung | oaktown | J
70 | 1998-05-06 | 9 | 7 | 150 | 7 | Samony | jackson | B
70 | 1998-05-06 | 9 | 7 | 150 | 9 | Orange | Jackson | B
80 | 1999-07-23 | 4 | 12 | 200 | 4 | Samsonic | pleasant | J
80 | 1999-07-23 | 4 | 12 | 200 | 6 | Panasung | oaktown | J
80 | 1999-07-23 | 4 | 12 | 200 | 7 | Samony | jackson | B
80 | 1999-07-23 | 4 | 12 | 200 | 9 | Orange | Jackson | B
90 | 2000-07-23 | 9 | 12 | 2300 | 4 | Samsonic | pleasant | J
90 | 2000-07-23 | 9 | 12 | 2300 | 6 | Panasung | oaktown | J
90 | 2000-07-23 | 9 | 12 | 2300 | 7 | Samony | jackson | B
90 | 2000-07-23 | 9 | 12 | 2300 | 9 | Orange | Jackson | B
(36 rows) |
Okay, @cpcloud, so if I understand the semantic we need to match so as not to violate least surprise for our database-minded friends, shared column names should behave as if they had an implicit suffix to distinguish them? |
@dsm054 Yep. |
I've got this working for |
Any update on this? Would love to see this feature/enhancement in an upcoming release |
For API purposes, would it be useful to just allow an arbitrary join condition? That way, cross-join is just the special case of no condition ( |
@pkch |
I just came across this thread after looking to an answer to the same question. I think the following API would be great, noting that when the For For For My experience is the same as noted above, in that the best way to accomplish this in pandas currently is:
That works, but in my opinion isn't readable enough to use in polite company. Thoughts on how to move forward? |
I was just thinking about this the other day when I was looking at old branches. It seems reasonable to try to get this into 0.21. |
I'd like to get some input on the possible behaviour of a cross join. I have a candidate implementation but while testing different cases I came across some decisions I want people to have a chance to think about. Partly this is because after lots of time spent on SO, I think we chose the wrong default for OverviewBased on the earlier discussion, I'm proposing a new option As an example:
Parameters
IndexFor me, the interesting question is what to do with
Advantage: very easy to explain and understand. Also doesn't prevent us from changing our minds later, as it's much easier to give code which used to fail a new meaning than to change the meaning of code which used to give a result. If you want the index information to be preserved, you'd still have the option of calling Disadvantage: pandas code is already too cluttered with
In [72]: df0 = pd.DataFrame({"A": [1,2]}, index=[10,20]) In [73]: df1 = pd.DataFrame({"B": [3,4,5]}, index=[20,30,40]) In [74]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1) Out[74]: A B 0 1 3 1 1 4 2 1 5 3 2 3 4 2 4 5 2 5 In [75]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True).drop('m', axis=1) Out[75]: A B 20 1 3 30 1 4 40 1 5 20 2 3 30 2 4 40 2 5 In [76]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1) Out[76]: A B 10 1 3 10 1 4 10 1 5 20 2 3 20 2 4 20 2 5 In [77]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True, right_index=True).drop('m', axis=1) Out[77]: A B 20 2 3 That is, in the case of Advantage: the behaviour, while not obvious, can be explained by reference to the workaround everyone uses to get the cross join now. Disadvantage: it's really not obvious at first glance that this is the behaviour you get if you specify 3a) Do the same as in #2, but instead of keeping only the common indices when
Advantage: now the cross join produces the expected number of rows, and there's a way to preserve the index information if you want it, while the default (with Disadvantage: now it really feels like we're reinterpreting 3b) Do the same as in (3a), but change the interpretation of the index flags in the case of Advantage: all of those of 3a, with a much easier way to explain what gets used in the resulting index. Disadvantage: inverts the behaviour from anyone used to the ConclusionThe fact we have indices to worry about complicates things a bit. Option #1 (raise) is certainly the simplest. I think option #2, although I like the consistency with the workaround, is ruled out by the fact it doesn't always return the expected number of elements. I don't think I've ever used Thoughts? |
Wow, great writeup @dsm054. I agree with ruling out option 2, since it could yield the wrong row count. And 3b stands out to me as the least surprising behavior. Namely,
You mentioned that 3b inverts the usual interpretation of As for option 1, can you elaborate on your comment about |
On 1, I meant that you can reset_index before you call merge:
On 3b's "inversion", what I mean is that if you think of the cross join as an inner merge on a dummy column, then when you do
which is the opposite relation of X_index=True to the resulting index from the one you might expect. BUT given #16228, which would forbid the use of left_index/right_index being set when left_on or right_on is set, maybe this isn't a problem, because either we follow option #1 and bypass the whole issue, or there's not going to be any other behaviour to interfere with anymore because you won't be setting left_on and left_index together. |
Thanks, got it! Seems like for consistency, #16228 should also aim to forbid If we're going down the path of disallowing those combinations of args, obviously option 1 is the way to go. |
yes #16228 would disallow combo operations here (and it should account for any of index setting w/ on setting) |
I come up with an idea to do it:
It might be too easy to build a particular function `merge(df1,df2,how='cross') for it. |
Okay, I want to bring this back. If to avoid ambiguity and weird corner cases of interpretation, if we disallow on, left_on, right_on, left_index, and right_index, then we can pretty easily have
@jreback: any objection if I bring this back with a PR? |
Hi, I've recently written a canonical post on computing the cartesian product of two DataFrames on SO.
This is just a very basic example and does not handle column names and suffixes. It is by no means the fastest possible method, but I feel it could be a good starting point in terms of simplicity and performance. I have added performance benchmarks in the answer, if you would like to check it out. |
@dsm054 have to have this, I think should address #16228 first, then the choice of what to do is easier. I would simply do the 'easy' impl of adding a column (you could prob have to generate a temporary hash column to avoid columns) is fine. In a later PR could explore if this is performant (might be ok). @Coldsp33d I am not sure that soln is going to work on mixed dtypes, nor will it easily handle things like column suffixing. |
@jreback Fair point! Perhaps we can substitute the calls to |
Was there ever any conclusion here? how='cross' would be super helpful. |
a pull request for a naive implementation would be a good start |
It looks like it was not implemented yet... |
Hi, Is this topic abandoned? |
Really expect this..... |
Seems, this has not implemented yet. I'm using the latest version-1.0.3 |
Will this feature get implemented or it's still in discussion ? Thank You |
@ishmeetk thinhs are implemented when someone from the community submits a pull request. this is not hard and would welcome it the core team can provide review |
related here: #7203
If I haven't missed something, there's no simple way to make cartesian product method in pandas. For example, see http://stackoverflow.com/questions/19684765/pandas-extending-the-dataframe-by-adding-another-levels/19684948#19684948.
To do cross join, one have to add some keys to DataFrames, like this:
How about adding possibility to pass how='cross' into merge() and join() methods?
The text was updated successfully, but these errors were encountered: