ENH: Implement cross method for Merge Operations #37864
not bad, can you add some asv benchmarks as well?
pandas/core/reshape/merge.py
@@ -591,6 +591,8 @@ def __init__(
    ):
        _left = _validate_operand(left)
        _right = _validate_operand(right)
        if how == "cross":
            _left, _right, how, on = self._create_cross_configuration(_left, _right, on)
return the new column here (in addition to the other values)
Done
pandas/core/reshape/merge.py
        return result.__finalize__(self, method="merge")

    def _maybe_drop_cross_column(self, result: "DataFrame"):
pass the col in here (type the output)
pandas/core/reshape/merge.py
    def _create_cross_configuration(
        self, _left, _right, on
    ) -> Tuple["DataFrame", "DataFrame", str, str]:
        if on is not None:
this validation should be done first (IOW maybe move create_cross_configuration lower where you are calling)
Just to be sure that I understand you correctly: calling the function _create_cross_configuration should be done after calling _validate_specification? In that case I have moved the parts around.
pandas/core/reshape/merge.py
        cross_col = f"{max([*_left.columns, *_right.columns])}_cross"
        _left = _left.copy()
        _right = _right.copy()
        _left.insert(loc=0, value=1, column=cross_col)
you can just use .assign (IOW put it at the end)
Done
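The helper column label in the diff above is built from the largest existing label plus a suffix, which keeps it from colliding with a real column. A minimal sketch of that idea in plain Python, assuming all labels are strings; `unique_cross_label` is a hypothetical name used only for illustration, not a pandas function:

```python
def unique_cross_label(left_cols, right_cols):
    """Build a label that cannot already exist in either frame.

    Hypothetical helper: mirrors the f-string from the diff above.
    Assumes every label is a string, so max() compares lexicographically.
    """
    largest = max([*left_cols, *right_cols])
    # Appending a suffix makes the result sort strictly after `largest`,
    # so it differs from every existing label.
    return f"{largest}_cross"


label = unique_cross_label(["a", "b"], ["b", "c"])
print(label)  # c_cross
```

Note that mixed label types (for example integers next to strings) would make `max` raise a `TypeError` in Python 3, so this naming scheme relies on comparable labels.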
pandas/core/reshape/merge.py
    def _validate_specification(self):
        if hasattr(self, "_cross"):
instead of checking the attribute, rather can you check if how is 'cross'?
Depending on the answer above, this is done.
Thx for the feedback, will probably start implementing the changes in the evening.
pandas/core/reshape/merge.py
                or self.on is not None
            ):
                raise MergeError(
                    "Can not pass any merge columns when using cross as merge method"
can you say that left_on,right_on,on must be None, and left_index,right_index must be False
Thx, done
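The check requested above could look roughly like the following. This is a standalone sketch, not the code from the PR: `validate_cross_arguments` is a hypothetical function, and the local `MergeError` class is a stand-in for `pandas.errors.MergeError` so that the example runs without pandas.

```python
class MergeError(ValueError):
    """Stand-in for pandas.errors.MergeError (assumption for this sketch)."""


def validate_cross_arguments(
    on=None, left_on=None, right_on=None, left_index=False, right_index=False
):
    """Reject any key specification when the merge method is 'cross'."""
    has_key_columns = on is not None or left_on is not None or right_on is not None
    if has_key_columns or left_index or right_index:
        raise MergeError(
            "left_on, right_on and on must be None, and left_index and "
            "right_index must be False when how='cross'"
        )


validate_cross_arguments()  # no keys given: passes silently

try:
    validate_cross_arguments(on="a")
except MergeError as err:
    message = str(err)
```

Naming every offending parameter in the message tells the caller exactly which arguments to drop, which is the point of the review comment.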
some comments, pls also add all examples from the OP (hopefully was more than one).
also might be some examples on SO can add as tests.
pandas/core/frame.py
@@ -221,6 +223,11 @@
      join; sort keys lexicographically.
    * inner: use intersection of keys from both frames, similar to a SQL inner
      join; preserve the order of the left keys.
    * cross: creates the karthesian product from both frames, preserves the order
cartesian
German influence, sorry :)
@@ -341,6 +348,24 @@
...
ValueError: columns overlap but no suffix specified:
    Index(['value'], dtype='object')

>>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
can you add an example of an inner and left merge here (and put them right before this)
Done, could you please check if this is similar to what you have in mind?
pandas/core/reshape/merge.py
@@ -1200,9 +1218,50 @@ def _maybe_coerce_merge_keys(self):
                    typ = rk.categories.dtype if rk_is_cat else object
                    self.right = self.right.assign(**{name: self.right[name].astype(typ)})

    def _create_cross_configuration(
        self, _left, _right
name these left, right
Done
I added a few tests covering mixed dtypes, nulls, more columns and different lengths
small requests, ping on green
pandas/core/reshape/merge.py
        Parameters
        ----------
        _left: DataFrame
can you match the signature
Done
pandas/core/reshape/merge.py
        Returns
        -------
        a tuple (_left_df, _right_df, how, cross_col) representing the adjusted
same here
Done
@@ -803,3 +805,27 @@ def test_join_inner_multiindex_deterministic_order():
        index=MultiIndex.from_tuples([(2, 1, 4, 3)], names=("b", "a", "d", "c")),
    )
    tm.assert_frame_equal(result, expected)
can you move to test_cross_merge (same dir)
If I understand you correctly, it is directly below the other cross tests
@jreback greenish. Failure unrelated

@pytest.mark.parametrize(
    ("input_col", "output_cols"), [("b", ["a", "b"]), ("a", ["a_x", "a_y"])]
sorry maybe i wasn't clear, can you make a new file called test_merge_cross.py and put these there.
Aaah ok, created the file and moved the tests
one more case
if on is a duplicate column what happens? (raising is ok)
Good point, deleted the column before. Now we are raising a

        left.join(right, how="cross", on="a")


def test_merge_cross_duplicate_on_column():
didn't mean this case (I don't think this is actually possible to happen).
I mean what if you have an input like
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
pd.merge(left, left, how='cross', on=['a', 'a'])
Could happen if we are really really unlucky :)
We do not allow on columns in case of cross, so we are safe with this
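To illustrate the point about `on` being rejected for cross merges, here is a small sketch using the reviewer's input. It assumes a pandas version that ships `how="cross"` (1.2 or later) and that the rejection surfaces as `pandas.errors.MergeError`, as the validation discussed in this PR suggests:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Passing any key columns together with how="cross" is rejected up front,
# so duplicate entries in `on` never reach the merge logic.
try:
    pd.merge(left, left, how="cross", on=["a", "a"])
    raised = False
except MergeError:
    raised = True

# Without key arguments the cross merge pairs every row with every row;
# overlapping column names receive the default suffixes.
result = pd.merge(left, left, how="cross")
print(raised, result.shape)
```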
Should I reverse my change then?
yeah revert this change (it's not worth checking)
Done, will ping when green
ok i think your version before the last commit is good. ping on green.
This reverts commit 4589651
Conflicts: doc/source/whatsnew/v1.2.0.rst
thanks @phofl very nice!
Great job, and thanks for implementing this! I've given your solution the advertisement it deserves on stack. Excited to see this in 1.2.
@phofl created an issue for benchmarks. It might make sense to change to a broadcasting impl (like refs on SO) or to use the pandas cartesian product function, if the benchmarks show this. Can create an issue, or better to do benchmarks first.
@jreback If you are referring to asvs, I have added two benchmarks through this pr
oh right u did! ok
Will try to look into the numpy or cartesian function when I've got a bit more time uninterrupted
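One possible reading of the "broadcasting impl" mentioned above is to build the row positions with NumPy and index both frames directly, skipping the helper key column. The sketch below is an assumption about what such an approach could look like, not code from this PR; `cross_via_positions` is a hypothetical name, and the function assumes the two frames have no overlapping column names.

```python
import numpy as np
import pandas as pd


def cross_via_positions(left, right):
    """Cartesian product of two frames via repeated/tiled row positions.

    Hypothetical helper; assumes non-overlapping column names.
    """
    n_left, n_right = len(left), len(right)
    # Each left row is repeated once per right row ...
    left_pos = np.repeat(np.arange(n_left), n_right)
    # ... while the full block of right rows is tiled once per left row.
    right_pos = np.tile(np.arange(n_right), n_left)
    left_part = left.iloc[left_pos].reset_index(drop=True)
    right_part = right.iloc[right_pos].reset_index(drop=True)
    return pd.concat([left_part, right_part], axis=1)


left = pd.DataFrame({"left": ["foo", "bar"]})
right = pd.DataFrame({"right": [1, 2, 3]})
result = cross_via_positions(left, right)
print(result)
```

Whether this is actually faster than dispatching to an inner merge is exactly what the benchmarks would have to show.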
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This is a first draft of a cross method for merge operations. As suggested by @jreback I added a new column and dispatched to inner as a first naive implementation. I am currently wondering if there would be a better place to adjust the arguments than the places I selected. I wanted to avoid modifying self in some method called within the constructor, so I modified the inputs before adding the parameters to self. This currently does not support the cross method for join. If we get a consensus on where to put the modifications, I would add support for join. I probably also have to add more tests and something to the user guide.
Have to look into the performance afterwards. Copies of the left and right frames may be a big performance hit for bigger frames?
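The approach described above (add a constant helper column to both frames, dispatch to an inner merge, then drop the helper) can be sketched outside of pandas internals as follows. `cross_via_helper_key` and the default label `_cross` are hypothetical names chosen for this illustration; the PR itself derives the label from the existing columns.

```python
import pandas as pd


def cross_via_helper_key(left, right, key="_cross"):
    """Cartesian product by inner-merging on a constant helper column.

    Hypothetical helper illustrating the idea; not the code from the PR.
    """
    if key in left.columns or key in right.columns:
        raise ValueError(f"helper column {key!r} already exists")
    # assign() returns new frames, so the caller's inputs stay untouched.
    merged = left.assign(**{key: 1}).merge(
        right.assign(**{key: 1}), on=key, how="inner"
    )
    # Every row matches every other row on the constant key; drop it again.
    return merged.drop(columns=key)


left = pd.DataFrame({"left": ["foo", "bar"]})
right = pd.DataFrame({"right": [1, 2, 3]})
result = cross_via_helper_key(left, right)
print(result)
```

Because `assign` copies both frames, this sketch carries the same cost raised in the description: for large inputs the extra copies may be noticeable.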