Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement cross method for Merge Operations #37864

Merged
merged 30 commits into from
Nov 26, 2020
Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 15, 2020

This is a first draft of a cross method for merge operations. As suggested by @jreback I added a new column and dispatched to inner as a first naive implementation. I am currently wondering, if there would be a better place to adjust the arguments than the places I selected. I wanted to avoid to modify self in some method called within the constructor, so i modified the inputs before adding the parameters to self. This currently does not support the cross method for join. If we get a consensus where to put the modifications, I would add support for join. Also have to add more tests probably and something to the userguide

Have to look into the performance afterwards. Copies of the Left and Right frames may be a big performance hit for bigger frames?

@phofl phofl marked this pull request as draft November 15, 2020 16:35
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not bad, can you add some asv's as well

@@ -591,6 +591,8 @@ def __init__(
):
_left = _validate_operand(left)
_right = _validate_operand(right)
if how == "cross":
_left, _right, how, on = self._create_cross_configuration(_left, _right, on)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return the new column here (in addition to the other values)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return result.__finalize__(self, method="merge")

def _maybe_drop_cross_column(self, result: "DataFrame"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass the col in here (type the output)

def _create_cross_configuration(
self, _left, _right, on
) -> Tuple["DataFrame", "DataFrame", str, str]:
if on is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this validation should be done first (IOW maybe move create_cross_configuration lower where you are calling)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be sure, that I understand you correctly: Calling the function _create_cross_configuration should be done after calling _validate_specification? in this case I have moved the parts around

cross_col = f"{max([*_left.columns, *_right.columns])}_cross"
_left = _left.copy()
_right = _right.copy()
_left.insert(loc=0, value=1, column=cross_col)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just use .assign (IOW put it at the end)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def _validate_specification(self):
if hasattr(self, "_cross"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of checking the attribute, rather can you check if how is 'cross'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on answer above, this is done

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Enhancement labels Nov 18, 2020
@phofl
Copy link
Member Author

phofl commented Nov 18, 2020

Thx for the feedback, will start to implement the changes in the evening probably

@pep8speaks
Copy link

pep8speaks commented Nov 18, 2020

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-25 22:31:42 UTC

or self.on is not None
):
raise MergeError(
"Can not pass any merge columns when using cross as merge method"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you say that left_on,right_on,on must be None, and left_index,right_index must be False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done

@phofl phofl marked this pull request as ready for review November 21, 2020 00:09
@jreback jreback added this to the 1.2 milestone Nov 21, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments, pls also add all examples from the OP (hopefully was more than one).

also might be some examples on SO can add as tests.

@@ -221,6 +223,11 @@
join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
* cross: creates the karthesian product from both frames, preserves the order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cartesian

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

German influence, sorry :)

@@ -341,6 +348,24 @@
...
ValueError: columns overlap but no suffix specified:
Index(['value'], dtype='object')

>>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example of an inner and left merge here (and put them right before this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, could you please check if this is similar to what you have in mind?

@@ -1200,9 +1218,50 @@ def _maybe_coerce_merge_keys(self):
typ = rk.categories.dtype if rk_is_cat else object
self.right = self.right.assign(**{name: self.right[name].astype(typ)})

def _create_cross_configuration(
self, _left, _right
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name these left, right

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@phofl
Copy link
Member Author

phofl commented Nov 22, 2020

I added a few tests covering mixed dtypes, nulls, more columns and different lengths

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small requests, ping on green


Parameters
----------
_left: DataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you match the signature

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Returns
-------
a tuple (_left_df, _right_df, how, cross_col) representing the adjusted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -803,3 +805,27 @@ def test_join_inner_multiindex_deterministic_order():
index=MultiIndex.from_tuples([(2, 1, 4, 3)], names=("b", "a", "d", "c")),
)
tm.assert_frame_equal(result, expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you moe to test_cross_merge (same dir)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand you correctly, it is directly below the other cross tests

@phofl
Copy link
Member Author

phofl commented Nov 23, 2020

@jreback greenish. Failure unrelated



@pytest.mark.parametrize(
("input_col", "output_cols"), [("b", ["a", "b"]), ("a", ["a_x", "a_y"])]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry maybe i wasn't clear, can you make a new file called test_merge_cross.py and put these there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaah ok, created the file and moved the tests

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more case

if on is a duplicate column what happens? (raising is ok)

@phofl
Copy link
Member Author

phofl commented Nov 24, 2020

Good point, deleted the column before. Now we are raising a MergeError

left.join(right, how="cross", on="a")


def test_merge_cross_duplicate_on_column():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't mean this case (I don't think this is actually possible to happen).

I mean what if you have an input like

left=pd.DataFrame(['a': [1, 2], 'b': [3, 4]})
pd.merge(left, left, how='cross', on=['a', 'a'])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could happen if we are really really unlucky :)

We do not allow on columns in case of cross, so we are safe with this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I reverse my change then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah revert this change (its not worth checking)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, will ping when green

@jreback
Copy link
Contributor

jreback commented Nov 25, 2020

ok i think your version before the last commit is good. ping on green.

@jreback jreback merged commit d8a2e74 into pandas-dev:master Nov 26, 2020
@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

thanks @phofl very nice!

@Coldsp33d
Copy link

Great job, and thanks for implementing this! I've given your solution the advertisement it deserves on stack.

Excited to see this in 1.2.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

@phofl created an issue for benchmarks

it might make sense to change to use a broadcasting impl (like refs on SO) or use pandas cartesian product function - if the benchmarks show this - can create an issue or better to do benchmarks first

@phofl
Copy link
Member Author

phofl commented Nov 26, 2020

@jreback If you are referring to asvs, I have added two benchmarks through this pr

@phofl phofl deleted the 5401 branch November 26, 2020 13:43
@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

oh right u did! ok

@phofl
Copy link
Member Author

phofl commented Nov 26, 2020

Will try to look into the numpy or cartesian function when I've got a bit more time uninterrupted

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: cross join in merge() and join()
4 participants