BUG: Fix concat of frames with extension types (no reindexed columns) #34339
Conversation
Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-06-18 13:27:00 UTC
@@ -443,7 +452,7 @@ def _is_uniform_join_units(join_units: List[JoinUnit]) -> bool:
     # cannot necessarily join
     return (
         # all blocks need to have the same type
-        all(isinstance(ju.block, type(join_units[0].block)) for ju in join_units)
+        all(type(ju.block) is type(join_units[0].block) for ju in join_units)
this change is to ensure DatetimeBlock and DatetimeTZBlock don't incorrectly take this "uniform join units" code path
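A minimal sketch of why the exact-type check matters, assuming (as in the internals of that era) that DatetimeTZBlock subclasses DatetimeBlock; the toy classes below are illustrative stand-ins, not the real ones:

```python
class DatetimeBlock:
    pass

class DatetimeTZBlock(DatetimeBlock):  # simplified stand-in for the real hierarchy
    pass

blocks = [DatetimeBlock(), DatetimeTZBlock()]

# isinstance() accepts subclasses, so a mix of tz-naive and tz-aware blocks
# would wrongly look "uniform":
print(all(isinstance(b, type(blocks[0])) for b in blocks))  # True

# an exact type comparison distinguishes them, so the mixed case no longer
# takes the uniform-join-units fast path:
print(all(type(b) is type(blocks[0]) for b in blocks))       # False
```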
Sorry for the delay.
I didn't immediately see the reason for your changes in dtypes/concat, but I suppose it's related to:
> Now I also try to make use of this for DataFrame columns, if there is any extension dtype in the to-concat arrays.
So I think I'm following things now.
Do we need a whatsnew note?
So about the change in dtypes/concat: the axis is a bit confusing here, as we are actually always concatting the EAs in the same way (making one long EA), but the axis comes from the "higher" level, i.e. whether we are concatting series or frames. So that is indeed, as you say, related to the fact that I now also take the EA-specific code path for DataFrame columns (which results in axis=1) and not only for Series (axis=0).
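As a small illustration of the point about the axis (this uses the private `_concat_same_type` interface purely for demonstration; it is not part of the PR itself):

```python
import pandas as pd

arr1 = pd.array([1, 2], dtype="Int64")
arr2 = pd.array([3, 4, None], dtype="Int64")

# Whether the caller was concatenating Series (axis=0) or DataFrame columns
# (axis=1 at the block level), the extension arrays themselves are always
# glued together into one long 1D array of the same dtype.
combined = type(arr1)._concat_same_type([arr1, arr2])
print(combined.dtype)  # Int64
print(combined.ndim)   # 1
```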
When it comes to the whatsnew note: I added a general note about concat with extension arrays (that also references the previous PR that made this possible for Series).
This should be ready for review / merge now.
@@ -319,6 +319,15 @@ def _concatenate_join_units(join_units, concat_axis, copy):
                    concat_values = concat_values.copy()
            else:
                concat_values = concat_values.copy()
+    elif any(isinstance(t, ExtensionArray) for t in to_concat):
+        # concatting with at least one EA means we are concatting a single column
I find it odd that the logic is now split between here and concat_compat. I would strongly encourage putting this logic there.
Having this here actually ensures a clear separation of concerns between dtypes/concat.py and internals/concat.py.
The concat_compat from dtypes/concat.py only needs to worry about concatting 1D arrays (for the case of ExtensionArrays, to be clear), and the code here in internals deals with ensuring the result has the proper dimension depending on how it is put in the BlockManager.
This keeps the BlockManager-related logic inside the internals, and ensures that concat_compat / dtypes/concat.py doesn't need to care about internals-specific details.
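A tiny sketch of the dimension concern described above (illustrative only; the variable names are made up and this is not the PR's actual code):

```python
import numpy as np
import pandas as pd

# An extension-array result stays 1D and can go straight into an ExtensionBlock.
ea_result = pd.array([1, 2, 3], dtype="Int64")

# A plain ndarray result is also 1D after concatenation, but a non-EA Block
# stores its values as 2D, so the internals need to add the extra dimension back.
np_result = np.concatenate([np.array([1.0, 2.0]), np.array([3.0])])
block_values = np.atleast_2d(np_result)

print(ea_result.ndim, np_result.ndim, block_values.shape)  # 1 1 (1, 3)
```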
my point is you should have a concat_ea which you call here (right near concat_compat), otherwise the logic is scattered. the dispatching logic is fine; the problem is the operational logic does not belong here. (alternative is to pass another arg to concat_compat to do what you are doing here).
If there is a different concat_ea, then at each location where concat_compat is called, we would need to do a if any_ea: concat_ea(...); else: concat_compat(...).
What do you mean exactly by "the operational logic does not belong here"? What part of the code below is the "operational" logic?
every bit you added is operational logic.
pls do something like this:

elif any(isinstance(t, ExtensionArray) for t in to_concat):
    values = concat_ea(...)
this is moving L324-330. As I have said before you are mixing 2 types of logic here by adding this.
To make sure I understand you correctly: you want the following function to be put in dtypes/concat.py:

def concat_ea_frame(to_concat, concat_axis):
    to_concat = [t if isinstance(t, ExtensionArray) else t[0, :] for t in to_concat]
    concat_values = concat_compat(to_concat, axis=concat_axis)
    if not isinstance(concat_values, ExtensionArray):
        # if the result of concat is not an EA but an ndarray, reshape to
        # 2D to put it in a non-EA Block
        concat_values = np.atleast_2d(concat_values)
    return concat_values

and have it called here. Is that correct?
I can do that if it ends this discussion, but just to note: this would create a function in dtypes/concat.py with logic tied to the internals (reshaping to conform to the block's dimensions is something that only needs to be known to the internals code) and that would only be used in the internals (here, in internals/concat.py).
yes, this is exactly in line with concat_datetime and concat_compat. I am surprised that you think this should be anywhere else.
> with logic tied to the internals (reshaping to conform to the block's dimensions is something that only needs to be known to the internals code) and that would only be used in the internals

i agree with Joris; internals-specific logic belongs in internals
i guess
Planning to merge tomorrow if there are no objections.
Co-authored-by: Tom Augspurger <TomAugspurger@users.noreply.github.com>
Sorry @jorisvandenbossche, one more merge? :)
@jorisvandenbossche IIRC there was some discussion of implementing Categorical._concat_same_type directly in the method and having union_categoricals defer to the method rather than the other way around. Is that awaiting consensus or just awaiting implementation?
@jbrockmendel the discussion was here: #33607 (comment), AFAIK just awaiting someone doing it |
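A small sketch of the relationship being discussed (purely illustrative; the categoricals below are made up, and the wrapper relationship is as described in the comments above, not something this PR changes):

```python
import pandas as pd
from pandas.api.types import union_categoricals

c1 = pd.Categorical(["a", "b"])
c2 = pd.Categorical(["b", "c"])

# Per the discussion above, Categorical._concat_same_type currently defers to
# union_categoricals; the proposal in #33607 is to invert that, implementing the
# logic in _concat_same_type and having union_categoricals call it.
result = union_categoricals([c1, c2])
print(result)  # a single Categorical ['a', 'b', 'b', 'c'] with categories ['a', 'b', 'c']
```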
This PR tries to fix the case of concatting dataframes with extension dtypes and non-extension dtypes, like this example (giving object dtype on master, instead of preserving Int64 dtype):
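A reproduction sketch of the kind of case meant here (the exact example from the PR body is not shown on this page; the frames below are illustrative):

```python
import pandas as pd

# Two frames with the same columns: one nullable Int64 column and one plain float column.
df1 = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64"), "b": [0.1, 0.2]})
df2 = pd.DataFrame({"a": pd.array([3, 4], dtype="Int64"), "b": [0.3, 0.4]})

result = pd.concat([df1, df2])
# On master the "a" column could come back as object dtype;
# with this PR it should stay Int64.
print(result.dtypes)
```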
Up to now, the _get_common_dtype machinery of the ExtensionArrays was only used for concatting Series (axis=0 in concat_compat). Now I also try to make use of this for DataFrame columns, if there is any extension dtype in the to-concat arrays. This is checked in _concatenate_join_units. And we need to handle the dimension mismatch: I decided to handle that inside the internals code (reducing the dimension for non-EAs, and if the concat result is not an EA, adding back a dimension), so concat_compat does not need to be aware of this.

Partly overlapping with #33522 (cc @TomAugspurger), but this PR does not yet fix the original case the other PR is fixing: concat with reindexed (thus all-missing) columns becoming object instead of preserving the extension dtype.