ENH: Support EAs in Series.unstack #23284
Conversation
```python
@@ -102,7 +102,10 @@ def copy(self, deep=False):

    def astype(self, dtype, copy=True):
        if isinstance(dtype, type(self.dtype)):
            return type(self)(self._data, context=dtype.context)
        return super(DecimalArray, self).astype(dtype, copy)
        # need to replace decimal NA
```
`Series.equal` doesn't consider `Series([np.nan])` equal to `Series([Decimal('NaN')])`. I made this change mainly to facilitate that.
Hello @TomAugspurger! Thanks for updating the PR.
Comment last updated on October 22, 2018 at 21:41 UTC.
A couple minor comments
pandas/core/reshape/reshape.py
Outdated
```python
@@ -947,3 +950,22 @@ def make_axis_dummies(frame, axis='minor', transform=None):
    values = values.take(labels, axis=0)

    return DataFrame(values, columns=items, index=frame.index)


def unstack_extension_series(series, level, fill_value):
```
Can you move this function up to around line 424? It looks like this file has all the `unstack`-related code grouped together first, followed by the `stack` code, so having `unstack_extension_series` at the bottom seems a little out of place.
```python
n = index.nlevels
levels = list(range(n))
# [0, 1, 2]
# -> [(0,), (1,), (2,) (0, 1), (1, 0)]
```
Shouldn't this be `-> [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]`? Not super important, but caused me a brief moment of confusion.
Yes, you're correct.
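For the curious, one way to generate exactly that list of level combinations (a sketch; the actual test parametrization may be built differently):

```python
from itertools import permutations

n = 3
levels = list(range(n))

# All orderings of one or two levels, matching the corrected comment above.
combos = [combo for size in (1, 2) for combo in permutations(levels, size)]
print(combos)
# [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```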
Just fixed the decimal failures. There will be a remaining test failure I haven't addressed yet. We had a test that did

```python
In [2]: cat = pd.Categorical(['a', 'a', 'b'])

In [3]: cat.take([0, -1, -1], fill_value='d', allow_fill=True)
```

Should that raise? Return a Categorical with categories …? I'm having some deja vu right now; I think we've discussed this before. I think if we were designing that today, we wouldn't have allowed that.
doc/source/whatsnew/v0.24.0.txt
Outdated
```
@@ -807,6 +807,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
- :meth:`Series.unstack` no longer converts extension arrays to object-dtype ndarrays. The output ``DataFrame`` will now have the same dtype as the input. This changes behavior for Categorical and Sparse data (:issue:`23077`).
```
really? what does this change for Categorical?
Previously `Series[Categorical].unstack()` returned `DataFrame[object]`. Now it'll be a `DataFrame[Categorical]`, i.e. `unstack()` preserves the CategoricalDtype.
Ah, I forgot. Previously, we internally went `Categorical -> object -> Categorical`. Now we avoid the conversion to object.

So the changes from 0.23.4 will be:

- `Series[category].unstack()` avoids a conversion to object
- `Series[Sparse].unstack()` is sparse (no intermediate conversion to dense)

Once DatetimeTZ is an ExtensionArray, we'll presumably preserve that as well. On 0.23.4, we convert to `datetime64[ns]`:
```python
In [48]: index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])

In [49]: ser = pd.Series(pd.date_range('2000', periods=3, tz="US/Central"), index=index)

In [50]: ser.unstack().dtypes
Out[50]:
0    datetime64[ns]
1    datetime64[ns]
dtype: object
```
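For comparison, a quick check of the Categorical case described above (illustrative only; on 0.23.x both columns report `object`, while with this change they report `category`):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])
ser = pd.Series(pd.Categorical(['a', 'b', 'a']), index=index)

# After this PR, each unstacked column keeps the CategoricalDtype.
print(ser.unstack().dtypes)
```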
ok, this might need a larger note then
I'm actually rethinking this. Maybe we would want to allow it. It's a pretty clear statement of user intent, and I could easily imagine someone wanting to do something like "take, but fill missing values (-1) with 'None' or 'other'".
Do you want to resolve the `fill_value` question here, or leave it for #23296? (As I mentioned there: I would preserve the dtype, which then means only allowing a fill_value that is NaN or within the categories.)
We can ignore … If we agree that … I'm not sure which is preferred. This is blocked by #23296 for now.
Apparently,

```python
In [51]: index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])

In [52]: ser = pd.Series(pd.Categorical(['a', 'b', 'a']), index=index)

In [53]: ser.unstack(fill_value='c')
Out[53]:
     0  1
A    a  b
B  NaN  a

In [54]: ser.unstack(fill_value='a')
Out[54]:
   0  1
A  a  b
B  a  a
```

We just silently didn't fill the …
+1
https://github.com/pandas-dev/pandas/pull/10246/files#diff-79e0785420ae1c686623848c4d561486R261 indicates that this was deliberate, but I didn't see any discussion / documentation around it, so I'm calling it a bug.
Does this now also work for unstacking a DataFrame with an EA column? If so, maybe add that to the test case?
Codecov Report

```
@@            Coverage Diff             @@
##           master   #23284      +/-   ##
==========================================
- Coverage   92.25%   92.23%   -0.02%
==========================================
  Files         161      161
  Lines       51186    51198      +12
==========================================
+ Hits        47222    47224       +2
- Misses       3964     3974      +10
```

Continue to review full report at Codecov.
lgtm, modulo a subsection in the docs
```python
right_na = right.isna()

def convert(x):
    # need to convert array([Decimal(NaN)], dtype='object') to np.NaN
    # because Series[object].isnan doesn't recognize decimal(NaN) as
```
where does this come up now? e.g. what's an example?
```python
In [21]: import pandas as pd

In [22]: import pandas.util.testing as tm

In [23]: from pandas.tests.extension.decimal import to_decimal

In [24]: ser = pd.Series(to_decimal(['1.0', 'NaN']))

In [25]: tm.assert_series_equal(ser.astype(object), ser.astype(object))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-25-5356adf50e72> in <module>
----> 1 tm.assert_series_equal(ser.astype(object), ser.astype(object))

~/sandbox/pandas/pandas/util/testing.py in assert_series_equal(left, right, check_dtype, check_index_type, check_series_type, check_less_precise, check_names, check_exact, check_datetimelike_compat, check_categorical, obj)
   1293                             check_less_precise=check_less_precise,
   1294                             check_dtype=check_dtype,
-> 1295                             obj='{obj}'.format(obj=obj))
   1296
   1297     # metadata comparison

~/sandbox/pandas/pandas/_libs/testing.pyx in pandas._libs.testing.assert_almost_equal()
     64
     65
---> 66 cpdef assert_almost_equal(a, b,
     67                           check_less_precise=False,
     68                           bint check_dtype=True,

~/sandbox/pandas/pandas/_libs/testing.pyx in pandas._libs.testing.assert_almost_equal()
    178         msg = '{0} values are different ({1} %)'.format(
    179             obj, np.round(diff * 100.0 / na, 5))
--> 180         raise_assert_detail(obj, msg, lobj, robj)
    181
    182     return True

~/sandbox/pandas/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
   1080         msg += "\n[diff]: {diff}".format(diff=diff)
   1081
-> 1082     raise AssertionError(msg)
   1083
   1084

AssertionError: Series are different

Series values are different (50.0 %)
[left]:  [1.0, NaN]
[right]: [1.0, NaN]
```

We do the `astype(object)` to build the expected.
I think this is a bug, and fixing it would solve the above, I think.

```python
In [7]: ser.astype(object).isna()
Out[7]:
0    False
1    False
dtype: bool
```
@TomAugspurger actually, can you explain why the current code in master is not working fine? Why do you need to convert to object? Because before, there were already calls to `isna` to check NaNs and non-NaNs separately.
this is my point. I think there is a bug somewhere here, e.g. isna is maybe not dispatching to the EA?
So it's all about creating the expected result. When we go to do the final assert that the values match, we do

```python
result = result.astype(object)
self.assert_frame_equal(result, expected)
```

but `self.assert_frame_equal` will say that `Series([Decimal('NaN')], dtype='object')` isn't equal to itself, since it doesn't consider that value NA.
Ah, OK, I somehow thought you were doing the `astype(object)` inside the assert testing machinery above, not in the actual expected result. Yes, that makes sense now.

For me it is fine to keep this hack in here for now. In the end that is somehow the purpose of using the class instances for `assert_.._equal`, so a specific EA can override it.
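As an aside, a minimal sketch of the kind of comparison helper being discussed, treating `Decimal('NaN')` as missing when comparing object-dtype Series (illustrative only; `decimal_na_equal` is a hypothetical name, not part of pandas or this PR):

```python
import decimal

import numpy as np
import pandas as pd


def decimal_na_equal(left: pd.Series, right: pd.Series) -> bool:
    # Treat Decimal('NaN') (and float NaN / None) as missing when comparing
    # two object-dtype Series, since Series.isna() does not flag Decimal('NaN').
    def is_na(x):
        if x is None or (isinstance(x, float) and np.isnan(x)):
            return True
        return isinstance(x, decimal.Decimal) and x.is_nan()

    left_na = left.map(is_na)
    right_na = right.map(is_na)
    if not left_na.equals(right_na):
        return False
    # Compare only the non-missing positions.
    return left[~left_na].equals(right[~right_na])
```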
I think the `.astype(object)` for Decimal is incorrect.

```python
In [5]: ser.values
Out[5]: DecimalArray(array([Decimal('1.0'), Decimal('NaN')], dtype=object))

In [6]: ser.astype(object)
Out[6]:
0    1.0
1    NaN
dtype: object

In [7]: ser.astype(object).values
Out[7]: array([Decimal('1.0'), Decimal('NaN')], dtype=object)
```

I think these should be converted to `np.nan` and not `Decimal('NaN')`, as this is just a numpy array.
object-dtype can store anything, including decimal objects. It'd be strange to only convert `Decimal("NaN")` to `np.nan`, and not `Decimal('1.0')` to `1.0`, no?
Yeah, this seems correct to me.
Going to merge; if we need to further discuss this, we can do that in another issue (it's not really related anymore to actually fixing unstack).
happy to merge, just some questions about the decimal nan checking
K, I'll fix the …

Hmm, fixing …

this would only be for object input, and can you all …

Ahh I thought we would have to pay the extra cost on string dtypes too, but it seems like those are handled before we get to a generic object dtype. This should be doable.
Well, I'm going back to -1 on supporting decimal here, unless we can find a better way than a basic isinstance.

```diff
diff --git a/doc/source/whatsnew/v0.24.0.txt b/doc/source/whatsnew/v0.24.0.txt
index f449ca532..c8c5db611 100644
--- a/doc/source/whatsnew/v0.24.0.txt
+++ b/doc/source/whatsnew/v0.24.0.txt
@@ -1227,6 +1227,7 @@ Missing
 - Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
 - :func:`Series.isin` now treats all NaN-floats as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
 - :func:`unique` no longer mangles NaN-floats and the ``NaT``-object for `np.object`-dtype, i.e. ``NaT`` is no longer coerced to a NaN-value and is treated as a different entity. (:issue:`22295`)
+- :meth:`isna` now considers ``decimal.Decimal('NaN')`` a missing value (:issue:`23284`).

 MultiIndex

diff --git a/pandas/_libs/missing.pyx b/pandas/_libs/missing.pyx
index b87913592..4fa96f652 100644
--- a/pandas/_libs/missing.pyx
+++ b/pandas/_libs/missing.pyx
@@ -1,6 +1,7 @@
 # -*- coding: utf-8 -*-
 import cython
+import decimal
 from cython import Py_ssize_t

 import numpy as np
@@ -33,6 +34,8 @@ cdef inline bint _check_all_nulls(object val):
         res = get_datetime64_value(val) == NPY_NAT
     elif util.is_timedelta64_object(val):
         res = get_timedelta64_value(val) == NPY_NAT
+    elif isinstance(val, decimal.Decimal):
+        return val.is_nan()
     else:
         res = 0
     return res
@@ -71,6 +74,8 @@ cpdef bint checknull(object val):
         return get_timedelta64_value(val) == NPY_NAT
     elif util.is_array(val):
         return False
+    elif isinstance(val, decimal.Decimal):
+        return val.is_nan()
     else:
         return val is None or util.is_nan(val)
```
Some timings:

The object array is an object-dtype series with 20,000 elements. The decimal array is an object-dtype series with 20,000 decimal elements. I don't really care about the last one being 3.6x slower, since we're getting the correct result. I'm more concerned about the others.
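For reference, a rough harness for the kind of measurement described above (the 20,000-element setup comes from the comment; the exact construction and timing method are assumptions, and absolute numbers will differ by machine):

```python
import decimal
import timeit

import numpy as np
import pandas as pd

# Object-dtype Series of floats vs. object-dtype Series of Decimals.
obj_ser = pd.Series(np.random.randn(20_000), dtype=object)
dec_ser = pd.Series([decimal.Decimal(i) for i in range(20_000)], dtype=object)

for name, ser in [("object floats", obj_ser), ("decimals", dec_ser)]:
    per_call = timeit.timeit(lambda: pd.isna(ser), number=100) / 100
    print(f"isna() on {name}: {per_call * 1e3:.3f} ms")
```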
this moves calls to python land. Try …
According to https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#built-in-functions, both hasattr and isinstance are optimized (but require some interaction with python land).
so pretty similar. I don't really know why the object array would be 6x slower now though.
hmm i guess just the additional check is causing this. But a more general question: should we even be checking for this in an ndarray object array at all? e.g. we don't do this for a random foo object. It must be a decimal array (in which case you can just ask it `.isnan()`)?
Shall we leave possible changes/support for decimal in the internals for another issue or PR?
totally fine. @TomAugspurger can you xfail these tests rather than change them though, and create an issue to update.
Isn't the current code fine? It's contained in …

#23530 for isna(decimal). Fixed the merge conflict.

All green.

@TomAugspurger Thanks!
…fixed

* upstream/master: (47 commits)
  CLN: remove values attribute from datetimelike EAs (pandas-dev#23603)
  DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381)
  PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589)
  PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591)
  TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502)
  Update description of Index._values/values/ndarray_values (pandas-dev#23507)
  Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552)
  DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953)
  ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654)
  CI: Auto-cancel redundant builds (pandas-dev#23523)
  Preserve EA dtype in DataFrame.stack (pandas-dev#23285)
  TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468)
  BUG: raise if invalid freq is passed (pandas-dev#23546)
  remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562)
  BUG: Fix error message for invalid HTML flavor (pandas-dev#23550)
  ENH: Support EAs in Series.unstack (pandas-dev#23284)
  DOC: Updating DataFrame.join docstring (pandas-dev#23471)
  TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888)
  BUG: Return KeyError for invalid string key (pandas-dev#23540)
  BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852)
  ...
Closes #23077
This prevents ExtensionArray-backed series from being converted to object-dtype in unstack.
The strategy is to do a dummy unstack on an ndarray of integers, which provides the indices to `take` later on. We then concat together at the end. This provided decent performance, and seems pretty maintainable in the long run. I'll post some benchmarks later.
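A minimal sketch of that strategy (illustrative only, not the code merged here; `unstack_ea_sketch` is a made-up name, and it assumes a pandas version where `Series.array` and `ExtensionArray.take` are available):

```python
import numpy as np
import pandas as pd


def unstack_ea_sketch(series, level=-1, fill_value=None):
    # 1. Dummy unstack on positional integers: each cell of the result tells
    #    us which element of the original array belongs there (-1 = missing).
    dummy = pd.Series(np.arange(len(series)), index=series.index)
    indexer = dummy.unstack(level=level, fill_value=-1)

    # 2. Take from the ExtensionArray column by column (allow_fill turns the
    #    -1 markers into fill_value), then concat, preserving the EA dtype.
    arr = series.array
    columns = {
        col: pd.Series(
            arr.take(indexer[col].values, allow_fill=True, fill_value=fill_value),
            index=indexer.index,
        )
        for col in indexer
    }
    return pd.concat(columns, axis=1)
```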
Do we want to do DataFrame.stack() in the same PR?