Commit 7cde4da

Initial Review Fixes
1 parent 0b2969f commit 7cde4da

9 files changed: +280 -39 lines changed

doc/source/categorical.rst

+69-12
@@ -150,10 +150,10 @@ constructor to save the factorize step during normal constructor mode:
     splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
     s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
 
-.. _categorical.objectcreation.frame:
+.. _categorical.objectcreation.existingframe:
 
-Creating categories from a ``DataFrame``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Creating categories from an existing ``DataFrame``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. versionadded:: 0.22.0

@@ -169,15 +169,6 @@ if a column does not contain all labels:
     df['A'].dtype
     df['B'].dtype
 
-Note that this behavior is different than instantiating a ``DataFrame`` with categorical dtype, which will only assign
-categories to each column based on the labels present in each column:
-
-.. ipython:: python
-
-    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
-    df['A'].dtype
-    df['B'].dtype
-
 When using ``astype``, you can control the categories that will be present in each column by passing
 a ``CategoricalDtype``:

@@ -199,6 +190,72 @@ discussed hold with subselection.
     df[['A', 'B']] = df[['A', 'B']].astype('category')
     df.dtypes
 
+Note that you can use ``apply`` to set categories on a per-column basis:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']})
+    df = df.apply(lambda x: x.astype('category'))
+    df['A'].dtype
+    df['B'].dtype
+
+
+.. _categorical.objectcreation.frameconstructor:
+
+Creating categories from the ``DataFrame`` constructor
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. versionchanged:: 0.22.0
+
+.. warning::
+
+    Prior to version 0.22.0, the default behavior of the ``DataFrame`` constructor when a categorical dtype was
+    passed was to operate on a per-column basis, meaning that only labels present in a given column would be categories
+    for that column.
+
+To promote consistency of behavior, from version 0.22.0 onwards instantiating a ``DataFrame`` with categorical
+dtype will by default use all labels present in all columns when setting categories, even if a column does not
+contain all labels. This is consistent with the new ``astype`` behavior described above.
+
+Behavior prior to version 0.22.0:
+
+.. code-block:: ipython
+
+    In [2]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
+
+    In [3]: df
+    Out[3]:
+       A  B
+    0  a  c
+    1  b  d
+    2  c  e
+
+    In [4]: df['A'].dtype
+    Out[4]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
+
+    In [5]: df['B'].dtype
+    Out[5]: CategoricalDtype(categories=['c', 'd', 'e'], ordered=False)
+
+Behavior from version 0.22.0 onwards:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
+    df
+    df['A'].dtype
+    df['B'].dtype
+
+As with ``astype``, you can control the categories that will be present in each column by passing
+a ``CategoricalDtype``:
+
+.. ipython:: python
+
+    dtype = CategoricalDtype(categories=list('abdef'), ordered=True)
+    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype=dtype)
+    df
+    df['A'].dtype
+    df['B'].dtype
+
 .. _categorical.categoricaldtype:
 
 CategoricalDtype

doc/source/whatsnew/v0.22.0.txt

+41-4
@@ -31,7 +31,8 @@ labels present in all columns, even if a column does not contain all labels:
     df['A'].dtype
     df['B'].dtype
 
-See the :ref:`categorical.objectcreation.frame` section of the documentation for more details and examples.
+See the :ref:`categorical.objectcreation.existingframe` section of the documentation for more details and examples.
+
 
 .. _whatsnew_0220.enhancements.other:

@@ -47,9 +48,45 @@ Other Enhancements
 Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
--
--
--
+.. _whatsnew_0220.api_breaking.frame_constructor_category:
+
+Creating categories from the ``DataFrame`` constructor
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To promote consistency with the :ref:`new behavior of DataFrame.astype with categorical dtype <categorical.objectcreation.frame.warning>`,
+using the ``DataFrame`` constructor with categorical dtype will now by default use all labels present in all columns to set each column's
+categories, even if a column does not contain all labels (:issue:`12860`).
+
+Previous Behavior:
+
+.. code-block:: ipython
+
+    In [2]: df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
+
+    In [3]: df
+    Out[3]:
+       A  B
+    0  a  c
+    1  b  d
+    2  c  e
+
+    In [4]: df['A'].dtype
+    Out[4]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
+
+    In [5]: df['B'].dtype
+    Out[5]: CategoricalDtype(categories=['c', 'd', 'e'], ordered=False)
+
+New Behavior:
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['c', 'd', 'e']}, dtype='category')
+    df
+    df['A'].dtype
+    df['B'].dtype
+
+See the :ref:`categorical.objectcreation.frameconstructor` section of the documentation for more details and examples.
+
 
 .. _whatsnew_0220.api:
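The "all labels present in all columns" rule described in this entry can be reproduced with public pandas API alone. The sketch below is an illustration under assumptions, not the commit's implementation: `union_categories` is a hypothetical helper name, and it relies on `pandas.unique`, `DataFrame.to_numpy` and `CategoricalDtype` as found in recent pandas releases rather than on the private helpers this commit adds.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def union_categories(frame):
    """Hypothetical helper: build one CategoricalDtype covering every column."""
    # Ravel column by column ('F' order) so labels keep their order of
    # appearance down each column, then drop duplicates and missing values.
    flat = frame.to_numpy(dtype=object).ravel(order="F")
    labels = pd.unique(flat)
    labels = labels[pd.notna(labels)]
    return CategoricalDtype(categories=list(labels), ordered=False)


df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["c", "d", "e"]})
shared = union_categories(df)
converted = df.astype(shared)

# Both columns now share one dtype whose categories span the whole frame.
print(converted["A"].dtype == converted["B"].dtype)
print(list(converted["A"].cat.categories))
```

Passing an explicit dtype to `astype` keeps the result independent of whichever default a given pandas version applies to `dtype='category'`.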

pandas/core/algorithms.py

+39-2
@@ -8,8 +8,7 @@
 
 from pandas.core.dtypes.cast import maybe_promote
 from pandas.core.dtypes.generic import (
-    ABCSeries, ABCIndex,
-    ABCIndexClass, ABCCategorical)
+    ABCSeries, ABCDataFrame, ABCIndex, ABCIndexClass, ABCCategorical)
 from pandas.core.dtypes.common import (
     is_unsigned_integer_dtype, is_signed_integer_dtype,
     is_integer_dtype, is_complex_dtype,
@@ -177,6 +176,18 @@ def _ensure_arraylike(values):
     return values
 
 
+def _ensure_arraylike2d(values):
+    """
+    ensure a 2d structure is arraylike if not already
+    """
+    if isinstance(values, ABCDataFrame):
+        values = values.values
+    elif not isinstance(values, np.ndarray):
+        # assuming list-of-list-like structure
+        values = np.asarray([_ensure_arraylike(x) for x in values])
+    return values
+
+
 _hashtables = {
     'float64': (htable.Float64HashTable, htable.Float64Vector),
     'uint64': (htable.UInt64HashTable, htable.UInt64Vector),
@@ -377,6 +388,32 @@ def unique(values):
 unique1d = unique
 
 
+def unique2d(values, order='F'):
+    """
+    Hash table-based unique. Uniques are returned in order
+    of appearance. This does NOT sort, but the order of appearance
+    can be altered via the order parameter.
+
+    Significantly faster than numpy.unique. Includes NA values.
+
+    Parameters
+    ----------
+    values : 2d array-like
+    order : str, default 'F'
+        The order in which to ravel the 2d array, which corresponds to
+        the order of appearance for unique values in the output. Valid
+        values are {'C', 'F', 'A', 'K'}, see ``numpy.ravel`` for more
+        details regarding the order parameter.
+
+    Returns
+    -------
+    ndarray of unique values.
+    """
+    values = _ensure_arraylike2d(values)
+    values = values.ravel(order=order)
+    return unique(values)
+
+
 def isin(comps, values):
     """
     Compute the isin boolean array
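A small stand-in shows how the `order` argument changes the order of appearance. This is a sketch under assumptions: `unique_2d_sketch` is a hypothetical name that mirrors the ravel-then-unique idea using the public `pandas.unique`, and it is not the function added in this hunk.

```python
import numpy as np
import pandas as pd


def unique_2d_sketch(values, order="F"):
    """Hypothetical stand-in: ravel a 2d structure, then take uniques."""
    arr = np.asarray(values, dtype=object)
    return pd.unique(arr.ravel(order=order))


grid = [["a", "c"],
        ["b", "d"],
        ["c", "e"]]

# 'F' walks down each column first; 'C' walks across each row first.
by_column = list(unique_2d_sketch(grid, order="F"))
by_row = list(unique_2d_sketch(grid, order="C"))
print(by_column)
print(by_row)
```

With a `DataFrame` as input, `'F'` order means categories appear in the order a reader would meet them when scanning column by column, which is presumably why it is the default here.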

pandas/core/categorical.py

+18-1
@@ -30,7 +30,7 @@
     is_dict_like)
 from pandas.core.common import is_null_slice, _maybe_box_datetimelike
 
-from pandas.core.algorithms import factorize, take_1d, unique1d
+from pandas.core.algorithms import factorize, take_1d, unique1d, unique2d
 from pandas.core.accessor import PandasDelegate
 from pandas.core.base import (PandasObject,
                               NoNewAttributesMixin, _shared_docs)
@@ -2332,3 +2332,20 @@ def _factorize_from_iterables(iterables):
         # For consistency, it should return a list of 2 lists.
         return [[], []]
     return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
+
+
+def _get_categorical_dtype_2d(values, dtype, **kwargs):
+    # GH 12860
+    # have a CategoricalDtype with set categories already
+    if isinstance(dtype, CategoricalDtype) and dtype.categories is not None:
+        return dtype
+
+    # construct a CategoricalDtype with appropriate categories
+    categories = kwargs.get('categories', None)
+    ordered = kwargs.get('ordered', None) or getattr(dtype, 'ordered', None)
+
+    if categories is None:
+        categories = unique2d(values)
+        categories = categories[notna(categories)]
+
+    return CategoricalDtype(categories, ordered)
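The precedence this helper applies (a dtype with categories first, then keyword arguments, then inference from the data) can be restated against public API. The following is a minimal sketch under assumptions: `resolve_dtype_sketch` is a hypothetical name, it takes `categories` and `ordered` as explicit keywords instead of `**kwargs`, and it coerces `ordered` to a plain `bool`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def resolve_dtype_sketch(values, dtype, categories=None, ordered=None):
    """Hypothetical restatement of the precedence rules using public API."""
    # 1. A CategoricalDtype that already has categories is returned as-is.
    if isinstance(dtype, CategoricalDtype) and dtype.categories is not None:
        return dtype

    # 2. The keyword wins if truthy; otherwise the flag on the dtype is used.
    ordered = bool(ordered or getattr(dtype, "ordered", None))

    # 3. Without explicit categories, infer them from all values, NaN excluded.
    if categories is None:
        flat = np.asarray(values, dtype=object).ravel(order="F")
        categories = pd.unique(flat)
        categories = categories[pd.notna(categories)]

    return CategoricalDtype(categories=list(categories), ordered=ordered)


values = pd.DataFrame({0: ["a", "b", "c", "a"], 1: ["b", np.nan, "d", "d"]})
inferred = resolve_dtype_sketch(values, "category")
fixed = resolve_dtype_sketch(values, CategoricalDtype(list("xyz")))
overridden = resolve_dtype_sketch(
    values, CategoricalDtype(ordered=True), ordered=False)
```

Because the `ordered` lookup uses `or`, an explicit `ordered=False` keyword does not override a dtype created with `ordered=True`; `overridden` above ends up ordered. That follows directly from the expression in the diff and may be worth a test case if it is not the intended behavior.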

pandas/core/frame.py

+4-1
@@ -75,7 +75,7 @@
                                    create_block_manager_from_arrays,
                                    create_block_manager_from_blocks)
 from pandas.core.series import Series
-from pandas.core.categorical import Categorical
+from pandas.core.categorical import Categorical, _get_categorical_dtype_2d
 import pandas.core.algorithms as algorithms
 from pandas.compat import (range, map, zip, lrange, lmap, lzip, StringIO, u,
                            OrderedDict, raise_with_traceback)
@@ -6123,6 +6123,9 @@ def _arrays_to_mgr(arrays, arr_names, index, columns, dtype=None):
     else:
         index = _ensure_index(index)
 
+    if is_categorical_dtype(dtype):
+        dtype = _get_categorical_dtype_2d(arrays, dtype)
+
     # don't force copy because getting jammed in an ndarray anyway
     arrays = _homogenize(arrays, index, dtype)
61286131

pandas/core/generic.py

+3-14
@@ -26,8 +26,7 @@
     is_list_like,
     is_dict_like,
     is_re_compilable,
-    pandas_dtype,
-    CategoricalDtype)
+    pandas_dtype)
 from pandas.core.dtypes.cast import maybe_promote, maybe_upcast_putmask
 from pandas.core.dtypes.missing import isna, notna
 from pandas.core.dtypes.generic import ABCSeries, ABCPanel, ABCDataFrame
@@ -37,6 +36,7 @@
                                 SettingWithCopyWarning)
 
 from pandas.core.base import PandasObject, SelectionMixin
+from pandas.core.categorical import _get_categorical_dtype_2d
 from pandas.core.index import (Index, MultiIndex, _ensure_index,
                                InvalidIndexError)
 import pandas.core.indexing as indexing
@@ -3985,18 +3985,7 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs):
 
         elif is_categorical_dtype(dtype) and self.ndim > 1:
             # GH 12860
-            dtype_with_cat = (isinstance(dtype, CategoricalDtype) and
-                              dtype.categories is not None)
-            if not dtype_with_cat:
-                categories = kwargs.get('categories', None)
-                ordered = (kwargs.get('ordered', None) or
-                           getattr(dtype, 'ordered', None))
-
-                if categories is None:
-                    categories = algos.unique(self.values.ravel(order='F'))
-
-                dtype = CategoricalDtype(categories, ordered)
-
+            dtype = _get_categorical_dtype_2d(self.values, dtype, **kwargs)
             results = (self[col].astype(dtype, copy=copy) for col in self)
             return pd.concat(results, axis=1, copy=False)
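Once a single dtype has been resolved, the remaining two lines of this branch convert each column separately and stitch the results back together. A minimal sketch of that pattern with public API follows; the frame and the `shared` dtype are illustrative assumptions, and the `copy` arguments from the diff are left out.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["c", "d", "e"]})
shared = CategoricalDtype(categories=list("abcde"), ordered=False)

# Convert column by column with one shared dtype, then join the
# resulting Series back together along the column axis.
converted = pd.concat([df[col].astype(shared) for col in df], axis=1)

print(converted.dtypes)
```

Iterating over a `DataFrame` yields its column labels, so `df[col]` is a `Series` and the original column order is preserved in the concatenated result.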

pandas/tests/dtypes/test_dtypes.py

+41-1
@@ -7,8 +7,10 @@
 import numpy as np
 import pandas as pd
 from pandas import (
-    Series, Categorical, CategoricalIndex, IntervalIndex, date_range)
+    Series, DataFrame, Categorical,
+    CategoricalIndex, IntervalIndex, date_range)
 
+from pandas.core.categorical import _get_categorical_dtype_2d
 from pandas.core.dtypes.dtypes import (
     DatetimeTZDtype, PeriodDtype,
     IntervalDtype, CategoricalDtype)
@@ -123,6 +125,44 @@ def test_tuple_categories(self):
         result = CategoricalDtype(categories)
         assert all(result.categories == categories)
 
+    def test_get_categorical_dtype_2d(self):
+        # GH 12860
+        values = DataFrame({0: ['a', 'b', 'c', 'a'],
+                            1: ['b', np.nan, 'd', 'd']})
+        categories = ['a', 'b', 'c', 'd']
+
+        # no keywords
+        expected = CategoricalDtype(categories=categories)
+        result = _get_categorical_dtype_2d(values, 'category')
+        assert is_dtype_equal(result, expected)
+
+        result = _get_categorical_dtype_2d(values, CategoricalDtype())
+        assert is_dtype_equal(result, expected)
+
+        # ordered
+        expected = CategoricalDtype(categories=categories, ordered=True)
+        result = _get_categorical_dtype_2d(values, 'category', ordered=True)
+        assert is_dtype_equal(result, expected)
+
+        result = _get_categorical_dtype_2d(
+            values, CategoricalDtype(ordered=True))
+        assert is_dtype_equal(result, expected)
+
+        # custom categories
+        custom = ['a', 'b', 'c', 'e']
+        expected = CategoricalDtype(categories=custom)
+        result = _get_categorical_dtype_2d(
+            values, 'category', categories=custom)
+        assert is_dtype_equal(result, expected)
+
+        result = _get_categorical_dtype_2d(
+            values, CategoricalDtype(categories=custom))
+        assert is_dtype_equal(result, expected)
+
+        # CategoricalDtype with set categories
+        result = _get_categorical_dtype_2d(values, expected)
+        assert is_dtype_equal(result, expected)
+
 
 class TestDatetimeTZDtype(Base):
