Categorical type #16015
Changes from all commits
790cd42
ed5c814
416d1d7
e6c05a0
41172ce
141e509
43f90cc
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to | |
df["B"] = raw_cat | ||
df | ||
|
||
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``: | ||
In the examples above where we passed ``dtype='category'``, we used the default behavior: | ||
|
||
1. categories are inferred from the data | ||
2. categories are unordered. | ||
|
||
To control those behaviors, instead of passing ``'category'``, use an instance | ||
of :class:`~pandas.api.types.CategoricalDtype`. | ||
|
||
.. ipython:: python | ||
|
||
s = pd.Series(["a","b","c","a"]) | ||
s_cat = s.astype("category", categories=["b","c","d"], ordered=False) | ||
from pandas.api.types import CategoricalDtype | ||
|
||
s = pd.Series(["a", "b", "c", "a"]) | ||
cat_type = CategoricalDtype(categories=["b", "c", "d"], | ||
ordered=True) | ||
s_cat = s.astype(cat_type) | ||
s_cat | ||
|
||
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`: | ||
|
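A minimal sketch of the ``astype`` change in this hunk, assuming pandas 0.21 or later (where ``CategoricalDtype`` can be imported from ``pandas.api.types``). It shows one consequence the diff does not spell out: values that are not among the supplied categories become missing. The helper names ``missing_mask``, ``categories`` and ``is_ordered`` are only for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])

# Explicit categories and ordering; note that "a" is not a listed category.
cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
s_cat = s.astype(cat_type)

# Values outside the supplied categories are converted to NaN.
missing_mask = s_cat.isna().tolist()

# The categories and orderedness come from the dtype, not from the data.
categories = list(s_cat.cat.categories)
is_ordered = s_cat.cat.ordered

print(s_cat)
```

The unused category ``"d"`` is kept in the dtype even though no value in the data uses it.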
@@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode: | |
splitter = np.random.choice([0,1], 5, p=[0.5,0.5]) | ||
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"])) | ||
|
||
Review comment: add a ref |
||
.. _categorical.categoricaldtype: | ||
|
||
CategoricalDtype | ||
---------------- | ||
|
||
.. versionchanged:: 0.21.0 | ||
|
||
Review comment: maybe add these as bullet points |
||
A categorical's type is fully described by | ||
|
||
1. ``categories``: a sequence of unique values with no missing values | ||
2. ``ordered``: a boolean | ||
|
||
This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`. | ||
The ``categories`` argument is optional, which implies that the actual categories | ||
should be inferred from whatever is present in the data when the | ||
:class:`pandas.Categorical` is created. The categories are assumed to be unordered | ||
by default. | ||
|
||
.. ipython:: python | ||
|
||
from pandas.api.types import CategoricalDtype | ||
|
||
CategoricalDtype(['a', 'b', 'c']) | ||
CategoricalDtype(['a', 'b', 'c'], ordered=True) | ||
CategoricalDtype() | ||
|
||
A :class:`~pandas.api.types.CategoricalDtype` can be used anywhere pandas | ||
expects a ``dtype``, for example in :func:`pandas.read_csv`, | ||
:meth:`pandas.DataFrame.astype`, or the ``Series`` constructor. | ||
|
||
.. note:: | ||
|
||
As a convenience, you can use the string ``'category'`` in place of a | ||
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of | ||
the categories being unordered, and equal to the set of values present in the | ||
array. In other words, ``dtype='category'`` is equivalent to | ||
``dtype=CategoricalDtype()``. | ||
|
||
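A short sketch of the three entry points named above (the ``Series`` constructor, ``read_csv`` and ``DataFrame.astype``), assuming pandas 0.21 or later. The ``size_type`` dtype, the column name ``size`` and the CSV text are made up for illustration and do not come from the pull request.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical dtype used only for this sketch.
size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)

# 1. Series constructor
s = pd.Series(["small", "large", "small"], dtype=size_type)

# 2. read_csv, using a per-column dtype mapping
csv_text = "size\nmedium\nsmall\nlarge\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={"size": size_type})

# 3. DataFrame.astype, again with a per-column mapping
df2 = pd.DataFrame({"size": ["large", "medium"]}).astype({"size": size_type})

print(s.dtype, df["size"].dtype, df2["size"].dtype)
```

In all three cases the resulting column carries the same dtype object, so the category order is defined once rather than inferred separately from each data source.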
Equality Semantics | ||
~~~~~~~~~~~~~~~~~~ | ||
|
||
Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal | ||
whenever they have the same categories and orderedness. When comparing two | ||
unordered categoricals, the order of the ``categories`` is not considered. | ||
|
||
.. ipython:: python | ||
|
||
c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False) | ||
|
||
# Equal, since order is not considered when ordered=False | ||
c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False) | ||
|
||
# Unequal, since the second CategoricalDtype is ordered | ||
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True) | ||
|
||
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``. | ||
|
||
.. ipython:: python | ||
|
||
c1 == 'category' | ||
|
||
.. warning:: | ||
|
||
Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``, | ||
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``, | ||
all instances of ``CategoricalDtype`` compare equal to a | ||
``CategoricalDtype(None, False)``, regardless of ``categories`` or | ||
``ordered``. | ||
|
||
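The examples in this hunk cover unordered dtypes. As a complementary sketch, assuming pandas 0.21 or later, the following checks the ordered case, where the order of ``categories`` does matter, and shows the string comparison used as a generic "is this categorical?" test on a ``Series``. The variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered dtypes with the same categories in a different order are not equal.
o1 = CategoricalDtype(["a", "b", "c"], ordered=True)
o2 = CategoricalDtype(["c", "b", "a"], ordered=True)
reordered_ordered_equal = o1 == o2

# The same permutation is ignored when both dtypes are unordered.
u1 = CategoricalDtype(["a", "b", "c"], ordered=False)
u2 = CategoricalDtype(["c", "b", "a"], ordered=False)
reordered_unordered_equal = u1 == u2

# Comparing against the string 'category' ignores categories and ordering,
# which makes it a convenient check for "any categorical dtype".
s = pd.Series(["a", "b"]).astype(o1)
is_categorical = s.dtype == "category"

print(reordered_ordered_equal, reordered_unordered_equal, is_categorical)
```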
Description | ||
----------- | ||
|
||
|
@@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order: | |
|
||
.. ipython:: python | ||
|
||
s = pd.Series(list('babc')).astype('category', categories=list('abcd')) | ||
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd'))) | ||
s | ||
|
||
# categories | ||
|
@@ -297,7 +376,9 @@ meaning and certain operations are possible. If the categorical is unordered, `` | |
|
||
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False)) | ||
s.sort_values(inplace=True) | ||
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True) | ||
s = pd.Series(["a","b","c","a"]).astype( | ||
CategoricalDtype(ordered=True) | ||
) | ||
s.sort_values(inplace=True) | ||
s | ||
s.min(), s.max() | ||
|
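A small sketch of why the ordering carried by the dtype matters for sorting and for ``min``/``max``, assuming pandas 0.21 or later. The reversed category order is chosen here only to make the effect visible; it is not part of the diff.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories declared in reverse lexical order: c < b < a.
reversed_type = CategoricalDtype(["c", "b", "a"], ordered=True)
s = pd.Series(["a", "b", "c", "a"]).astype(reversed_type)

# Sorting and min/max follow the category order, not the lexical order.
sorted_values = s.sort_values().tolist()
smallest, largest = s.min(), s.max()

print(sorted_values, smallest, largest)
```

With the default inferred categories, as in ``CategoricalDtype(ordered=True)`` above, the category order is the sorted order of the data, so the result would match an ordinary lexical sort.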
@@ -397,9 +478,15 @@ categories or a categorical with any list-like object, will raise a TypeError. | |
|
||
.. ipython:: python | ||
|
||
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True) | ||
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True) | ||
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True) | ||
cat = pd.Series([1,2,3]).astype( | ||
CategoricalDtype([3, 2, 1], ordered=True) | ||
) | ||
cat_base = pd.Series([2,2,2]).astype( | ||
CategoricalDtype([3, 2, 1], ordered=True) | ||
) | ||
cat_base2 = pd.Series([2,2,2]).astype( | ||
CategoricalDtype(ordered=True) | ||
) | ||
|
||
cat | ||
cat_base | ||
|
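A hedged sketch of how the three series defined in this hunk behave under comparison, assuming pandas 0.21 or later. Comparisons use the category order (here 3 < 2 < 1), and comparing categoricals whose categories differ raises ``TypeError``, as the hunk header states. The names ``greater`` and ``mismatch_raises`` are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

reversed_type = CategoricalDtype([3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(reversed_type)
cat_base = pd.Series([2, 2, 2]).astype(reversed_type)

# No categories given, so they are inferred from the data: just [2].
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Element-wise comparison follows the category order 3 < 2 < 1,
# so only the value 1 is "greater than" 2.
greater = (cat > cat_base).tolist()

# Same values, but different categories: the comparison is rejected.
try:
    cat > cat_base2
    mismatch_raises = False
except TypeError:
    mismatch_raises = True

print(greater, mismatch_raises)
```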
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -830,8 +830,10 @@ The left frame. | |
|
||
.. ipython:: python | ||
|
||
from pandas.api.types import CategoricalDtype | ||
|
||
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,))) | ||
X = X.astype('category', categories=['foo', 'bar']) | ||
X = X.astype(CategoricalDtype(categories=['foo', 'bar'])) | ||
|
||
left = pd.DataFrame({'X': X, | ||
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))}) | ||
|
@@ -842,8 +844,11 @@ The right frame. | |
|
||
.. ipython:: python | ||
|
||
Review comment: same |
||
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']), | ||
'Z': [1, 2]}) | ||
right = pd.DataFrame({ | ||
'X': pd.Series(['foo', 'bar'], | ||
dtype=CategoricalDtype(['foo', 'bar'])), | ||
'Z': [1, 2] | ||
}) | ||
right | ||
right.dtypes | ||
|
||
|
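A minimal sketch of what the left and right frames are used for, assuming pandas 0.21 or later and assuming that a merge on a key column preserves the ``category`` dtype when both sides share the same ``CategoricalDtype``. Fixed values replace the random data from the diff so that the result is reproducible; ``key_type`` is a name introduced for this sketch.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One dtype object shared by both key columns.
key_type = CategoricalDtype(categories=["foo", "bar"])

left = pd.DataFrame({
    "X": pd.Series(["foo", "bar", "foo"], dtype=key_type),
    "Y": ["one", "two", "three"],
})
right = pd.DataFrame({
    "X": pd.Series(["foo", "bar"], dtype=key_type),
    "Z": [1, 2],
})

# Merge on the categorical key and check the dtype of the result.
result = pd.merge(left, right, on="X")
key_dtype_preserved = result["X"].dtype == key_type

print(result.dtypes)
```

Building both key columns from one dtype object avoids accidental mismatches between the two frames, which is the motivation for replacing the repeated ``categories=`` keyword arguments in this hunk.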
Review comment: yep, we should use the auto* things more readily in other places. Maybe make an issue about this.
Review comment: I don't think we need to do this in general, since for most functions/methods we already have the generated pages to link to.
Review comment: I just did this instead of `autosummary` since there are a bunch of unrelated methods that are just there for NumPy duck-typing.