Categorical type #16015

Merged
merged 7 commits into from
Sep 23, 2017
4 changes: 3 additions & 1 deletion doc/source/advanced.rst
@@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup

.. ipython:: python

from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6),
'B': list('aabbca')})
df['B'] = df['B'].astype('category', categories=list('cab'))
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df
df.dtypes
df.B.cat.categories
5 changes: 4 additions & 1 deletion doc/source/api.rst
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
Categorical
~~~~~~~~~~~

If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
.. autoclass:: api.types.CategoricalDtype
:members: categories, ordered
Contributor

yep, we should use the auto* things more readily in other places. Maybe make an issue about this.

Member

I don't think we in general need to do this, as for most functions/methods, we already have the generated pages to link to

Contributor Author

I just did this instead of autosummary since there are a bunch of unrelated methods that are just there for NumPy duck-typing.


If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
data. This accessor is similar to ``Series.dt`` or ``Series.str`` and has the
following usable methods and properties:

103 changes: 95 additions & 8 deletions doc/source/categorical.rst
@@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
df["B"] = raw_cat
df

You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
In the examples above, passing ``dtype='category'`` used the default behavior:

1. categories are inferred from the data
2. categories are unordered

To control those behaviors, instead of passing ``'category'``, use an instance
of :class:`~pandas.api.types.CategoricalDtype`.

.. ipython:: python

s = pd.Series(["a","b","c","a"])
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"],
ordered=True)
s_cat = s.astype(cat_type)
s_cat

Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +143,75 @@ constructor to save the factorize step during normal constructor mode:
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

Contributor

add a ref

.. _categorical.categoricaldtype:

CategoricalDtype
----------------

.. versionchanged:: 0.21.0

Contributor

maybe add these as bullet points

A categorical's type is fully described by

1. ``categories``: a sequence of unique values with no missing values
2. ``ordered``: a boolean

This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
The ``categories`` argument is optional, which implies that the actual categories
should be inferred from whatever is present in the data when the
:class:`pandas.Categorical` is created. The categories are assumed to be unordered
by default.

.. ipython:: python

from pandas.api.types import CategoricalDtype

CategoricalDtype(['a', 'b', 'c'])
CategoricalDtype(['a', 'b', 'c'], ordered=True)
CategoricalDtype()

A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
expects a ``dtype``, for example in :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or the ``Series`` constructor.
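
As a rough illustration of the three entry points just mentioned, the sketch below passes one ``CategoricalDtype`` to the ``Series`` constructor, ``DataFrame.astype`` and ``read_csv``. The column name ``size``, the category labels and the in-memory CSV text are made up for the example and are not part of this PR.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical dtype used only for this illustration.
size_type = CategoricalDtype(categories=["small", "medium", "large"],
                             ordered=True)

# 1. Series constructor
s = pd.Series(["small", "large", "small"], dtype=size_type)

# 2. DataFrame.astype, with a per-column mapping
df = pd.DataFrame({"size": ["medium", "small"]}).astype({"size": size_type})

# 3. read_csv, reading from an in-memory buffer instead of a file
csv_text = "size,count\nsmall,1\nlarge,2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), dtype={"size": size_type})

print(s.dtype)
print(df.dtypes)
print(from_csv["size"].cat.categories)
```

In each case the resulting column should carry the categories and orderedness from ``size_type`` rather than values inferred from the data.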

.. note::

As a convenience, you can use the string ``'category'`` in place of a
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
the categories being unordered, and equal to the set of values present in the
array. In other words, ``dtype='category'`` is equivalent to
``dtype=CategoricalDtype()``.
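
A small sketch of the equivalence stated in the note, assuming a pandas version that includes this change (0.21.0 or later). The sample data is arbitrary.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

data = list("abca")

# The string alias and an argument-free CategoricalDtype should behave alike:
# categories are inferred from the data and are unordered.
s1 = pd.Series(data, dtype="category")
s2 = pd.Series(data, dtype=CategoricalDtype())

print(s1.dtype == s2.dtype)
print(list(s1.cat.categories), s1.cat.ordered)
print(list(s2.cat.categories), s2.cat.ordered)
```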

Equality Semantics
~~~~~~~~~~~~~~~~~~

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and orderedness. When comparing two
unordered categoricals, the order of the ``categories`` is not considered.

.. ipython:: python

c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)

# Equal, since order is not considered when ordered=False
c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)

# Unequal, since the second CategoricalDtype is ordered
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)

All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.

.. ipython:: python

c1 == 'category'

.. warning::

Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
all instances of ``CategoricalDtype`` compare equal to a
``CategoricalDtype(None, False)``, regardless of ``categories`` or
``ordered``.

Description
-----------

@@ -184,7 +263,7 @@ It's also possible to pass in the categories in a specific order:

.. ipython:: python

s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
s

# categories
@@ -297,7 +376,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``

s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
s.sort_values(inplace=True)
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
s = pd.Series(["a","b","c","a"]).astype(
CategoricalDtype(ordered=True)
)
s.sort_values(inplace=True)
s
s.min(), s.max()
@@ -397,9 +478,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

.. ipython:: python

cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
cat = pd.Series([1,2,3]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base = pd.Series([2,2,2]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base2 = pd.Series([2,2,2]).astype(
CategoricalDtype(ordered=True)
)

cat
cat_base
11 changes: 8 additions & 3 deletions doc/source/merging.rst
@@ -830,8 +830,10 @@ The left frame.

.. ipython:: python

from pandas.api.types import CategoricalDtype

X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
X = X.astype('category', categories=['foo', 'bar'])
X = X.astype(CategoricalDtype(categories=['foo', 'bar']))

left = pd.DataFrame({'X': X,
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +844,11 @@ The right frame.

.. ipython:: python

Contributor

same

right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
'Z': [1, 2]})
right = pd.DataFrame({
'X': pd.Series(['foo', 'bar'],
dtype=CategoricalDtype(['foo', 'bar'])),
'Z': [1, 2]
})
right
right.dtypes

27 changes: 27 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -10,6 +10,8 @@ users upgrade to this version.
Highlights include:

- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.

Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

@@ -89,6 +91,31 @@ This does not raise any obvious exceptions, but also does not create a new colum

Setting a list-like data structure into a new attribute now raises a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

.. _whatsnew_0210.enhancements.categorical_dtype:

``CategoricalDtype`` for specifying categoricals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical`` (:issue:`14711`,
:issue:`15078`, :issue:`16015`):

.. ipython:: python

from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c', 'a']) # strings
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
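
The ``.dtype`` behavior described above can be checked with a sketch like the following, assuming pandas 0.21.0 or later. The sample values are arbitrary.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

values = ["a", "b", "a"]

cat = pd.Categorical(values)
idx = pd.CategoricalIndex(values)
ser = pd.Series(values, dtype="category")

# Each container should expose a CategoricalDtype instance via .dtype,
# carrying the ``categories`` and ``ordered`` attributes.
for obj in (cat, idx, ser):
    dtype = obj.dtype
    print(type(obj).__name__, isinstance(dtype, CategoricalDtype),
          list(dtype.categories), dtype.ordered)
```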

See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.

.. _whatsnew_0210.enhancements.other:

Other Enhancements