-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Categorical type #16015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Categorical type #16015
Conversation
.dtype on a Categorical/CI should construct this Categorica.is_dtype_equal should just work |
can u run asv on this to make sure nothing much changing perf wise |
Ha, that's why it felt like so much less code this time, because I forgot to do the other half :) I'll work on this a bit more today, but will probably push to 0.21 |
yeah this feels like it should be in 0.21.0. maybe we also expose this in |
pandas/core/dtypes/dtypes.py
Outdated
@@ -131,6 +171,33 @@ def __eq__(self, other): | |||
|
|||
return isinstance(other, CategoricalDtype) | |||
|
|||
@staticmethod | |||
def _hash_categories(categories): | |||
from pandas.tools.hashing import hash_array, _combine_hash_arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just usehash_pandas_object
. Its an ordered hash of the data.
@@ -356,6 +357,70 @@ def test_not_string(self): | |||
self.assertFalse(is_string_dtype(PeriodDtype('D'))) | |||
|
|||
|
|||
class TestCategoricalDtypeParametrized(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just call this TestCategoricalDtype
(naming consistency)
result = pd.Series(['a', 'b'], dtype=pd.CategoricalDtype(['b', 'a'])) | ||
assert is_categorical_dtype(result) | ||
tm.assert_index_equal(result.cat.categories, pd.Index(['b', 'a'])) | ||
assert result.cat.ordered is False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's more idiomatic: assert ... is False
or assert not ...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are asserting that the result is a boolean (False
) , then the first is more appropriate. If its just not True
then the 2nd is fine (e.g. imagine 0 satisfies 2nd but not 1st). In this case we do want to assert a boolean result.
def test_categories(self): | ||
result = CategoricalDtype(['a', 'b', 'c']) | ||
tm.assert_index_equal(result.categories, pd.Index(['a', 'b', 'c'])) | ||
assert result.ordered is False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as here
* API: expose dtypes in pandas.api.types xref #16015 xref apache/arrow#585 xref #16042 xref #15541 (comment)
5a7c498
to
d10db1a
Compare
* API: expose dtypes in pandas.api.types xref pandas-dev#16015 xref apache/arrow#585 xref pandas-dev#16042 xref pandas-dev#15541 (comment)
d10db1a
to
b731138
Compare
b731138
to
04a77d3
Compare
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on September 23, 2017 at 16:34 Hours UTC |
04a77d3
to
b79ae6e
Compare
pandas/tests/dtypes/test_dtypes.py
Outdated
@@ -618,3 +618,8 @@ def test_categorical_equality(self, ordered, other, expected): | |||
c1 = CategoricalDtype(['a', 'b'], ordered) | |||
result = c1 == other | |||
assert result == expected | |||
|
|||
def test_mixed(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to fail at the moment. hash_object_array
tries to work on the mixed array first, and if that fails it falls back to hash_object_array(a.astype(str))
, which will cause a "collision" with the section set of categoricals. Not sure how I'll handle this yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. I don't like the MultiIndex change (already commened on that PR). instead I would simply just change it (rather than adding a kw); the small api break is fine.
doc/source/categorical.rst
Outdated
CategoricalDtype | ||
---------------- | ||
|
||
.. versionadded:: 0.21.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionchanged; it exists now.
pandas/core/api.py
Outdated
@@ -6,6 +6,7 @@ | |||
|
|||
from pandas.core.algorithms import factorize, unique, value_counts | |||
from pandas.core.dtypes.missing import isna, isnull, notna, notnull |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think exposing this is a good idea here. We don't directly do this with other dtypes. instead the are in pandas.api.types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I could go either way on this. It would be the only type at the top level. On the other hand pd.api.types.CategoricalDtype(values)
is quite long :)
I'll think about it some more, but you're probably right.
dtype = CategoricalDtype() | ||
categories = getattr(dtype, 'categories', None) | ||
ordered = getattr(dtype, 'ordered', False) | ||
dtype = CategoricalDtype(categories=categories, ordered=ordered) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not really necessary if you allow CategoricalDtype(dtype)
IOW, you can check whether the incoming categories
is an instance of a CagetegoricalDtype
and act accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like overloading the categories
like that.
pandas/core/dtypes/dtypes.py
Outdated
Type for categorical data with the categories and orderedness, | ||
but not the values. | ||
|
||
.. versionadded:: 0.20.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionchanged
pandas/core/dtypes/dtypes.py
Outdated
Examples | ||
-------- | ||
>>> t = CategoricalDtype(categories=['b', 'a'], ordered=True) | ||
>>> s = Series(['a', 'a', 'b', 'b', 'a']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't show anything not related to the actual dtype here.
pandas/core/dtypes/dtypes.py
Outdated
|
||
def __new__(cls, categories=None, ordered=False, fastpath=False): | ||
from pandas.core.indexes.base import Index | ||
if categories is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the import inside the function
pandas/core/dtypes/dtypes.py
Outdated
# validation | ||
cls._validate_categories(categories, fastpath=fastpath) | ||
cls._validate_ordered(ordered) | ||
# We have a choice when hashing pandas unordered categoricals |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use blank lines to indicate different thoughts
pandas/core/dtypes/dtypes.py
Outdated
|
||
def __new__(cls, categories=None, ordered=False, fastpath=False): | ||
from pandas.core.indexes.base import Index | ||
if categories is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
categories could be a CategoricalDtype (see my comment above)
d5eb041
to
f069158
Compare
@TomAugspurger just to check: this is ready for review? |
7ef02b4
to
91752fe
Compare
@jorisvandenbossche Agreed about not having a Initially we used |
8364e7c
to
cd5f4e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look good. a few changes.
@@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like | |||
Categorical | |||
~~~~~~~~~~~ | |||
|
|||
If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical | |||
.. autoclass:: api.types.CategoricalDtype | |||
:members: categories, ordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, we should use the auto* things more readily in other places. Maybe make an issue about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we in general need to do this, as for most functions/methods, we already have the generated pages to link to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did this instead of autosummary
since there are a bunch of unrelated methods that are just there for NumPy duck-typing.
doc/source/categorical.rst
Outdated
|
||
A categorical's type is fully described by | ||
|
||
1. its categories: a sequence of unique values and no missing values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe show as
1. ``categories``: ....
2. ``ordered``: .....
as these are the actual kwargs
doc/source/categorical.rst
Outdated
expects a `dtype`. For example :func:`pandas.read_csv`, | ||
:func:`pandas.DataFrame.astype`, or in the Series constructor. | ||
|
||
As a convenience, you can use the string ``'category'`` in place of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe put this in a warning/note as this is the backward compat aspect here. (and then refer to this in the whatsnew)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is back incompat about doing dtype='category'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Jeff was saying that was the only way to do things in the past (backwards compatible). But I don't want to give the impression that we're thinking about deprecating dtype='category'
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you didn't do that actual deprecation (or not yet)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you didn't do that actual deprecation (or not yet)?
I don't think this is desirable, 'category' is a nice convenience
Sorry, I was speaking about astype(dtype='category', categories=[...])
. I suppose I misread Jeff's comment (as it was about backwards compatibility, I thought it was about that part, as this is the only part that might be back incompat, if we deprecate categories=..
(not dtype='category'
)
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -10,6 +10,8 @@ users upgrade to this version. | |||
Highlights include: | |||
|
|||
- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`. | |||
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying | |||
categoricals independent of the data (:issue:`14711`, :issue:`15078`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add the link the sub-section (and then list the issues there instead)
doc/source/whatsnew/v0.21.0.txt
Outdated
expanded to include the ``categories`` and ``ordered`` attributes. A | ||
``CategoricalDtype`` can be used to specify the set of categories and | ||
orderedness of an array, independent of the data themselves. This can be useful, | ||
e.g., when converting string data to a ``Categorical``: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
pandas/core/dtypes/dtypes.py
Outdated
data = self.categories._format_data(name=self.__class__.__name__) | ||
return tpl.format(data, self.ordered) | ||
|
||
def __repr__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
categories = list(categories) # breaks if a np.array of categories | ||
cat_array = hash_tuples(categories) | ||
else: | ||
if categories.dtype == 'O': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
# we want to reuse self.dtype if possible, i.e. neither are | ||
# overridden. | ||
if dtype is not None and (categories is not None or | ||
ordered is not None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
msg?
(pd.Categorical(['a', 'b']), CategoricalDtype(['a', 'b'])), | ||
(pd.CategoricalIndex(['a', 'b']).dtype, CategoricalDtype(['a', 'b'])), | ||
(pd.CategoricalIndex(['a', 'b']), CategoricalDtype(['a', 'b'])), | ||
(CategoricalDtype(), CategoricalDtype()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave original pd.Categorical(['a', 'b']).dtype, CategoricalDtype())
which should still work
assert result == expected | ||
|
||
def test_invalid_raises(self): | ||
with tm.assert_raises_regex(TypeError, 'ordered'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might add a test for passing just a string for the first argument (should fail)
e.g.
CategoricalDtype('category')
(see my comment about a from_string
above, which we should add for compat.
Correct, and I don't plan to deprecate `dtype'category'`.
…On Mon, Sep 18, 2017 at 5:55 AM, Joris Van den Bossche < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In doc/source/categorical.rst
<#16015 (comment)>:
> +should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created.
+
+.. ipython:: python
+
+ from pandas.api.types import CategoricalDtype
+
+ CategoricalDtype(['a', 'b', 'c'])
+ CategoricalDtype(['a', 'b', 'c'], ordered=True)
+ CategoricalDtype()
+
+A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
+expects a `dtype`. For example :func:`pandas.read_csv`,
+:func:`pandas.DataFrame.astype`, or in the Series constructor.
+
+As a convenience, you can use the string ``'category'`` in place of a
Yes, but you didn't do that actual deprecation (or not yet)?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16015 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHItygdUADe7vz66Heen2kIDzUzdl9ks5sjkwVgaJpZM4M-hoW>
.
|
I don't think this is desirable, 'category' is a nice convenience |
We extended the CategoricalDtype to accept optional categories and ordered argument. ```python pd.CategoricalDtype(categories=['a', 'b'], ordered=True ``` CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None, the default, is to infer the categories from whatever is present. This change will make it easy to implement support for specifying categories that are know ahead of time in other places e.g. .astype, .read_csv, and the Series constructor. Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676
Get a valid instance of `CategoricalDtype` as early as possible, and use that throughout.
a9e0538
to
416d1d7
Compare
416d1d7 has the changes to
I think all three are useful and (hopefully) well-tested now. I'll update the docstring to indicate the priority (basically moving that comment to the docstring) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a couple of comments, which can be put in a followup. let's merge on green.
doc/source/categorical.rst
Outdated
|
||
Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``, | ||
and since all instances ``CategoricalDtype`` compare equal to ``'`category'``, | ||
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you should say CategoricalDtype(None, False)
at the end.
categories = list(categories) # breaks if a np.array of categories | ||
cat_array = hash_tuples(categories) | ||
else: | ||
if categories.dtype == 'O': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neh hash collisions basically don't happen in the space of things we are doing (normally). however you can collide by construction, e.g. code doing something wrong..
In any event we should move this code (can be a followup, maybe make an issue for things to do)
return hash(self) == hash(other) | ||
|
||
def __unicode__(self): | ||
tpl = u'CategoricalDtype(categories={}ordered={})' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have this hacky thing in Categorical to repr only 10 categories. CategoricalIndex actually correctly uses the option max_categories
, so should do that here (again can do as a followup)
tm.assert_index_equal(result.categories, expected) | ||
assert result.ordered is False | ||
|
||
def test_constructor_tuples_datetimes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is janky. I don't think we should actually allow it. Where did this example come from? So we allow tuples as categories, I guess so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was extracted from a groupby test that was failing when I improperly handled tuples of level-1. Agreed that it's weird, but I'm inclined to support it.
doc/source/whatsnew/v0.21.0.txt
Outdated
expanded to include the ``categories`` and ``ordered`` attributes. A | ||
``CategoricalDtype`` can be used to specify the set of categories and | ||
orderedness of an array, independent of the data themselves. This can be useful, | ||
e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is 1 more issue you listed in the top of the PR
Thanks. I fixed up the doc comments. I'll merge on green and do the other
followups later today.
…On Fri, Sep 22, 2017 at 7:31 PM, Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In doc/source/whatsnew/v0.21.0.txt
<#16015 (comment)>:
> @@ -89,6 +91,30 @@ This does not raise any obvious exceptions, but also does not create a new colum
Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.
+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
+expanded to include the ``categories`` and ``ordered`` attributes. A
+``CategoricalDtype`` can be used to specify the set of categories and
+orderedness of an array, independent of the data themselves. This can be useful,
+e.g., when converting string data to a ``Categorical`` (:issue:`14711`, :issue:`15078`):
there is 1 more issue you listed in the top of the PR
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16015 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIlwjFqIXM4vMTgsVZ20tYOzp9CQSks5slFFogaJpZM4M-hoW>
.
|
nice patch @TomAugspurger ! |
Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676
Closes pandas-dev#14711 Closes pandas-dev#15078 Closes pandas-dev#14676
We extended the CategoricalDtype to accept optional
categories
andordered
argument. CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with
categories=None
, thedefault, is to infer the categories from whatever is present.
This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g.
.astype
,.read_csv
, and theSeries
constructor.Closes #14711
Closes #15078
Closes #14676
Dunno if we would want to do this for 0.20 or not. It ended up not being too big a change, so maybe.