Get dummies #4458

Merged
merged 2 commits into from
Aug 26, 2013
7 changes: 7 additions & 0 deletions doc/source/api.rst
@@ -126,6 +126,13 @@ Data manipulations
    merge
    concat
 
+.. currentmodule:: pandas.core.reshape
+
+.. autosummary::
+   :toctree: generated/
+
+   get_dummies
+
 Top-level missing data
 ~~~~~~~~~~~~~~~~~~~~~~
 
1 change: 1 addition & 0 deletions doc/source/release.rst
@@ -44,6 +44,7 @@ pandas 0.13
     ``ValueError`` (:issue:`4303`, :issue:`4305`)
   - ``read_excel`` now supports an integer in its ``sheetname`` argument giving
     the index of the sheet to read in (:issue:`4301`).
+  - ``get_dummies`` works with NaN (:issue:`4446`)
   - Added a test for ``read_clipboard()`` and ``to_clipboard()`` (:issue:`4282`)
   - Text parser now treats anything that reads like inf ("inf", "Inf", "-Inf",
     "iNf", etc.) to infinity. (:issue:`4220`, :issue:`4219`), affecting
11 changes: 11 additions & 0 deletions doc/source/v0.13.0.txt
@@ -129,6 +129,17 @@ Enhancements
   - Added a more informative error message when plot arguments contain
     overlapping color and style arguments (:issue:`4402`)
 
+  - NaN handling in get_dummies (:issue:`4446`) with `dummy_na`
+
+     .. ipython:: python
+        # previously, nan was erroneously counted as 2 here
+        # now it is not counted at all
+        get_dummies([1, 2, np.nan])
+
+        # unless requested
+        get_dummies([1, 2, np.nan], dummy_na=True)
+
+
   - ``timedelta64[ns]`` operations
 
     - A Series of dtype ``timedelta64[ns]`` can now be divided by another
60 changes: 54 additions & 6 deletions pandas/core/reshape.py
@@ -18,7 +18,7 @@
 import pandas.core.common as com
 import pandas.algos as algos
 
-from pandas.core.index import MultiIndex
+from pandas.core.index import Index, MultiIndex
 
 
 class ReshapeError(Exception):
@@ -805,7 +805,7 @@ def convert_dummies(data, cat_variables, prefix_sep='_'):
     return result
 
 
-def get_dummies(data, prefix=None, prefix_sep='_'):
+def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False):
     """
     Convert categorical variable into dummy/indicator variables
 
@@ -816,19 +816,67 @@ def get_dummies(data, prefix=None, prefix_sep='_'):
         String to append DataFrame column names
     prefix_sep : string, default '_'
         If appending prefix, separator/delimiter to use
+    dummy_na : bool, default False
+        Add a column to indicate NaNs, if False NaNs are ignored.
 
     Returns
     -------
     dummies : DataFrame
 
+    Examples
+    --------
+    >>> s = pd.Series(list('abca'))
+
+    >>> get_dummies(s)
+       a  b  c
+    0  1  0  0
+    1  0  1  0
+    2  0  0  1
+    3  1  0  0
+
+    >>> s1 = ['a', 'b', np.nan]
+
+    >>> get_dummies(s1)
+       a  b
+    0  1  0
+    1  0  1
+    2  0  0
+
+    >>> get_dummies(s1, dummy_na=True)
+       a  b  NaN
+    0  1  0    0
+    1  0  1    0
+    2  0  0    1
+
     """
-    cat = Categorical.from_array(np.asarray(data))
-    dummy_mat = np.eye(len(cat.levels)).take(cat.labels, axis=0)
+    cat = Categorical.from_array(Series(data))  # Series avoids inconsistent NaN handling
+    levels = cat.levels
+
+    # if all NaN
+    if not dummy_na and len(levels) == 0:
+        if isinstance(data, Series):
+            index = data.index
+        else:
+            index = np.arange(len(data))
+        return DataFrame(index=index)
+
+    number_of_cols = len(levels)
+    if dummy_na:
+        number_of_cols += 1
+
+    dummy_mat = np.eye(number_of_cols).take(cat.labels, axis=0)
+
+    if dummy_na:
+        levels = np.append(cat.levels, np.nan)
+    else:
+        # reset NaN GH4446
+        dummy_mat[cat.labels == -1] = 0
+
     if prefix is not None:
         dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v))
-                      for v in cat.levels]
+                      for v in levels]
     else:
-        dummy_cols = cat.levels
+        dummy_cols = levels
 
     if isinstance(data, Series):
         index = data.index
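To make the label handling in the `get_dummies` hunk easier to follow, here is a minimal pure-Python sketch of the same idea. It is not the pandas implementation: the helper `dummy_rows` and its return shape are hypothetical, and it assumes NaN is coded as -1, as `cat.labels == -1` in the diff suggests. Index -1 selects the last row of the identity matrix, which is a plausible reading of why NaN used to be counted under the last level, and of why appending the NaN column last makes `dummy_na=True` need no extra bookkeeping.

```python
import math


def _is_nan(value):
    """True only for float NaN; strings and ints are never NaN."""
    return isinstance(value, float) and math.isnan(value)


def dummy_rows(values, dummy_na=False):
    """Hypothetical helper: return (columns, rows) of 0/1 indicators."""
    # Sorted distinct non-NaN values stand in for cat.levels.
    levels = sorted({v for v in values if not _is_nan(v)})
    position = {level: i for i, level in enumerate(levels)}
    # One integer code per value; -1 marks NaN (assumed, as in the diff).
    labels = [-1 if _is_nan(v) else position[v] for v in values]

    columns = list(levels)
    if dummy_na:
        columns.append(float("nan"))  # the NaN column goes last

    n = len(columns)
    if n == 0:
        # All-NaN input without dummy_na: no columns, one empty row per value.
        return columns, [[] for _ in values]

    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Index -1 picks the LAST identity row. With dummy_na that row is the
    # NaN column; without it, that row belongs to the last real level.
    rows = [list(identity[label]) for label in labels]

    if not dummy_na:
        # Reset NaN rows to all zeros, mirroring the "reset NaN" step.
        rows = [[0.0] * n if label == -1 else row
                for label, row in zip(labels, rows)]
    return columns, rows


nan = float("nan")
print(dummy_rows([1, 2, nan]))
print(dummy_rows([1, 2, nan], dummy_na=True))
```

Removing the reset step in the sketch reproduces the old symptom from the release note: the NaN row would carry a 1 in the column for 2.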
60 changes: 58 additions & 2 deletions pandas/tests/test_reshape.py
@@ -7,13 +7,15 @@
 
 import nose
 
-from pandas import DataFrame
+from pandas import DataFrame, Series
 import pandas as pd
 
 from numpy import nan
 import numpy as np
 
-from pandas.core.reshape import melt, convert_dummies, lreshape
+from pandas.util.testing import assert_frame_equal
+
+from pandas.core.reshape import melt, convert_dummies, lreshape, get_dummies
 import pandas.util.testing as tm
 from pandas.compat import StringIO, cPickle, range
 
@@ -145,6 +147,60 @@ def test_multiindex(self):
         self.assertEqual(res.columns.tolist(), ['CAP', 'low', 'value'])
 
 
+class TestGetDummies(unittest.TestCase):
+    def test_basic(self):
+        s_list = list('abc')
+        s_series = Series(s_list)
+        s_series_index = Series(s_list, list('ABC'))
+
+        expected = DataFrame({'a': {0: 1.0, 1: 0.0, 2: 0.0},
+                              'b': {0: 0.0, 1: 1.0, 2: 0.0},
+                              'c': {0: 0.0, 1: 0.0, 2: 1.0}})
+        assert_frame_equal(get_dummies(s_list), expected)
+        assert_frame_equal(get_dummies(s_series), expected)
+
+        expected.index = list('ABC')
+        assert_frame_equal(get_dummies(s_series_index), expected)
+
+    def test_just_na(self):
+        just_na_list = [np.nan]
+        just_na_series = Series(just_na_list)
+        just_na_series_index = Series(just_na_list, index = ['A'])
+
+        res_list = get_dummies(just_na_list)
+        res_series = get_dummies(just_na_series)
+        res_series_index = get_dummies(just_na_series_index)
+
+        self.assertEqual(res_list.empty, True)
+        self.assertEqual(res_series.empty, True)
+        self.assertEqual(res_series_index.empty, True)
+
+        self.assertEqual(res_list.index.tolist(), [0])
+        self.assertEqual(res_series.index.tolist(), [0])
+        self.assertEqual(res_series_index.index.tolist(), ['A'])
+
+    def test_include_na(self):
+        s = ['a', 'b', np.nan]
+        res = get_dummies(s)
+        exp = DataFrame({'a': {0: 1.0, 1: 0.0, 2: 0.0},
+                         'b': {0: 0.0, 1: 1.0, 2: 0.0}})
+        assert_frame_equal(res, exp)
+
+        res_na = get_dummies(s, dummy_na=True)
+        exp_na = DataFrame({nan: {0: 0.0, 1: 0.0, 2: 1.0},
+                            'a': {0: 1.0, 1: 0.0, 2: 0.0},
+                            'b': {0: 0.0, 1: 1.0, 2: 0.0}}).iloc[:, [1, 2, 0]]
Contributor Author: ha! I was just looking at that test before I saw it failed and thinking "hmmm does that work in python 3" - doh!

Contributor Author: obviously should be using exp_na.reindex_axis(['a', 'b', np.nan], 1)

Contributor Author: pushed fix to master
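The comments above do not say exactly what failed under Python 3, so the following is only a hedged illustration of why selecting columns by position can be fragile when labels mix `str` and NaN, and why selecting by label (as the suggested `reindex_axis` call does) avoids the question. It uses plain Python only; `select_by_label` is a hypothetical helper, not a pandas function.

```python
nan = float("nan")

# Column data keyed by label, mirroring the dict passed to DataFrame in the test.
columns = {
    nan: [0.0, 0.0, 1.0],
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
}

# Python 3 refuses to order str against float, so any code path that sorts
# these keys raises there (Python 2 allowed such comparisons).
try:
    sorted(columns)
    sort_failed = False
except TypeError:
    sort_failed = True


def select_by_label(table, labels):
    """Hypothetical helper: return (label, data) pairs in the requested order."""
    return [(label, table[label]) for label in labels]


# Asking for labels explicitly does not depend on how the keys were ordered.
# The lookup of nan works here because it is the same object used as the key.
ordered = select_by_label(columns, ["a", "b", nan])
print(sort_failed, [label for label, _ in ordered])
```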

+        # hack (NaN handling in assert_index_equal)
+        exp_na.columns = res_na.columns
+        assert_frame_equal(res_na, exp_na)
+
+        res_just_na = get_dummies([nan], dummy_na=True)
+        exp_just_na = DataFrame({nan: {0: 1.0}})
+        # hack (NaN handling in assert_index_equal)
+        exp_just_na.columns = res_just_na.columns
+        assert_frame_equal(res_just_na, exp_just_na)
Contributor Author: @jreback the weird assert_frame_equal bug is here (if you remove the hack, this fails, and I can't repro it outside of this)

Contributor: ahh...i see, nan in indices is very odd (but somewhat supported), prob assert_frame_equal just does .equals on the indices which I think fails when it has nan...let me look

Contributor: hmm..that's not it...let me look further

Contributor Author: I don't know if you read my comment above: #4458 (comment) (I blame numpy)

Contributor: @hayd I actually think this is a more general issue; your hack ok for now....

Contributor: funny thing is I cannot repro this, e.g. Index(['a','b',np.nan]).equals(Index(['a','b',np.nan])) is True!

while in your example, the same is False!

Contributor Author: I know! It's really weird... it's the np.testing.assert_array_equal which is failing (and it's supposed to ignore nan!). The good thing is, with get_dummies in master we can now repro this. :)
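The thread does not pin down the cause of the inconsistent comparison, so the sketch below is not a diagnosis. It only shows, in plain Python, why comparisons involving NaN can give different answers for inputs that look identical: `==` on NaN is always false, while container comparison checks object identity first. The helper `labels_equal` is hypothetical and simply treats NaN as matching NaN at the same position.

```python
import math

nan = float("nan")

# NaN never compares equal to itself with ==.
nan_equals_itself = (nan == nan)

# Python containers check identity before ==, so the answer depends on
# whether both sides hold the same NaN object.
shared = ["a", "b", nan] == ["a", "b", nan]
distinct = ["a", "b", float("nan")] == ["a", "b", float("nan")]


def labels_equal(left, right):
    """Hypothetical NaN-aware comparison: NaN matches NaN by position."""
    def same(x, y):
        both_nan = (isinstance(x, float) and isinstance(y, float)
                    and math.isnan(x) and math.isnan(y))
        return both_nan or x == y
    return len(left) == len(right) and all(
        same(x, y) for x, y in zip(left, right))


print(nan_equals_itself, shared, distinct)
print(labels_equal(["a", "b", float("nan")], ["a", "b", float("nan")]))
```

A comparison written this way gives the same answer regardless of where each NaN object came from, which is the property the column-assignment hack in the test works around.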



 class TestConvertDummies(unittest.TestCase):
     def test_convert_dummies(self):
         df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',