Get dummies #4458

hayd · 2013-08-04T17:32:01Z

Added a new dummy_na argument (thoughts?). It's slightly different from a possible dropna argument, which I haven't included, since that can already be achieved using pd.get_dummies(s.dropna()).

Example:

In [3]: s = ['a', 'b', np.nan]

In [4]: pd.get_dummies(s)
Out[4]:
   a  b
0  1  0
1  0  1
2  0  0

In [5]: pd.get_dummies(s, dummy_na=True)
Out[5]:
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1

In [6]: pd.get_dummies(pd.Series(s).dropna())  # different
Out[6]:
   a  b
0  1  0
1  0  1

Note: atm there is a (strange) test failure with the above example; not quite sure what's going on:

res_na = get_dummies(s, dummy_na=True)
exp_na = DataFrame({nan: {0: 0.0, 1: 0.0, 2: 1.0},
                    'a': {0: 1.0, 1: 0.0, 2: 0.0},
                    'b': {0: 0.0, 1: 1.0, 2: 0.0}}).iloc[:, [1, 2, 0]]  # need to reorder cols
assert_frame_equal(res_na, exp_na)

-> assert(left.columns.equals(right.columns))
(Pdb) left
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1
(Pdb) right
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1

hayd · 2013-08-05T12:51:21Z

I think this comes from:

(Pdb) np.testing.assert_array_equal(left.columns.values, right.columns.values)
*** AssertionError:
Arrays are not equal

(mismatch 33.3333333333%)
 x: array(['a', 'b', nan], dtype=object)
 y: array(['a', 'b', nan], dtype=object)

Which is confusing, as assert_array_equal is supposed to treat NaNs as equal...

NaNs are compared like numbers, no assertion is raised if both objects have NaNs in the same positions.

(Pdb) np.isnan(left.columns.values[2]), np.isnan(right.columns.values[2])
(True, True)

(Pdb) left.columns[:2].equals(right.columns[:2])
True
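The confusion above comes down to NaN never comparing equal to itself, so any label comparison has to special-case it. The following is only a sketch of a NaN-aware comparison. labels_equal is a hypothetical helper written for illustration; it is not part of pandas or numpy, and it is not what assert_frame_equal does internally.

```python
import numpy as np
import pandas as pd


def labels_equal(left, right):
    """Hypothetical helper: compare two label sequences, treating NaN as equal to NaN."""
    left = np.asarray(left, dtype=object)
    right = np.asarray(right, dtype=object)
    if left.shape != right.shape:
        return False
    # NaNs must sit in the same positions on both sides.
    left_na = pd.isna(left)
    right_na = pd.isna(right)
    if not (left_na == right_na).all():
        return False
    # The remaining (non-NaN) labels are compared with ordinary equality.
    return bool((left[~left_na] == right[~right_na]).all())


# Plain equality says NaN differs from itself...
nan_is_equal = np.nan == np.nan
# ...while the NaN-aware helper accepts matching positions.
same = labels_equal(['a', 'b', np.nan], ['a', 'b', np.nan])
different = labels_equal(['a', 'b', np.nan], ['a', 'c', np.nan])
```

With these inputs, nan_is_equal is False, same is True and different is False.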

hayd · 2013-08-25T23:34:40Z

the failing build is here: https://travis-ci.org/hayd/pandas/builds/9834728

Worked around it by just setting the columns manually (which is kinda annoying; should create a separate issue).

hayd · 2013-08-25T23:42:13Z

The other option is to have a more consistent dropna argument which defaults to True. However, it's not really dropping the NaN row, it's just zeroing it. That's my reasoning behind using dummy_na...
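To make the "zeroing, not dropping" distinction concrete, here is a small sketch comparing the three behaviours. It assumes a pandas version in which get_dummies accepts a dtype argument (added well after this PR); dtype=int is used only so the output is 0/1 as in the example at the top of the thread.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])

# Default: the NaN row is kept, but encoded as all zeros.
default = pd.get_dummies(s, dtype=int)

# dummy_na=True: the NaN row is kept and flagged in its own NaN column.
with_na = pd.get_dummies(s, dummy_na=True, dtype=int)

# Dropping first: the NaN row is removed from the result entirely.
dropped = pd.get_dummies(s.dropna(), dtype=int)

print(default.shape, with_na.shape, dropped.shape)
```

The first two results keep all three rows; only the dropna variant loses the row, which is why a dropna argument would be a misleading name for the default behaviour.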

hayd · 2013-08-26T22:19:27Z

@jreback speaking of assert_frame_equal did you see this weird bug?

Also, can we merge this?

jreback · 2013-08-26T22:28:04Z

@hayd what's the assert_frame_equal bug?

I would add an example of get_dummies in v0.13.0.txt

jreback · 2013-08-26T22:28:10Z

otherwise looks ok

hayd · 2013-08-26T22:33:41Z

Do you mean in reshaping.rst? (It's not a new function btw, just untested (!) and undocumented... it is in Wes' book :) )

jreback · 2013-08-26T22:37:38Z

but didn't you add the NA handling? (if yes, then it needs an example in whatsnew)

(yes maybe an example/better example in reshape too!)

hayd · 2013-08-26T22:39:32Z

ah, I put it in the release.rst (should it also be somewhere else too?)

hayd · 2013-08-26T22:41:11Z

pandas/tests/test_reshape.py

+        exp_just_na = DataFrame({nan: {0: 1.0}})
+        # hack (NaN handling in assert_index_equal)
+        exp_just_na.columns = res_just_na.columns
+        assert_frame_equal(res_just_na, exp_just_na)


@jreback the weird assert_frame_equal bug is here (if you remove the hack, this fails, and I can't repro it outside of this)

ahh...I see, nan in indices is very odd (but somewhat supported); prob assert_frame_equal just does .equals on the indices, which I think fails when it has nan...let me look

hmm..that's not it...let me look further

I don't know if you read my comment above: #4458 (comment) (I blame numpy)

@hayd I actually think this is a more general issue; your hack ok for now....

funny thing is I cannot repro this, e.g. Index(['a','b',np.nan]).equals(Index(['a','b',np.nan])) is True!

while in your example, the same is False!

I know! It's really weird... it's the np.testing.assert_array_equal which is failing (and it's supposed to ignore nan!). The good thing is, with get_dummies in master we can now repro this. :)
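As a sketch of the two observations in this thread: the standalone Index comparison, and a way to test the all-NaN case without depending on how a frame-level assertion treats NaN column labels. It assumes a recent pandas in which get_dummies accepts dtype; the variable names are illustrative and this is not the test that was merged.

```python
import numpy as np
import pandas as pd

# Two separately built indexes with NaN in the same position.
left = pd.Index(['a', 'b', np.nan])
right = pd.Index(['a', 'b', np.nan])
same_labels = left.equals(right)

# Single all-NaN input, as in the res_just_na test case.
res_just_na = pd.get_dummies([np.nan], dummy_na=True, dtype=float)

# Check the data and the lone NaN label separately, instead of relying on
# a frame-level comparison of the NaN-labelled columns.
values_ok = res_just_na.to_numpy().tolist() == [[1.0]]
label_ok = len(res_just_na.columns) == 1 and pd.isna(res_just_na.columns[0])
```

Splitting the check this way keeps the test meaningful even where the column comparison misbehaves, at the cost of being more verbose than the hack of copying the columns across.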

jreback · 2013-08-26T22:43:14Z

@hayd ahh...I just thought a mention of this new 'feature' should be in whatsnew (with an example)....seems like a nice feature

ENH add dummy_na argument to get_dummies TST add tests for get_dummies

hayd · 2013-08-26T23:06:17Z

added to what's new.. think i will leave doc writing for another day/pr.

hayd · 2013-08-26T23:44:08Z

wow that build took a while: https://travis-ci.org/hayd/pandas/builds/10644820

whoop, get_dummies my favourite pandas function (now in the docs!)

jreback · 2013-08-26T23:48:00Z

@hayd awesome!

hayd · 2013-08-27T00:15:39Z

@jreback crap, this upset travis somehow (which is weird cos the last few commits have been green, and this was before I merged... in fact I linked to it above!) https://travis-ci.org/pydata/pandas/jobs/10646008

hayd · 2013-08-27T00:17:37Z

pandas/tests/test_reshape.py

+        res_na = get_dummies(s, dummy_na=True)
+        exp_na = DataFrame({nan: {0: 0.0, 1: 0.0, 2: 1.0},
+                            'a': {0: 1.0, 1: 0.0, 2: 0.0},
+                            'b': {0: 0.0, 1: 1.0, 2: 0.0}}).iloc[:, [1, 2, 0]]


ha! I was just looking at that test before I saw it failed and thinking "hmmm does that work in python 3" - doh!

obviously should be using exp_na.reindex_axis(['a', 'b', np.nan], 1)

pushed fix to master
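A sketch of the label-based fix: selecting columns positionally with .iloc[:, [1, 2, 0]] depends on the order in which the constructor happens to lay out the dict keys, whereas reindexing by label does not. reindex_axis, named above, has since been removed from pandas, so this sketch assumes a current version and uses reindex(columns=...) instead.

```python
import numpy as np
import pandas as pd

exp_na = pd.DataFrame({np.nan: [0.0, 0.0, 1.0],
                       'a': [1.0, 0.0, 0.0],
                       'b': [0.0, 1.0, 0.0]})

# Order the columns by label, so the result is the same whatever order
# the constructor produced.
exp_na = exp_na.reindex(columns=['a', 'b', np.nan])

print(exp_na)
```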

hayd mentioned this pull request Aug 6, 2013

ENH/BUG: Fix names, levels and labels handling in MultiIndex #4039

Merged

hayd reviewed Aug 26, 2013

DOC add get_dummies to rst, example to docstring

6bf01ae

ENH add dummy_na argument to get_dummies TST add tests for get_dummies

DOC add get_dummies NaN to whatsnew

8765288

hayd added a commit that referenced this pull request Aug 26, 2013

Merge pull request #4458 from hayd/get_dummies

c467051

Get dummies

hayd merged commit c467051 into pandas-dev:master Aug 26, 2013

hayd reviewed Aug 27, 2013

hayd mentioned this pull request Oct 21, 2013

API: Add equals method to NDFrames. #5283

Merged
