Get dummies #4458

Merged 2 commits on Aug 26, 2013

hayd
Contributor

@hayd hayd commented Aug 4, 2013

fixes #4446, #4444

Added a new dummy_na argument (thoughts?). It's slightly different from a possible dropna argument, which I haven't included because the same result can be achieved using pd.get_dummies(s.dropna()).

Example:

In [3]: s = ['a', 'b', np.nan]

In [4]: pd.get_dummies(s)
Out[4]:
   a  b
0  1  0
1  0  1
2  0  0

In [5]: pd.get_dummies(s, dummy_na=True)
Out[5]:
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1

In [6]: pd.get_dummies(pd.Series(s).dropna())  # different
Out[6]:
   a  b
0  1  0
1  0  1
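
For readers on a current pandas release, the three calls above can be reproduced with the sketch below. It assumes a recent pandas, where the dummy columns may come back as booleans rather than the 0/1 values shown in the 2013 output, so it only looks at shapes and at the NaN row.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan])

# Default: the NaN row is kept but gets no indicator (all zeros).
default = pd.get_dummies(s)

# dummy_na=True: an extra NaN column flags the missing row.
with_na = pd.get_dummies(s, dummy_na=True)

# Dropping first removes the row entirely, so the result is one row shorter.
dropped = pd.get_dummies(s.dropna())

print(default.shape, with_na.shape, dropped.shape)
```

The point of the comparison is that dummy_na keeps the row and marks it, while dropna changes the index of the result.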

Note: at the moment there is a (strange) test failure with the above example; I'm not quite sure what's going on:

res_na = get_dummies(s, dummy_na=True)
exp_na = DataFrame({nan: {0: 0.0, 1: 0.0, 2: 1.0},
                    'a': {0: 1.0, 1: 0.0, 2: 0.0},
                    'b': {0: 0.0, 1: 1.0, 2: 0.0}}).iloc[:, [1, 2, 0]]  # need to reorder cols
assert_frame_equal(res_na, exp_na)

-> assert(left.columns.equals(right.columns))
(Pdb) left
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1
(Pdb) right
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1

@hayd
Contributor Author

hayd commented Aug 5, 2013

I think this comes from:

(Pdb) np.testing.assert_array_equal(left.columns.values, right.columns.values)
*** AssertionError:
Arrays are not equal

(mismatch 33.3333333333%)
 x: array(['a', 'b', nan], dtype=object)
 y: array(['a', 'b', nan], dtype=object)

Which is confusing, as assert_array_equal is supposed to treat NaNs as equal. From the NumPy docs:

"NaNs are compared like numbers, no assertion is raised if both objects have NaNs in the same positions."

(Pdb) np.isnan(left.columns.values[2]), np.isnan(right.columns.values[2])
(True, True)

(Pdb) left.columns[:2].equals(right.columns[:2])
True
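
As a side note on why NaN labels are awkward to compare, here is a minimal sketch. It is not a diagnosis of the failure above (the thread never pins that down). It only shows that plain elementwise equality on object arrays treats two distinct NaN objects as unequal, while a NaN-aware check built on pd.isna does not. The array names are illustrative.

```python
import numpy as np
import pandas as pd

left = np.array(['a', 'b', np.nan], dtype=object)
right = np.array(['a', 'b', float('nan')], dtype=object)  # a distinct NaN object

# Plain equality: NaN != NaN, so the last position compares False.
plain = left == right

# NaN-aware equality: positions match if equal or if both are missing.
both_missing = pd.isna(left) & pd.isna(right)
nan_aware = plain | both_missing

print(plain.tolist(), nan_aware.tolist())
```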

@hayd
Contributor Author

hayd commented Aug 25, 2013

the failing build is here: https://travis-ci.org/hayd/pandas/builds/9834728

Worked around it by just setting the columns manually (which is kinda annoying; I should create a separate issue).

@hayd
Contributor Author

hayd commented Aug 25, 2013

The other option is to have a more consistent dropna argument which defaults to True. However, it's not really dropping the NaN row, it's just zeroing it. That's my reasoning behind using dummy_na...

@hayd
Contributor Author

hayd commented Aug 26, 2013

@jreback speaking of assert_frame_equal did you see this weird bug?

Also, can we merge this?

@jreback
Contributor

jreback commented Aug 26, 2013

@hayd what's the assert_frame_equal bug?

I would add an example of get_dummies in v0.13.0.txt

@jreback
Contributor

jreback commented Aug 26, 2013

otherwise looks ok

@hayd
Contributor Author

hayd commented Aug 26, 2013

Do you mean in reshaping.rst? (It's not a new function btw, just untested (!) and undocumented... it is in Wes' book :) )

@jreback
Contributor

jreback commented Aug 26, 2013

but didn't you add the NA handling? (if yes, then need an example in whatsnew)

(yes maybe an example/better example in reshape too!)

@hayd
Contributor Author

hayd commented Aug 26, 2013

ah, I put it in the release.rst (should it also be somewhere else too?)

exp_just_na = DataFrame({nan: {0: 1.0}})
# hack (NaN handling in assert_index_equal)
exp_just_na.columns = res_just_na.columns
assert_frame_equal(res_just_na, exp_just_na)

Contributor Author

@jreback the weird assert_frame_equal bug is here (if you remove the hack, this fails, and I can't repro it outside of this)
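
The hack in the snippet above (copying the result's columns onto the expected frame) can be paired with an explicit NaN-aware label check, so the test still verifies the column names. This is a sketch only: `labels_match` is a hypothetical helper, not a pandas API, and values are compared as integers because a recent pandas may return boolean dummies.

```python
import numpy as np
import pandas as pd


def labels_match(left, right):
    """Return True if two label sequences are equal, treating NaN as equal to NaN."""
    left = np.asarray(left, dtype=object)
    right = np.asarray(right, dtype=object)
    if left.shape != right.shape:
        return False
    both_missing = pd.isna(left) & pd.isna(right)
    return bool(np.all(both_missing | (left == right)))


res = pd.get_dummies(pd.Series(['a', 'b', np.nan]), dummy_na=True)
exp = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0], np.nan: [0, 0, 1]})

# Check the labels explicitly, then apply the column-copy workaround.
assert labels_match(res.columns, exp.columns)
exp.columns = res.columns
values_equal = np.array_equal(res.to_numpy().astype(int),
                              exp.to_numpy().astype(int))
print(values_equal)
```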

Contributor

ahh... I see. NaN in indices is very odd (but somewhat supported); assert_frame_equal probably just does .equals on the indices, which I think fails when they contain NaN... let me look

Contributor

hmm..that's not it...let me look further

Contributor Author

I don't know if you read my comment above: #4458 (comment) (I blame numpy)

Contributor

@hayd I actually think this is a more general issue; your hack ok for now....

Contributor

funny thing is I cannot repro this, e.g. Index(['a','b',np.nan]).equals(Index(['a','b',np.nan])) is True!

while in your example, the same is False!

Contributor Author

I know! It's really weird... it's the np.testing.assert_array_equal which is failing (and it's supposed to ignore NaN!). The good thing is, with get_dummies in master we can now repro this. :)

@jreback
Contributor

jreback commented Aug 26, 2013

@hayd ahh...I just thought a mention of this new 'feature' should be in whatsnew (with an example)....seems like a nice feature

ENH add dummy_na argument to get_dummies

TST add tests for get_dummies
@hayd
Contributor Author

hayd commented Aug 26, 2013

added to what's new.. think i will leave doc writing for another day/pr.

hayd added a commit that referenced this pull request Aug 26, 2013
@hayd hayd merged commit c467051 into pandas-dev:master Aug 26, 2013
@hayd
Contributor Author

hayd commented Aug 26, 2013

wow that build took a while: https://travis-ci.org/hayd/pandas/builds/10644820

whoop, get_dummies my favourite pandas function (now in the docs!)

@jreback
Contributor

jreback commented Aug 26, 2013

@hayd awesome!

@hayd
Contributor Author

hayd commented Aug 27, 2013

@jreback crap, this upset travis somehow (which is weird cos the last few commits have been green, and this was before I merged... in fact I linked to it above!) https://travis-ci.org/pydata/pandas/jobs/10646008

res_na = get_dummies(s, dummy_na=True)
exp_na = DataFrame({nan: {0: 0.0, 1: 0.0, 2: 1.0},
                    'a': {0: 1.0, 1: 0.0, 2: 0.0},
                    'b': {0: 0.0, 1: 1.0, 2: 0.0}}).iloc[:, [1, 2, 0]]

Contributor Author

ha! I was just looking at that test before I saw it failed and thinking "hmmm does that work in python 3" - doh!

Contributor Author

obviously should be using exp_na.reindex_axis(['a', 'b', np.nan], 1)
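
The problem with `.iloc[:, [1, 2, 0]]` is that it depends on the order in which the dict keys end up as columns, which is not something the test controls. Reordering by label avoids that. `reindex_axis` has since been removed from pandas, so this sketch assumes a recent release and uses `reindex(columns=...)` for the same purpose.

```python
import numpy as np
import pandas as pd

# Columns are deliberately built in the "wrong" order: NaN first.
exp_na = pd.DataFrame({np.nan: [0.0, 0.0, 1.0],
                       'a': [1.0, 0.0, 0.0],
                       'b': [0.0, 1.0, 0.0]})

# Select columns by label, independent of their original position.
ordered = exp_na.reindex(columns=['a', 'b', np.nan])

print(ordered)
```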

Contributor Author

pushed fix to master

Development

Successfully merging this pull request may close these issues.

get_dummies with NaN
2 participants