ENH get_dummies str method #6132
atm this doesn't work with floats (other than NaN). I have a fix that makes NaN rows go to 0... this may be preferable? In fact, it would be more in line with get_dummies itself... @jreback This could actually sneak into 0.13.1 ??
sure.... pls add release note / v0.13.1 (if you want) and to API docs
FIX/DOC py3 and add docstrings
DOC add in get_dummies to release and docs
@jreback Added in release and some docs. Also, fixed for floats/ints, so you can do:
This is tested for but purposely not in the docs, as I don't think it's what people should be doing! ...For one thing, it won't play nice with 3 (int) and 3. (float) in an object Series.
Ahhhhhhh, I'm confused: I just retested and the timings are now slower. Confusing (since it was just as fast when I pushed).
meh. Timings seem erratic. It's still 40ms vs 500, so it's pretty good. I just thought I had this down to 20 (for Wes' movie dataset).
Ok, well, refactored it to be nicer, and perhaps faster... Definitely erratic though:
s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
s.str.get_dummies(sep='|')

See also ``pd.get_dummies``.
FYI I think you need some kind of ref here?
I think it's like this (looking at a recent PR):
:func:`~pandas.get_dummies`
:func:`~pandas.Series.str.get_dummies`
and does this go in the API docs automagically? or does it need to be defined in api.rst?
@jreback Ah, good point. I will add it to the api rst.
Will push to master assuming the function ref is correct.
gr8 thanks!
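For anyone reading along, here is a minimal runnable version of the doc snippet under review, assuming a pandas version that ships `Series.str.get_dummies`. The variable name `dummies` is only for illustration. Note that the NaN row comes out as all zeros:

```python
import numpy as np
import pandas as pd

# Same Series as in the doc snippet: tags are separated by '|'
s = pd.Series(['a', 'a|b', np.nan, 'a|c'])

# One indicator column per distinct tag
dummies = s.str.get_dummies(sep='|')
print(dummies)
#    a  b  c
# 0  1  0  0
# 1  1  1  0
# 2  0  0  0
# 3  1  0  1
```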
seems your tests are breaking windows: http://scatterci.github.io/ScatterCI-Pandas/
tags.update(ts)
tags = sorted(tags - set([""]))

dummies = np.empty((len(arr), len(tags)), dtype=int)
this needs to be dtype=np.int64
..then all should be good
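A small sketch of why the explicit dtype matters, assuming NumPy before 2.0: there `dtype=int` maps to the platform's C long, which is 32-bit on Windows, so results can differ from the int64 seen on Linux. The array shape and variable names here are made up for illustration:

```python
import numpy as np

# dtype=int follows the platform default integer (int32 on Windows with NumPy < 2.0)
platform_default = np.empty((4, 3), dtype=int)

# dtype=np.int64 is the same width on every platform
fixed_width = np.empty((4, 3), dtype=np.int64)

print(platform_default.dtype)  # platform dependent
print(fixed_width.dtype)       # int64
```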
pushed (along with your fix), oops about Windows. Sorry! Thanks for the fix! (Edit: looks fixed in scatter)
Fixes #3695.
Iterates over each tag, filling in with str.contains.
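A rough sketch of that approach, not necessarily the exact code in this PR. The helper name `get_dummies_sketch` is hypothetical, and wrapping each string in the separator is an assumption made here so that a tag only matches as a whole token:

```python
import numpy as np
import pandas as pd

def get_dummies_sketch(s, sep='|'):
    """Hypothetical helper: one str.contains pass per tag."""
    filled = s.fillna('')  # NaN rows end up with no tags set

    # Collect the distinct tags, dropping the empty string
    tags = set()
    for value in filled:
        tags.update(value.split(sep))
    tags = sorted(tags - {''})

    # Wrap each value in the separator so sep + tag + sep matches whole tags only
    wrapped = sep + filled + sep

    dummies = np.empty((len(s), len(tags)), dtype=np.int64)
    for i, tag in enumerate(tags):
        hit = wrapped.str.contains(sep + tag + sep, regex=False)
        dummies[:, i] = hit.astype(np.int64).to_numpy()
    return pd.DataFrame(dummies, index=s.index, columns=tags)

s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
result = get_dummies_sketch(s)
```

The cost is one vectorised pass over the Series per tag, so it scales with the number of distinct tags rather than with the number of rows alone.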
Edit: This works pretty fast with "few" tags (since I'm iterating over each tag), which I think is the main use case, but I'm sure some more perf can be eked out: