ENH get_dummies str method #6132

Merged
merged 2 commits into from
Jan 28, 2014

Contributor

@hayd hayd commented Jan 27, 2014

fixes #3695.

In [8]: s = pd.Series(['a|b', 'a|c', np.nan], list('ABC'))

In [9]: s.str.get_dummies('|')
Out[9]: 
   a  b  c
A  1  1  0
B  1  0  1
C  0  0  0

It iterates over each tag, filling in the corresponding column with str.contains.
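
A minimal sketch of the approach described above, not the code in this PR: collect the set of tags, then fill one column per tag with str.contains. The helper name `get_dummies_sketch` and the separator-padding trick are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def get_dummies_sketch(s, sep="|"):
    # Hypothetical helper for illustration; not the pandas implementation.
    # Missing values become empty strings so they yield all-zero rows.
    # Padding with the separator lets a plain substring test match whole tags.
    padded = sep + s.fillna("").astype(str) + sep

    # Collect every distinct tag, dropping the empty string left by padding.
    tags = set()
    for parts in padded.str.split(sep):
        tags.update(parts)
    tags = sorted(tags - {""})

    # One column per tag, filled in with str.contains.
    columns = {
        t: padded.str.contains(sep + t + sep, regex=False).astype(np.int64)
        for t in tags
    }
    return pd.DataFrame(columns, index=s.index)


s = pd.Series(["a|b", "a|c", np.nan], index=list("ABC"))
result = get_dummies_sketch(s)
print(result)
```

The cost grows with the number of distinct tags, since each tag triggers one pass over the Series, which is consistent with the remark that this suits data with "few" tags.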

Edit: This works pretty fast with "few" tags (since I'm iterating over each tag), which I think is the main use case, but I'm sure some more perf can be eked out:

In [6]: %timeit genres.str.get_dummies('|')
10 loops, best of 3: 20.9 ms per loop

In [7]: %timeit genres.str.split('|').apply(lambda x: pd.Series(1., x)).fillna(0)
1 loops, best of 3: 623 ms per loop

@hayd
Contributor Author

hayd commented Jan 27, 2014

Atm this doesn't work with floats (other than NaN). I have a fix that makes NaN rows go to 0... this may be preferable? In fact, it would be more in line with get_dummies itself...

@jreback This could actually sneak into 0.13.1 ??

@jreback
Contributor

jreback commented Jan 27, 2014

sure....

pls add release note / v0.13.1 (if you want) and to API docs

FIX/DOC py3 and add docstrings

DOC add in get_dummies to release and docs
@hayd
Contributor Author

hayd commented Jan 27, 2014

@jreback Added in release and some docs.

Also, fixed for floats/ints, so you can do:

In [7]: s = pd.Series(['a|b', 'a|c', 1], list('ABC'))

In [8]: s.str.get_dummies('|')
Out[8]: 
   1  a  b  c
A  0  1  1  0
B  0  1  0  1
C  1  0  0  0

Tested for, but purposely not in the docs, as I don't think it's what people should be doing! For one thing, it won't play nice with 3 (int) and 3. (float) in an object Series.
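
One way to see the int/float concern, assuming non-string values are turned into labels by string conversion (an assumption for this sketch; the thread does not spell out the mechanism): the int 3 and the float 3. compare equal as numbers but produce different labels.

```python
# Assumption: labels come from str(value); this is illustrative only.
values = (3, 3.0, "3")
labels = [str(v) for v in values]

# 3 == 3.0 numerically, yet the string labels differ,
# so they would land in separate dummy columns.
distinct = sorted(set(labels))
print(distinct)
```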

@hayd
Contributor Author

hayd commented Jan 27, 2014

Ahhhhhhh, I'm confused: I just retested the timings and it is now slower. Confusing (since it was just as fast when I pushed).

@hayd
Contributor Author

hayd commented Jan 28, 2014

Meh, timings seem erratic. It's still 40 ms vs 500 ms, so it's pretty good. I just thought I had this down to 20 ms (for Wes' movie dataset).

@hayd
Contributor Author

hayd commented Jan 28, 2014

Ok, well refactored it to be nicer, and perhaps faster...

Definitely erratic though:

In [2]: s = pd.read_pickle('foo')

In [3]: %timeit -n 200 s.str.get_dummies('|')
200 loops, best of 3: 26.8 ms per loop

In [4]: s = pd.read_pickle('foo')

In [5]: %timeit -n 200 s.str.get_dummies('|')
200 loops, best of 3: 16.5 ms per loop

In [6]: s = pd.read_pickle('foo')

In [7]: %timeit -n 200 s.str.get_dummies('|')
200 loops, best of 3: 16.6 ms per loop
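
When timings jump around like this, repeating the measurement and looking at the spread can help. A rough sketch with the standard-library timeit module; the synthetic Series below is an assumption standing in for the pickled 'foo' data, which is not available here, so the numbers will not match those above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the pickled Series (assumption, not the real data).
rng = np.random.default_rng(0)
genre_pool = np.array(["Action", "Comedy", "Drama", "Horror", "Romance"])
s = pd.Series(
    ["|".join(rng.choice(genre_pool, size=2, replace=False)) for _ in range(2000)]
)

# Five independent repeats of ten calls each, converted to ms per call.
number = 10
runs = timeit.repeat(lambda: s.str.get_dummies("|"), repeat=5, number=number)
per_call_ms = [1000 * t / number for t in runs]

print(f"best {min(per_call_ms):.2f} ms, worst {max(per_call_ms):.2f} ms")
```

A large gap between best and worst suggests noise from the machine rather than a change in the code.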

hayd added a commit that referenced this pull request Jan 28, 2014
@hayd hayd merged commit f89ae34 into pandas-dev:master Jan 28, 2014
@hayd hayd deleted the str_get_dummies branch January 28, 2014 00:52
s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
s.str.get_dummies(sep='|')

See also ``pd.get_dummies``.
Contributor

FYI I think you need some kind of ref here?

@jorisvandenbossche

Contributor Author

I think it's like this (looking at a recent PR):

:func:`~pandas.get_dummies`
:func:`~pandas.Series.str.get_dummies`
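
For readers following the cross-reference, a small sketch contrasting the two functions: `pd.get_dummies` treats each value as one category, while `Series.str.get_dummies` first splits each string on a separator. The example data is made up for illustration.

```python
import numpy as np
import pandas as pd

single = pd.Series(["a", "b", "a"])
multi = pd.Series(["a", "a|b", np.nan, "a|c"])

# One indicator column per distinct value.
top_level = pd.get_dummies(single)

# One indicator column per distinct tag after splitting on "|".
str_method = multi.str.get_dummies(sep="|")

print(list(top_level.columns))
print(list(str_method.columns))
print(str_method.sum(axis=1).tolist())  # number of tags per row
```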

@jreback
Contributor

jreback commented Jan 28, 2014

and does this go in the API docs automagically? or need to define in api.rst?

@hayd
Contributor Author

hayd commented Jan 28, 2014

@jreback Ah, good point. I will add it to the api rst.

@hayd
Contributor Author

hayd commented Jan 28, 2014

Will push to master assuming the function ref is correct.

@jreback
Contributor

jreback commented Jan 28, 2014

gr8 thanks!

@jreback
Contributor

jreback commented Jan 28, 2014

seems your tests are breaking windows: http://scatterci.github.io/ScatterCI-Pandas/

tags.update(ts)
tags = sorted(tags - set([""]))

dummies = np.empty((len(arr), len(tags)), dtype=int)
Contributor

this needs to be dtype=np.int64; then all should be good
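
The thread does not say which test failed, but a likely reason for the suggestion is that `dtype=int` follows the platform's default integer, which was 32-bit on Windows with the NumPy versions of the time, whereas `np.int64` is the same everywhere. A sketch of the allocation with the explicit dtype; the sample `arr` and `tags` values are assumptions for illustration.

```python
import numpy as np

# Separator-padded strings and the sorted tag list (illustrative values).
arr = ["|a|b|", "|a|c|", "||"]
tags = ["a", "b", "c"]

# An explicit np.int64 gives the same dtype on every platform.
dummies = np.empty((len(arr), len(tags)), dtype=np.int64)
for i, t in enumerate(tags):
    # Every cell is overwritten, so np.empty's uninitialised contents never leak.
    dummies[:, i] = [("|" + t + "|") in x for x in arr]

print(dummies.dtype)
print(dummies)
```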

@hayd
Contributor Author

hayd commented Jan 28, 2014

Pushed (along with your fix). Oops about Windows, sorry! Thanks for the fix!

(Edit: looks fixed in scatter)

Successfully merging this pull request may close these issues.

Tags to dummies helper function
2 participants