ENH get_dummies str method #6132

hayd · 2014-01-27T21:04:44Z

In [8]: s = pd.Series(['a|b', 'a|c', np.nan], list('ABC'))

In [9]: s.str.get_dummies('|')
Out[9]: 
   a  b  c
A  1  1  0
B  1  0  1
C  0  0  0

This iterates over each tag, filling in each column with str.contains.

Edit: This works pretty fast with "few" tags (since I'm iterating over each tag), which I think is the main use case, but I'm sure some more perf can be eked out:

In [6]: %timeit genres.str.get_dummies('|')
10 loops, best of 3: 20.9 ms per loop

In [7]: %timeit genres.str.split('|').apply(lambda x: pd.Series(1., x)).fillna(0)
1 loops, best of 3: 623 ms per loop
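The tag-by-tag approach described above can be sketched in plain Python. This is only an illustration of the idea, not the code in this PR: the function name `tag_dummies` is hypothetical, and wrapping each entry in separators (so a tag matches only a whole token, never a substring of a longer tag) is an assumption of this sketch.

```python
def tag_dummies(values, sep="|"):
    """Return (sorted tags, {tag: column of 0/1}) for delimited strings.

    Hypothetical helper for illustration; non-string entries such as NaN
    are treated as having no tags.
    """
    # Wrap each entry in separators: 'a|b' becomes '|a|b|', so tag 'a'
    # can be looked up as '|a|' and never matches inside another tag.
    wrapped = [sep + v + sep if isinstance(v, str) else "" for v in values]

    # Collect every distinct tag, dropping the empty strings from the split.
    tags = set()
    for w in wrapped:
        tags.update(w.split(sep))
    tags = sorted(tags - {""})

    # One pass per tag: fill that tag's column with a containment test.
    columns = {t: [int(sep + t + sep in w) for w in wrapped] for t in tags}
    return tags, columns


tags, columns = tag_dummies(["a|b", "a|c", float("nan")])
print(tags)
print(columns)
```

The cost grows with the number of distinct tags, which is consistent with the remark that this is fast when there are "few" tags.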

hayd · 2014-01-27T22:28:29Z

At the moment this doesn't work with floats (other than NaN). I have a fix that makes NaN rows go to 0... this may be preferable? In fact, it would be more in line with get_dummies itself...

@jreback This could actually sneak into 0.13.1 ??

jreback · 2014-01-27T22:41:59Z

sure....

pls add a release note / v0.13.1 (if you want) and add to the API docs

FIX/DOC py3 and add docstrings DOC add in get_dummies to release and docs

hayd · 2014-01-27T23:17:22Z

@jreback Added in release and some docs.

Also, fixed for floats/ints, so you can do:

In [7]: s = pd.Series(['a|b', 'a|c', 1], list('ABC'))

In [8]: s.str.get_dummies('|')
Out[8]: 
   1  a  b  c
A  0  1  1  0
B  0  1  0  1
C  1  0  0  0

This is tested for but purposely not in the docs, as I don't think it's what people should be doing! ...For one thing, it won't play nicely with 3 (int) and 3. (float) in an object Series.
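A small sketch of why int 3 and float 3. are awkward here, assuming (this is not stated in the thread) that non-string entries are turned into text before being used as tags. The helper name `label` is hypothetical.

```python
def label(value):
    """Hypothetical helper: the text a raw entry would contribute as a tag."""
    return str(value)


mixed = ["a|b", 3, 3.0]
labels = [label(v) for v in mixed]

# 3 and 3.0 compare equal as numbers, yet their text forms differ, so they
# would end up as two separate dummy columns under this assumption.
print(3 == 3.0)
print(labels)
```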

hayd · 2014-01-27T23:22:00Z

Ahhhhhhh, I'm confused: I just retested the timings and it's now slower. Confusing (since it was just as fast when I pushed).

hayd · 2014-01-28T00:11:17Z

Meh, timings seem erratic. It's still 40 ms vs 500 ms, so it's pretty good. I just thought I had this down to 20 ms (for Wes' movie dataset).

hayd · 2014-01-28T00:27:20Z

Ok, well refactored it to be nicer, and perhaps faster...

Definitely erratic though:

In [2]: s = pd.read_pickle('foo')

In [3]: %timeit -n 200 s.str.get_dummies('|')
200 loops, best of 3: 26.8 ms per loop

In [4]: s = pd.read_pickle('foo')

In [5]: %timeit -n 200 s.str.get_dummies('|')
200 loops, best of 3: 16.5 ms per loop

In [6]: s = pd.read_pickle('foo')

In [7]: %timeit -n 200 s.str.get_dummies('|')
200 loops, best of 3: 16.6 ms per loop
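When timings jump around like this, one option is to look at the spread across several independent runs rather than a single figure. A minimal sketch using the standard library's `timeit.repeat`; the `work` function is a hypothetical stand-in, not the benchmark used in this thread.

```python
import timeit


def work():
    # Stand-in workload (hypothetical); replace with the call being measured.
    data = ["a|b", "a|c"] * 1000
    return sorted({tag for s in data for tag in s.split("|")})


NUMBER = 20  # calls per run
REPEAT = 5   # independent runs

# timeit.repeat returns the total time of each run, in seconds.
runs = timeit.repeat(work, number=NUMBER, repeat=REPEAT)
per_call_ms = [1000 * t / NUMBER for t in runs]
best, worst = min(per_call_ms), max(per_call_ms)
print(f"best {best:.3f} ms, worst {worst:.3f} ms per call")
```

A large gap between best and worst suggests noise from the machine (caching, other processes) rather than a change in the code under test.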

ENH get_dummies str method

jreback · 2014-01-28T01:00:53Z

doc/source/basics.rst

+      s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
+      s.str.get_dummies(sep='|')
+
+See also ``pd.get_dummies``.


FYI I think you need some kind of ref here?

@jorisvandenbossche

I think it's like this (looking at a recent PR):

:func:`~pandas.get_dummies` :func:`~pandas.Series.str.get_dummies`

jreback · 2014-01-28T01:01:41Z

and does this go in the API docs automagically? or need to define in api.rst?

hayd · 2014-01-28T01:05:32Z

@jreback Ah, good point. I will add it to the api rst.

hayd · 2014-01-28T01:11:23Z

Will push to master assuming the function ref is correct.

jreback · 2014-01-28T01:15:17Z

gr8 thanks!

jreback · 2014-01-28T01:16:13Z

seems your tests are breaking windows: http://scatterci.github.io/ScatterCI-Pandas/

jreback · 2014-01-28T01:17:47Z

pandas/core/strings.py

+        tags.update(ts)
+    tags = sorted(tags - set([""]))
+
+    dummies = np.empty((len(arr), len(tags)), dtype=int)


This needs to be dtype=np.int64... then all should be good.
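A short sketch of the difference, for context. On Windows builds of NumPy before 2.0, `dtype=int` resolves to a 32-bit integer, while `np.int64` is 64 bits everywhere; that this is what broke the Windows tests is my reading and is not spelled out in the thread. The array shape below is made up for illustration.

```python
import numpy as np

n_rows, n_tags = 3, 3  # hypothetical sizes, for illustration only

# Explicit width: the same dtype on every platform.
dummies = np.empty((n_rows, n_tags), dtype=np.int64)
dummies.fill(0)  # np.empty leaves the buffer uninitialised

# Platform default: may be int32 or int64 depending on OS and NumPy version.
platform_default = np.dtype(int)

print(dummies.dtype, platform_default)
```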

hayd · 2014-01-28T02:28:54Z

Pushed (along with your fix). Oops about Windows, sorry! Thanks for the fix!

(Edit: looks fixed in scatter)

hayd mentioned this pull request Jan 27, 2014

Tags to dummies helper function #3695

Closed

ENH get_dummies str method

2c5e3d3

FIX/DOC py3 and add docstrings DOC add in get_dummies to release and docs

PERF speed up str.get_dummies

d8f94e9

hayd added a commit that referenced this pull request Jan 28, 2014

Merge pull request #6132 from hayd/str_get_dummies

f89ae34

ENH get_dummies str method

hayd merged commit f89ae34 into pandas-dev:master Jan 28, 2014

hayd deleted the str_get_dummies branch January 28, 2014 00:52

jreback reviewed Jan 28, 2014