Tags to dummies helper function #3695

Closed
wesm opened this issue May 25, 2013 · 7 comments · Fixed by #6132
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Comments

@wesm
Member

wesm commented May 25, 2013

from @jreback's clever data alignment trick: http://stackoverflow.com/questions/16637171/pandas-reshaping-data

not sure where this should go, but it does come up quite frequently, e.g. with the MovieLens data set https://raw.github.com/pydata/pydata-book/master/ch02/movielens/movies.dat:

df = read_table('https://raw.github.com/pydata/pydata-book/master/ch02/movielens/movies.dat', header=None, sep='::')

In [10]: genres = df[2].str.split('|')

In [11]: genres
Out[11]: 
0      [Animation, Children's, Comedy]
1     [Adventure, Children's, Fantasy]
2                    [Comedy, Romance]
3                      [Comedy, Drama]
4                             [Comedy]
5            [Action, Crime, Thriller]
6                    [Comedy, Romance]
7              [Adventure, Children's]
8                             [Action]
9        [Action, Adventure, Thriller]
10            [Comedy, Drama, Romance]
11                    [Comedy, Horror]
12             [Animation, Children's]
13                             [Drama]
14        [Action, Adventure, Romance]
...
3868                              [Horror]
3869                              [Horror]
3870                              [Horror]
3871                              [Horror]
3872                              [Horror]
3873                              [Comedy]
3874                       [Comedy, Drama]
3875    [Adventure, Animation, Children's]
3876             [Action, Drama, Thriller]
3877                            [Thriller]
3878                              [Comedy]
3879                               [Drama]
3880                               [Drama]
3881                               [Drama]
3882                     [Drama, Thriller]
Name: 2, Length: 3883, dtype: object

In [12]: dummies = genres.apply(lambda x: Series(1, index=x)).fillna(0)

In [13]: dummies[:4].T
Out[13]: 
             0  1  2  3
Action       0  0  0  0
Adventure    0  1  0  0
Animation    1  0  0  0
Children's   1  1  0  0
Comedy       1  0  1  1
Crime        0  0  0  0
Documentary  0  0  0  0
Drama        0  0  0  1
Fantasy      0  1  0  0
Film-Noir    0  0  0  0
Horror       0  0  0  0
Musical      0  0  0  0
Mystery      0  0  0  0
Romance      0  0  1  0
Sci-Fi       0  0  0  0
Thriller     0  0  0  0
War          0  0  0  0
Western      0  0  0  0
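
For readers without the MovieLens file at hand, the same alignment idea can be sketched on a few inline rows. The sample strings below are made up for illustration and are not read from movies.dat. Note one substitution: the alignment here is done by the DataFrame constructor on a dict of Series rather than by Series.apply, so the sketch does not depend on apply expanding Series results into columns.

```python
import pandas as pd

# Illustrative genre strings in the same 'A|B|C' format (assumed sample data).
raw = pd.Series([
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama",
])

genres = raw.str.split("|")

# Each row becomes a Series of ones indexed by its tags. Building a DataFrame
# from a dict of Series aligns them on the union of all tags, leaving NaN
# wherever a row lacks a tag.
aligned = pd.DataFrame({i: pd.Series(1, index=tags) for i, tags in genres.items()})

# Transpose to one row per record, then turn the NaN gaps into zeros.
dummies = aligned.T.fillna(0).astype(int)

print(dummies.T)
```

The per-row Series construction is also what makes this approach slow on large inputs, since one small Series is created per record.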

@wesm
Member Author

wesm commented May 25, 2013

N.B. this method is a bit slow; something more optimized would be nice

@hayd
Contributor

hayd commented Oct 21, 2013

Perhaps get_dummies could at the same time be made a Series method (as well as being top level)... Or does Series.str.get_dummies (also??) make sense for this delimiter-based splitting...

@jreback
Contributor

jreback commented Oct 21, 2013

I don't think this should be a string method

you can always do s.extract().get_dummies(), which will operate on iterables inside the series elements (in other words, lists)
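
In current pandas, one way to get this effect for a Series whose elements are lists is Series.explode followed by the top-level get_dummies. This is a substitute for the s.extract() spelling above, not something proposed in the thread, and the sample data is made up:

```python
import pandas as pd

# Assumed sample data: each element is already a list of tags.
s = pd.Series([["Comedy", "Romance"], ["Comedy", "Drama"], ["Comedy"]])

# explode() emits one row per list element, repeating the original index label.
long = s.explode()

# One-hot encode the flat values, then collapse back to one row per original
# label; max() keeps a 1 wherever any of the exploded rows had that tag.
dummies = pd.get_dummies(long, dtype=int).groupby(level=0).max()

print(dummies)
```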

@ghost ghost assigned hayd Oct 21, 2013
@jreback
Contributor

jreback commented Oct 21, 2013

assigning you for 0.14!

@hayd
Contributor

hayd commented Oct 22, 2013

sounds good :)

@hayd
Contributor

hayd commented Oct 24, 2013

While I'm looking at get_dummies, it may also be a good idea to add a bins argument, to exactly cover the use case with a cut (we already do the same with value_counts, which is related).

Edit: Not sure what I was talking about re bins (doesn't make sense for categorical), additional munging can be done after I guess.
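
The binned use case can already be handled without a bins argument by composing pd.cut with pd.get_dummies. A minimal sketch, where the ages, bin edges, and labels are all made up for illustration:

```python
import pandas as pd

# Assumed sample data.
ages = pd.Series([3, 17, 25, 42, 68])

# Bin first: pd.cut returns a categorical Series with one category per
# interval, here (0, 18], (18, 40] and (40, 100].
binned = pd.cut(ages, bins=[0, 18, 40, 100], labels=["child", "adult", "senior"])

# Then encode the bins as indicator columns, one per category.
dummies = pd.get_dummies(binned, dtype=int)

print(dummies)
```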

@hayd
Contributor

hayd commented Jan 27, 2014

I think it makes sense to iterate the other way (over the tags); in this specific example that makes it a lot faster. If there were a lot of tags, perhaps it would be slower. With #6132:

In [6]: %timeit genres.str.get_dummies('|')
10 loops, best of 3: 20.9 ms per loop

In [7]: %timeit genres.str.split('|').apply(lambda x: pd.Series(1., x)).fillna(0)
1 loops, best of 3: 623 ms per loop
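
A small sketch of the two ideas in this comment, on made-up strings: the built-in Series.str.get_dummies, and a hand-rolled version that iterates over the tags rather than over the rows. The hand-rolled version only illustrates the idea and should not be read as the code in #6132.

```python
import pandas as pd

# Assumed sample data in the same 'A|B' format.
raw = pd.Series(["Comedy|Romance", "Comedy|Drama", "Action"])

# Built-in: split each string on '|' and return one indicator column per tag.
builtin = raw.str.get_dummies("|")

# Hand-rolled: collect the distinct tags once, then build one column per tag
# with a membership test over the rows.
tag_sets = raw.str.split("|").map(set)
tags = sorted(set().union(*tag_sets))
manual = pd.DataFrame(
    {tag: [int(tag in row) for row in tag_sets] for tag in tags},
    index=raw.index,
)

print(builtin)
print(manual)
```

The outer loop runs once per distinct tag, which is why this direction is cheap when there are few tags and many rows, and could become slower as the number of distinct tags grows.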

@wesm wesm unassigned hayd Oct 12, 2016