PERF faster head, tail and size groupby methods #5518

hayd · 2013-11-15T00:30:38Z

This is some low hanging fruit, significantly faster than master.

To give some numbers (the timing before the change is the commented line below each new timing):

In [1]: df = pd.DataFrame(np.random.randint(0, 100, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [2]: %timeit g.head(2)
1000 loops, best of 3: 429 µs per loop
#100 loops, best of 3: 9.67 ms per loop

In [3]: %timeit g.tail(2)
1000 loops, best of 3: 398 µs per loop
#100 loops, best of 3: 9.68 ms per loop

In [4]: %timeit g.size()
10000 loops, best of 3: 119 µs per loop
#1000 loops, best of 3: 649 µs per loop

In [11]: df = pd.DataFrame(np.random.randint(0, 10, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [12]: %timeit g.head(2)
10000 loops, best of 3: 189 µs per loop
#100 loops, best of 3: 2.1 ms per loop

In [13]: %timeit g.tail(2)
10000 loops, best of 3: 160 µs per loop
#100 loops, best of 3: 2.11 ms per loop

In [14]: %timeit g.size()
10000 loops, best of 3: 41.9 µs per loop
#1000 loops, best of 3: 598 µs per loop

...It's a bit messy to keep track of the as_index stuff (which you get when you apply), see below. Am I missing some way to grab the final index?

hayd · 2013-11-15T01:35:03Z

related: #5514

hayd · 2013-11-15T18:54:03Z

@jreback @cpcloud @jtratner Have you guys come across this? Is there a nice internal API to get the index that would be returned with as_index from a groupby? (If not, perhaps there ought to be...?)

jreback · 2013-11-15T18:59:47Z

try g.grouper.group_info....lots of stuff...not exactly sure what you want
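For readers who want to see what kind of per-group information is available without reaching into internals, here is a minimal sketch using public GroupBy accessors (`ngroup`, `ngroups`, `indices`, `groups`). These are not the `g.grouper.group_info` call suggested above, which is internal and may differ between pandas versions; the small frame is an illustrative assumption.

```python
import pandas as pd

# Illustrative frame (an assumption, matching the small example used later in the thread).
df = pd.DataFrame([[1, 2], [1, 4], [3, 6]], columns=["A", "B"])
g = df.groupby("A")

group_ids = g.ngroup()   # per-row integer id of the group each row belongs to
n_groups = g.ngroups     # number of distinct groups
positions = g.indices    # dict: group key -> positional indices of its rows
keys = list(g.groups)    # group keys, in sorted order for the default sort=True
```

The group ids and keys together are enough to rebuild an outer index level by hand, which is the piece the as_index question is about.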

hayd · 2013-11-15T19:10:55Z

So when doing these groupby operations cumcount/head/tail I can't keep track of the as_index=True part.

The difference is:

In [11]: g = df.groupby('A', as_index=True)

In [12]: g.head(2)  # my implementation (now fixed in this PR)
Out[12]: 
   A  B
0  1  2
1  1  4
2  3  6

In [13]: g.apply(lambda x: x.head(2))  # what it should be
Out[13]: 
     A  B
A        
1 0  1  2
  1  1  4
3 2  3  6

I was hoping I'd be able to access/create the MI easily/efficiently, and thought there would be an internal API for it.

Even combining two Index/MIs I don't know how to do, other than zipping...

hayd · 2013-11-15T19:12:02Z

I'm not sure the as_index thing makes sense there anyway, but it seems to be the current API for head/tail.

hayd · 2013-11-15T20:19:56Z

The same "bug" is in filter, i.e. it ignores the as_index.

I have a way to do it, but it's... inelegant... and SLOW ~~(uses zip and MultiIndex.from_tuples).~~ with MultiIndex.from_arrays

tbh I think breaking the API might be better :s
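As a rough illustration of the MultiIndex.from_arrays idea mentioned above, the sketch below takes a flat "first n rows per group" result and prepends the group key as an outer index level, which mimics the shape of the apply output. It is an assumption about how this could be done, not the code in this PR; the names `picked` and `with_keys` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [3, 6]], columns=["A", "B"])

# Flat result: first two rows of each group, original index kept.
picked = df[df.groupby("A").cumcount() < 2]

# Build a two-level index: group key outside, original row label inside.
mi = pd.MultiIndex.from_arrays(
    [picked["A"].to_numpy(), picked.index],
    names=["A", None],
)

with_keys = picked.copy()
with_keys.index = mi
```

Note that this keeps the rows in their original order; the apply version orders them by group, so the two only coincide when the frame is already sorted by the key, as it is here.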

hayd · 2013-11-15T22:18:14Z

Well, I can do it, it's not pretty though.

It's still significantly faster than before (but obviously slower than as_index=False). One slight change is that .head/.tail keep the ordering of the original DataFrame (rather than the order of the groups), which I think is preferable anyway.
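To make the ordering point concrete, here is a minimal sketch of a mask-based head/tail built on cumcount, compared with a per-group concatenation. It is meant as an illustration of the idea under the assumption that a cumcount mask is used; it is not a copy of the PR's implementation, and the interleaved frame is made up for the example.

```python
import pandas as pd

# Groups are interleaved on purpose so the two orderings differ.
df = pd.DataFrame({"A": [1, 3, 1, 3, 1], "B": [2, 6, 4, 8, 10]})
g = df.groupby("A")

# cumcount() numbers rows within each group 0, 1, 2, ... in original order,
# so a boolean mask selects the first two rows per group in one pass.
head_fast = df[g.cumcount() < 2]

# Counting from the end of each group gives the tail.
tail_fast = df[g.cumcount(ascending=False) < 2]

# Per-group approach: rows come out group by group instead.
by_group = pd.concat([grp.head(2) for _, grp in g])

print(list(head_fast.index))
print(list(by_group.index))
```

The mask version never splits the frame into per-group pieces, which is where the speedup would come from in this sketch.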

The logic for apply and as_index is a little fishy actually; I wonder if a refactor could fix it.

Example:

In [1]: df = pd.DataFrame([[1, 2], [1, 4], [3, 6]], columns=['A', 'B']).set_index('A', append=True, drop=False)

In [2]: g.head(1)  # my way
Out[2]: 
       A  B
A   A      
1 0 1  1  2
3 2 3  3  6

In [3]: g.apply(lambda x: x.head(1))  # apply, loses level name
Out[3]: 
       A  B
A          
1 0 1  1  2
3 2 3  3  6

imo, you never want as_index with head and tail anyway....

hayd · 2013-11-15T22:42:43Z

pandas/core/groupby.py

-        return bin_counts
+        counts = np.zeros(self.ngroups, dtype='int64')
+        for i, ind in enumerate(self.result_index):
+            counts[i] = len(self.indices[ind])


Not sure how to iterate over this faster (result_index has the correct ordering; indices, being a dict, doesn't)...

Still, around 5+ times faster than the previous implementation.

Actually... in the vbench example this is way slower (30 vs 144ms).... doh! Need to fix that up/revert.
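One way to count group sizes without a Python-level loop over a dict is to factorize the key and use np.bincount on the resulting integer codes. The sketch below illustrates that technique only; it is an assumption, not what this PR ended up with, and it ignores missing keys (NaN would produce a code of -1, which bincount rejects).

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"A": np.random.randint(0, 10, 1000)})

# codes[i] is the integer position of row i's key within the sorted uniques.
codes, uniques = pd.factorize(df["A"], sort=True)

# One vectorized pass: counts[k] is the number of rows whose code is k.
counts = np.bincount(codes, minlength=len(uniques))

size = pd.Series(counts, index=pd.Index(uniques, name="A"))
expected = df.groupby("A").size()
```

Because the codes come out in sorted key order, the counts line up with the group order that groupby uses by default.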

hayd · 2013-11-16T07:12:44Z

This is all fixed up, although probably some more juice to get out here in future...

I added some tests into this for tail as there were none.

I dislike _index_with_as_index and suspect there's a better way...

0.13?

hayd reviewed Nov 15, 2013
hayd closed this Nov 16, 2013

jtratner and others added 3 commits November 16, 2013 19:46

Merge pull request pandas-dev#5038 from alefnula/pep8

822178e

CLN: PEP8 cleanup

CLN: More autopep8

d250d64

PERF faster head, tail and size groupby methods

ab70cb4

hayd reopened this Nov 17, 2013

hayd closed this Nov 17, 2013

hayd deleted the groupby_head_tail branch November 17, 2013 04:41

hayd restored the groupby_head_tail branch November 17, 2013 04:41

hayd mentioned this pull request Nov 17, 2013

PERF faster head, tail and size groupby methods #5533

Merged
