PERF faster head, tail and size groupby methods #5518

Closed
hayd wants to merge 3 commits from groupby_head_tail

hayd
Contributor

@hayd hayd commented Nov 15, 2013

This is some low-hanging fruit, significantly faster than master.

To give some numbers (in each pair, the first timing is with this change; the commented line below it is master):

In [1]: df = pd.DataFrame(np.random.randint(0, 100, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [2]: %timeit g.head(2)
1000 loops, best of 3: 429 µs per loop
#100 loops, best of 3: 9.67 ms per loop

In [3]: %timeit g.tail(2)
1000 loops, best of 3: 398 µs per loop
#100 loops, best of 3: 9.68 ms per loop

In [4]: %timeit g.size()
10000 loops, best of 3: 119 µs per loop
#1000 loops, best of 3: 649 µs per loop

In [11]: df = pd.DataFrame(np.random.randint(0, 10, 1000), columns=['A'], index=[0] * 1000); g = df.groupby('A')

In [12]: %timeit g.head(2)
10000 loops, best of 3: 189 µs per loop
#100 loops, best of 3: 2.1 ms per loop

In [13]: %timeit g.tail(2)
10000 loops, best of 3: 160 µs per loop
#100 loops, best of 3: 2.11 ms per loop

In [14]: %timeit g.size()
10000 loops, best of 3: 41.9 µs per loop
#1000 loops, best of 3: 598 µs per loop

...It's a bit messy to keep track of the as_index stuff (which you get for free when you use apply), see below. Am I missing some way to grab the final index?
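For readers following along, here is a minimal sketch of how head and tail can be computed without a per-group apply, using the within-group counter (cumcount) mentioned later in this thread. It is an illustration under that assumption, not necessarily the code in this PR; `fast_head` and `fast_tail` are hypothetical names.

```python
import pandas as pd


def fast_head(df, key, n):
    """Rows that are among the first n of their group, in original row order."""
    # cumcount numbers the rows 0, 1, 2, ... within each group.
    position = df.groupby(key).cumcount()
    # Use a plain boolean array so the mask is applied positionally.
    return df[(position < n).to_numpy()]


def fast_tail(df, key, n):
    """Rows that are among the last n of their group, in original row order."""
    # ascending=False counts down, so the last row of each group gets 0.
    position_from_end = df.groupby(key).cumcount(ascending=False)
    return df[(position_from_end < n).to_numpy()]


df = pd.DataFrame({"A": [1, 1, 3, 1, 3], "B": [2, 4, 6, 8, 10]})
head2 = fast_head(df, "A", 2)
tail1 = fast_tail(df, "A", 1)
```

The whole selection is a single boolean mask over the frame, which is where the speedup over calling head on every group would come from.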

@hayd
Contributor Author

hayd commented Nov 15, 2013

related: #5514

@hayd
Contributor Author

hayd commented Nov 15, 2013

@jreback @cpcloud @jtratner Have you guys come across this? Is there a nice internal API to get the index that as_index would return from a groupby? (If not, perhaps there ought to be...?)

@jreback
Contributor

jreback commented Nov 15, 2013

try g.grouper.group_info....lots of stuff...not exactly sure what you want

@hayd
Contributor Author

hayd commented Nov 15, 2013

So when doing these groupby operations cumcount/head/tail I can't keep track of the as_index=True part.

The difference is:

In [11]: g = df.groupby('A', as_index=True)

In [12]: g.head(2)  # my implementation (now fixed in this PR)
Out[12]: 
   A  B
0  1  2
1  1  4
2  3  6

In [13]: g.apply(lambda x: x.head(2))  # what it should be
Out[13]: 
     A  B
A        
1 0  1  2
  1  1  4
3 2  3  6

I was hoping I'd be able to access/create the MI easily/efficiently, and thought there would be internal API for it.

I don't even know how to combine two Index/MIs, other than by zipping...
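One possible way to build that MultiIndex without zipping, sketched under the assumption that the group key is an ordinary column: pass the key values and the original index as parallel arrays to `MultiIndex.from_arrays`. This is an illustration only, not the helper this PR adds.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3], "B": [2, 4, 6]})

# Select the first two rows of each group with a positional boolean mask.
mask = (df.groupby("A").cumcount() < 2).to_numpy()
selected = df[mask]

# Prepend the group key as the outer level; the inner level is the original index.
with_keys = selected.copy()
with_keys.index = pd.MultiIndex.from_arrays(
    [selected["A"].to_numpy(), selected.index.to_numpy()],
    names=["A", None],
)
```

The result has the same two-level shape as the apply output shown above: the group key on the outside, the original row label on the inside.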

@hayd
Contributor Author

hayd commented Nov 15, 2013

I'm not sure the as_index thing makes sense there anyway, but it seems to be the current API for head/tail.

@hayd
Contributor Author

hayd commented Nov 15, 2013

The same "bug" is in filter, i.e. it ignores the as_index.

I have a way to do it, but it's... inelegant... and SLOW (uses zip and MultiIndex.from_tuples). with MultiIndex.from_arrays

tbh I think breaking the API might be better :s
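A quick way to check the filter behaviour described above. This is a sketch that assumes a reasonably recent pandas, and behaviour may differ between versions; `keep_pairs` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3], "B": [2, 4, 6]})


def keep_pairs(group):
    # Keep only groups that have more than one row.
    return len(group) > 1


with_index = df.groupby("A", as_index=True).filter(keep_pairs)
without_index = df.groupby("A", as_index=False).filter(keep_pairs)
```

If filter ignores as_index, as the comment says, both results carry the original row labels rather than a group-keyed index.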

@hayd
Contributor Author

hayd commented Nov 15, 2013

Well, I can do it, it's not pretty though.

It's still significantly faster than before (but obviously slower than as_index=False). Slight change is that .head/tail keeps the ordering of the original dataframe (rather than in order of the groups), which I think is preferred anyway.
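To make the ordering difference concrete, here is a small sketch (not code from this PR) that contrasts a mask-based selection, which keeps the original row order, with concatenating each group's head, which yields group order.

```python
import pandas as pd

# Groups are interleaved so the two orderings differ.
df = pd.DataFrame({"A": [3, 1, 3, 1], "B": [10, 20, 30, 40]})

# Mask-based selection: first row of each group, rows stay in frame order.
in_head = (df.groupby("A").cumcount() < 1).to_numpy()
original_order = df[in_head]

# Per-group selection: groups are visited in sorted key order (1, then 3).
group_order = pd.concat([grp.head(1) for _, grp in df.groupby("A")])
```

Both results contain the same rows; only the order of the rows differs.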

The logic for apply and as_index is a little fishy actually; I wonder if a refactor could fix it.

Example:

In [1]: df = pd.DataFrame([[1, 2], [1, 4], [3, 6]], columns=['A', 'B']).set_index('A', append=True, drop=False); g = df.groupby('A')

In [2]: g.head(1)  # my way
Out[2]: 
       A  B
A   A      
1 0 1  1  2
3 2 3  3  6

In [3]: g.apply(lambda x: x.head(1))  # apply, loses level name
Out[3]: 
       A  B
A          
1 0 1  1  2
3 2 3  3  6

imo, you never want to as_index with head and tail anyways....

return bin_counts
counts = np.zeros(self.ngroups, dtype='int64')
for i, ind in enumerate(self.result_index):
    counts[i] = len(self.indices[ind])
Contributor Author

Not sure how to iterate over this faster (result_index has the correct ordering; indices, being a dict, doesn't)...
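One vectorised alternative to a Python-level loop over the groups, offered as a sketch rather than as what this PR ended up doing: turn the keys into integer codes and count them with `np.bincount`. It uses the public `pd.factorize` instead of grouper internals, and assumes the keys contain no missing values; `group_sizes` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def group_sizes(keys):
    """Count rows per key, with the result ordered by sorted key."""
    # codes[i] is the position of keys[i] within the sorted unique keys.
    codes, uniques = pd.factorize(keys, sort=True)
    # bincount tallies each code in a single pass over the array.
    counts = np.bincount(codes, minlength=len(uniques))
    return pd.Series(counts, index=uniques)


keys = pd.Series([3, 1, 1, 3, 1], name="A")
sizes = group_sizes(keys)
```

Because the codes already follow the sorted key order, there is no need to look each key up in a dict to get the counts in the right order.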

Contributor Author

Still, around 5+ times faster than the previous implementation.

Contributor Author

Actually... in the vbench example this is way slower (30 vs 144ms).... doh! Need to fix that up/revert.

@hayd
Contributor Author

hayd commented Nov 16, 2013

This is all fixed up, although there's probably some more juice to squeeze out here in future...

I added some tests into this for tail as there were none.

I dislike _index_with_as_index and suspect there's a better way...

0.13?

@hayd hayd closed this Nov 16, 2013
@hayd hayd reopened this Nov 17, 2013
@hayd hayd closed this Nov 17, 2013
@hayd hayd deleted the groupby_head_tail branch November 17, 2013 04:41
@hayd hayd restored the groupby_head_tail branch November 17, 2013 04:41