PERF faster head, tail and size groupby methods #5518
Conversation
related: #5514

try
So when doing these groupby operations (cumcount/head/tail) I can't keep track of the as_index=True part. The difference is:
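A minimal sketch of the difference being discussed, with hypothetical data (exact output depends on the pandas version, since this PR predates 0.13):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [10, 20, 30, 40]})

# as_index=False: the selected rows keep the original frame's index
print(df.groupby('A', as_index=False).head(2))

# as_index=True (the default): at the time, going through apply could prepend
# the group key, producing a MultiIndex of (group key, original index)
print(df.groupby('A', as_index=True).head(2))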
I was hoping I'd be able to access/create the MI easily/efficiently, and thought there would be an internal API for it. Even combining two Index/MIs I don't know how to do, other than zipping (see the sketch below)...
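For what it's worth, a sketch of the "zipping" approach for combining two flat Indexes into a MultiIndex (the names and values here are purely illustrative):

import pandas as pd

keys = pd.Index([1, 1, 2], name='A')   # e.g. the group label for each kept row
orig = pd.Index([0, 1, 3])             # e.g. the row's position in the original frame

# from_arrays builds the MultiIndex directly, avoiding an explicit Python-level zip
mi = pd.MultiIndex.from_arrays([keys, orig])
print(mi)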
I'm not sure the as_index thing makes sense there anyway, but it seems to be the current API for head/tail.
The same "bug" is in filter, i.e. it ignores the as_index. I have a way to do it, but it... inelegant... and SLOW tbh I think breaking the API might be better :s |
Well, I can do it, though it's not pretty. It's still significantly faster than before (but obviously slower than as_index=False). A slight change is that .head/.tail keeps the ordering of the original dataframe (rather than the order of the groups), which I think is preferable anyway. The logic for apply and as_index is actually a little fishy; I wonder if a refactor could fix it. Example:
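Something like the following illustrates the ordering change (illustrative data, not taken from the PR's tests):

import pandas as pd

# groups are interleaved on purpose
df = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [0, 1, 2, 3]})

# With this change, head(1) returns rows 0 and 1 in original-dataframe order,
# rather than regrouped as (all of group 1, then all of group 2).
print(df.groupby('A').head(1))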
imo, you never want to use as_index with head and tail anyway...
return bin_counts

counts = np.zeros(self.ngroups, dtype='int64')
for i, ind in enumerate(self.result_index):
    # result_index has the groups in the right order; self.indices (a dict)
    # maps each group key to the integer locations of its rows
    counts[i] = len(self.indices[ind])
Not sure how to iterate over this faster (result_index has the correct ordering; indices, being a dict, doesn't)...
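One possible vectorized alternative, assuming integer group labels are available from the grouper (a sketch only, not necessarily what the PR settled on):

import numpy as np

labels = np.array([0, 0, 1, 2, 2, 2])   # assumed: the group label of each row
ngroups = 3                              # assumed: number of groups

# bincount gives per-group row counts without a Python-level loop
counts = np.bincount(labels, minlength=ngroups).astype('int64')
print(counts)   # [2 1 3]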
Still, around 5+ times faster than the previous implementation.
Actually... in the vbench example this is way slower (30 vs 144 ms)... doh! Need to fix that up/revert.
This is all fixed up, although there's probably some more juice to get out of this in future... I added some tests for tail as there were none. I dislike 0.13?
This is some low-hanging fruit: significantly faster than master.
To give some numbers, the timing before the change is shown below the new one:
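(The timing setup for this kind of comparison might look like the sketch below; the dataset and sizes are hypothetical rather than the actual vbench case, and no numbers are reproduced here.)

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'key': np.random.randint(0, 1000, size=100000),
                   'value': np.random.randn(100000)})
g = df.groupby('key')

# timed under IPython:
# %timeit g.head(5)
# %timeit g.tail(5)
# %timeit g.size()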
...It's a bit messy to keep track of the as_index stuff (which you get when you apply), see below - am I missing some way to grab out the final index?