Fix order-preservation in cudf-polars groupby #16907
I have pointed this at 24.10, but have marked it as do-not-merge since I don't think this is a critical fix. If others agree, I can retarget to 24.12.
LGTM
@wence- You said a similar fix is needed in cudf.pandas? This adds a fair number of steps to the computation. Should we do this at the C++ level?
I am happy to merge this for 24.10 and optimize in 24.12, or just target 24.12.
Yes: #16908
It's one additional Note this is only done if the user requests
The optimisation that is plausibly worthwhile is that we know that both tables have distinct rows, so we could use the new distinct_hash_join code in libcudf. As you note, since this kind of thing is a somewhat common pattern, it might be worth implementing a libcudf primitive that, given two tables, returns the gather map that is the permutation from one table to the other. I'd need to check whether the join ordering code has the same invariants as the groupby ordering code.
After discussion, we're moving this to 24.12.
When we are requested to maintain order in groupby aggregations, we must post-process the result by computing a permutation between the wanted order (of the input keys) and the order returned by the groupby aggregation. To do this, we can perform a join between the two unique key tables. Previously, we assumed that the gather map returned by this join for the left (wanted order) table was the identity. However, this is not guaranteed. So, in addition to computing the match between the wanted key order and the key order we have, we must also apply the permutation between the left gather map order and the identity.
- Closes rapidsai#16893
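The reordering described above can be illustrated with a small, library-free sketch. It is not the cudf-polars implementation: `inner_join_maps` is a hypothetical stand-in for a hash join that deliberately returns its gather maps in shuffled order, and `order_preserving_permutation` is an illustrative name. The point is only that the right gather map must be placed according to the left gather map rather than used directly.

```python
import random


def inner_join_maps(left_keys, right_keys, seed=0):
    """Hypothetical stand-in for a hash join on two tables of distinct keys.

    Returns (left_map, right_map) such that left_keys[left_map[i]] equals
    right_keys[right_map[i]]. The pairs are shuffled on purpose, to mimic
    a join that makes no promise about output row order.
    """
    position = {key: j for j, key in enumerate(right_keys)}
    pairs = [(i, position[key]) for i, key in enumerate(left_keys) if key in position]
    random.Random(seed).shuffle(pairs)
    left_map = [i for i, _ in pairs]
    right_map = [j for _, j in pairs]
    return left_map, right_map


def order_preserving_permutation(wanted_keys, result_keys, seed=0):
    """Gather map that puts the groupby result into the wanted key order."""
    left_map, right_map = inner_join_maps(wanted_keys, result_keys, seed)
    # Assuming left_map is the identity would mean returning right_map here.
    # Instead, scatter each right index to the slot named by its left index.
    perm = [None] * len(left_map)
    for left_index, right_index in zip(left_map, right_map):
        perm[left_index] = right_index
    return perm


# Keys in first-appearance order of the input, and the (differently ordered)
# keys and aggregated values as a groupby might return them.
wanted = ["b", "a", "c"]
result_keys = ["a", "c", "b"]
result_vals = [10, 30, 20]

perm = order_preserving_permutation(wanted, result_keys)
reordered_keys = [result_keys[j] for j in perm]
reordered_vals = [result_vals[j] for j in perm]
```

Scattering by the left map is equivalent to sorting the right map by the left map; which of the two is cheaper on a GPU is outside the scope of this sketch.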
Done.
/merge