Make groupby transform-like op order match original data order #8720

Merged
merged 3 commits into from
Aug 4, 2021

Conversation

isVoid
Contributor

@isVoid isVoid commented Jul 12, 2021

Closes #8714

This PR makes transform-like ops return results whose row order matches that of the input. For example, with groupby.shift:

In [21]: df.head(8)
Out[21]:
   key  val1
0    1    70
1    1    86
2    0    18
3    1    91
4    1    74
5    1    97
6    0    43
7    0    48

In [22]: df.groupby('key').shift(1).head(8)
Out[22]:
   val1
0  <NA>
1    70
2  <NA>
3    86
4    91
5    74
6    18
7    43

This affects groupby.scan and groupby.shift.
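
The ordering contract shown above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not cuDF's implementation; `grouped_shift` is a hypothetical helper written only for this illustration. It shifts values within each group and writes every result back to the row it came from, so the output lines up with the input.

```python
from collections import defaultdict


def grouped_shift(keys, values, periods=1, fill_value=None):
    """Shift values within each group; the result follows the input row order.

    Hypothetical helper for illustration only (not part of cuDF or pandas).
    """
    # Collect the original row positions of each group, in input order.
    rows_by_key = defaultdict(list)
    for row, key in enumerate(keys):
        rows_by_key[key].append(row)

    # Pre-fill with the fill value; rows with no source value keep it.
    out = [fill_value] * len(values)
    for rows in rows_by_key.values():
        for i, row in enumerate(rows):
            src = i - periods  # position within the group to read from
            if 0 <= src < len(rows):
                out[row] = values[rows[src]]
    return out


# Same data as df.head(8) in the PR description.
keys = [1, 1, 0, 1, 1, 1, 0, 0]
val1 = [70, 86, 18, 91, 74, 97, 43, 48]

shifted = grouped_shift(keys, val1, periods=1)
print(shifted)  # [None, 70, None, 86, 91, 74, 18, 43]
```

Here `None` plays the role of `<NA>`, and the values match the `Out[22]` column row for row.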

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 12, 2021
@isVoid isVoid added feature request New feature or request breaking Breaking change labels Jul 12, 2021
@codecov

codecov bot commented Jul 13, 2021

Codecov Report

Merging #8720 (dccabeb) into branch-21.10 (18f7c01) will decrease coverage by 0.06%.
The diff coverage is n/a.

❗ Current head dccabeb differs from pull request most recent head d827744. Consider uploading reports for the commit d827744 to get more accurate results

@@               Coverage Diff                @@
##           branch-21.10    #8720      +/-   ##
================================================
- Coverage         10.67%   10.61%   -0.07%     
================================================
  Files               110      116       +6     
  Lines             18271    19003     +732     
================================================
+ Hits               1951     2017      +66     
- Misses            16320    16986     +666     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/lists.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/numerical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/string.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/struct.py 0.00% <ø> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <ø> (ø)
... and 75 more

@isVoid isVoid marked this pull request as ready for review July 14, 2021 01:40
@isVoid isVoid requested a review from a team as a code owner July 14, 2021 01:40
Table(value_columns._data), periods, fill_value
)
result = self.obj.__class__._from_table(result)
result = self._mimic_pandas_order(result)
Contributor

@shwina shwina Jul 15, 2021

This does a multi-column sort, which we can avoid by appending a column 0...N to the dataframe before the groupby and then sorting by that single column later.

Contributor Author

@isVoid isVoid Jul 15, 2021

Caught up offline; this is certainly a good optimization over our current approach. To achieve it, we would need libcudf to perform a "no-op" on the sequence column. However, a "no-op" wouldn't fit in our current libcudf aggregation framework, because aggregations are required to be binary (reduction) ops.

We discussed alternatives but decided it's best to merge what we have so far and raise an issue to track the optimization ideas, so that more people can join the discussion.
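
For readers following this thread, the single-column reordering idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the cuDF or libcudf code path: `grouped_cumsum_restore_order` is a hypothetical name, and carrying the row-id column through the grouped operation is trivial here, whereas in libcudf it is exactly the "no-op" aggregation that the thread says does not exist yet.

```python
_NO_KEY = object()  # sentinel that never equals a real group key


def grouped_cumsum_restore_order(keys, values):
    """Grouped cumulative sum whose result follows the input row order.

    Hypothetical helper for illustration only.
    """
    # Tag each row with its original position (the appended "0...N" column).
    tagged = [
        (key, row_id, value)
        for row_id, (key, value) in enumerate(zip(keys, values))
    ]

    # Group-sorted layout, as a groupby would produce. Python's sort is
    # stable, so rows keep their input order within each group.
    tagged.sort(key=lambda t: t[0])

    # Run the scan per group, carrying row_id along unchanged.
    scanned = []
    current, running = _NO_KEY, 0
    for key, row_id, value in tagged:
        if key != current:
            current, running = key, 0
        running += value
        scanned.append((row_id, running))

    # Restore the input order by sorting on the single row_id column.
    scanned.sort(key=lambda t: t[0])
    return [total for _, total in scanned]


keys = [1, 1, 0, 1, 1, 1, 0, 0]
val1 = [70, 86, 18, 91, 74, 97, 43, 48]

cumsums = grouped_cumsum_restore_order(keys, val1)
print(cumsums)  # [70, 156, 18, 247, 321, 418, 61, 109]
```

The final sort touches only the integer row-id column, regardless of how many key columns the groupby used, which is the saving the suggestion is after.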

@galipremsagar
Contributor

galipremsagar commented Jul 20, 2021

@beckernick The issue (#8714) this PR fixes was scoped to 21.10. Was that intentional? If so, I think we need to retarget this PR to 21.10, as it currently targets 21.08.

@harrism harrism changed the base branch from branch-21.08 to branch-21.10 July 21, 2021 22:16
@harrism
Member

harrism commented Jul 21, 2021

Going to go ahead and move it.

@isVoid
Contributor Author

isVoid commented Jul 27, 2021

rerun tests

Contributor

@shwina shwina left a comment

LGTM pending merge conflicts

@isVoid
Contributor Author

isVoid commented Aug 4, 2021

@gpucibot merge

Labels
breaking Breaking change feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Groupby scans and segmented shift operations do not preserve ordering with original data
5 participants