Parallelizable DataFrame/Series mean #22174

Merged: 1 commit merged on Jul 10, 2022

TheNeuralBit
Member

@TheNeuralBit TheNeuralBit commented Jul 6, 2022

Fixes #22171

This adds a parallelizable custom implementation of DeferredSeries.mean, which computes the result as sum() / count(). In addition:

  • DeferredDataFrame leverages this implementation through _agg_method
  • Updates the tests that previously verified that mean is not parallelizable, and adds new tests for mean(skipna=False).
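
To illustrate why a sum()/count() formulation parallelizes, here is a minimal pure-Python sketch. It is not the code from this PR: the function names `partial_sum_count` and `combine_mean` are hypothetical, and plain lists stand in for the partitions of a deferred series. Each partition yields an independent (sum, count) pair, and the pairs are combined at the end.

```python
import math
from typing import Iterable, List, Tuple


def partial_sum_count(partition: Iterable[float],
                      skipna: bool = True) -> Tuple[float, int]:
    """Compute (sum, count) for one partition, independently of the others."""
    total, count = 0.0, 0
    for value in partition:
        if skipna and math.isnan(value):
            continue  # ignore missing values entirely
        # With skipna=False a NaN is added here and poisons the sum.
        total += value
        count += 1
    return total, count


def combine_mean(partials: List[Tuple[float, int]]) -> float:
    """Merge per-partition results; only this step needs all partials."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else math.nan


# Hypothetical data split across three partitions.
partitions = [[1.0, 2.0], [3.0, math.nan], [6.0]]

mean_skip = combine_mean([partial_sum_count(p) for p in partitions])
mean_keep = combine_mean(
    [partial_sum_count(p, skipna=False) for p in partitions])
```

In this sketch `mean_skip` ignores the missing value, while `mean_keep` is NaN because a single NaN propagates through the sum, which matches the usual pandas behavior for mean(skipna=False).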

@github-actions github-actions bot added the python label Jul 6, 2022
@codecov

codecov bot commented Jul 6, 2022

Codecov Report

Merging #22174 (a21bda8) into master (a21bda8) will not change coverage.
The diff coverage is n/a.

❗ Current head a21bda8 differs from pull request most recent head 24bf066. Consider uploading reports for the commit 24bf066 to get more accurate results

@@           Coverage Diff           @@
##           master   #22174   +/-   ##
=======================================
  Coverage   74.21%   74.21%           
=======================================
  Files         702      702           
  Lines       92829    92829           
=======================================
  Hits        68892    68892           
  Misses      22670    22670           
  Partials     1267     1267           
Flag Coverage Δ
go 51.50% <0.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a21bda8...24bf066. Read the comment docs.

@github-actions
Contributor

github-actions bot commented Jul 6, 2022

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@TheNeuralBit
Member Author

Run Python PreCommit

@TheNeuralBit TheNeuralBit changed the title Parallelizable mean Parallelizable DataFrame/Series mean Jul 6, 2022
@TheNeuralBit
Member Author

Run Python PreCommit

Contributor

@AnandInguva AnandInguva left a comment

Looks good.

@frame_base.with_docs_from(pd.Series)
@frame_base.args_to_kwargs(pd.Series)
@frame_base.populate_defaults(pd.Series)
def mean(self, skipna, **kwargs):
Contributor

Can we make skipna a keyword argument?

Suggested change
- def mean(self, skipna, **kwargs):
+ def mean(self, skipna=True, **kwargs):

Member Author

That's actually what the decorators do - populate_defaults pulls the default values from the pandas implementation of the function, so we don't have to duplicate them here.
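
As a rough illustration of the idea described above, the following sketch shows how a decorator could copy defaults from a reference implementation. This is not Beam's actual `frame_base.populate_defaults`: `populate_defaults_sketch`, `ReferenceSeries`, and `DeferredSeriesSketch` are hypothetical names, `ReferenceSeries.mean` stands in for the pandas method, and the sketch assumes callers pass arguments by keyword.

```python
import functools
import inspect


def populate_defaults_sketch(reference_func):
    """Hypothetical decorator: borrow default values from reference_func."""
    defaults = {
        name: param.default
        for name, param in inspect.signature(reference_func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }

    def decorator(func):
        # Only fill in parameters that the wrapped function declares by name.
        wanted = [
            name for name in inspect.signature(func).parameters
            if name in defaults
        ]

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumption: arguments arrive as keywords, so setdefault is safe.
            for name in wanted:
                kwargs.setdefault(name, defaults[name])
            return func(*args, **kwargs)

        return wrapper

    return decorator


class ReferenceSeries:
    """Stand-in for the reference class whose defaults are reused."""

    def mean(self, axis=0, skipna=True, numeric_only=False):
        raise NotImplementedError


class DeferredSeriesSketch:
    @populate_defaults_sketch(ReferenceSeries.mean)
    def mean(self, skipna, **kwargs):
        # skipna has no default here; the decorator supplies it when omitted.
        return {"skipna": skipna, **kwargs}


result_default = DeferredSeriesSketch().mean()
result_override = DeferredSeriesSketch().mean(skipna=False)
```

With a decorator of this kind, the wrapped method can declare `skipna` without a default and still be called without it, so the default lives in one place only.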

@TheNeuralBit TheNeuralBit merged commit 262f2b7 into apache:master Jul 10, 2022
konstantinurysov pushed a commit to akvelon/beam that referenced this pull request Jul 14, 2022
lostluck pushed a commit to lostluck/beam that referenced this pull request Aug 26, 2022
Development

Successfully merging this pull request may close these issues.

[Feature Request]: parallelizable df.mean
2 participants