Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modernize and simplify iris.analysis._Groupby #5015

Merged
merged 7 commits into from
Apr 12, 2023

Conversation

bouweandela
Copy link
Member

@bouweandela bouweandela commented Oct 6, 2022

🚀 Pull Request

Description

Modernize and simplify iris.analysis._Groupby. This started out from the idea discussed in #4956.

Here is an overview of some of the improvements here:

  • Replace iris.coords._GroupIterator by the standard library function itertools.groupby
  • Fix the iris.analysis._Group.__repr__ method, it seems currently broken because it uses an attribute that the class doesn't have (anymore). I suspect that this code is not covered by tests Consider using a code coverage service #5014
  • Add type hints
  • Simplify the code by removing the special treatment of groupby_slices if it had a length of 1
  • Simplify the code by removing the unused dictionary keys stored in _Groupby._slices_by_key and instead using a list called _Groupby._groupby_indices
  • Simplfiy the code of _Groupby.group by returning an empty list if _Groupby._groupby_coords was empty. Previously there was a special if statement to make it return None, but this behaviour was not used anywhere in the code.

Consult Iris pull request check list

@bouweandela bouweandela marked this pull request as ready for review October 7, 2022 09:12
@bjlittle bjlittle removed their assignment Feb 22, 2023
@trexfeathers trexfeathers self-requested a review February 22, 2023 11:27
@codecov
Copy link

codecov bot commented Feb 23, 2023

Codecov Report

Patch coverage: 95.00% and project coverage change: +0.01 🎉

Comparison is base (6b25c3a) 89.28% compared to head (a70aea2) 89.29%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5015      +/-   ##
==========================================
+ Coverage   89.28%   89.29%   +0.01%     
==========================================
  Files          88       88              
  Lines       22269    22228      -41     
  Branches     4870     4865       -5     
==========================================
- Hits        19882    19849      -33     
+ Misses       1641     1634       -7     
+ Partials      746      745       -1     
Impacted Files Coverage Δ
lib/iris/analysis/__init__.py 92.44% <94.91%> (+0.54%) ⬆️
lib/iris/coords.py 93.27% <100.00%> (+0.07%) ⬆️

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@trexfeathers trexfeathers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @bouweandela, and sorry for the delay.

I'm all for anything maintains functionality while doing this:

image

❗ This deserves a What's New entry.

Responses to bullet points

  • Replace iris.coords._GroupIterator by the standard library function itertools.groupby

👍 From debugging through the old and new version of group(), I can see that this works in an equivalent way.

  • Fix the iris.analysis._Group.repr method, it seems currently broken because it uses an attribute that the class doesn't have (anymore). I suspect that this code is not covered by tests Consider using a code coverage service #5014

👍 A definite improvement! I've left some comments.

  • Add type hints

👍 Many thanks for doing this 🙂 I have checked for consistency.

  • Simplify the code by removing the special treatment of groupby_slices if it had a length of 1

👍 I assume you're referring to _slice_merge()? I agree with the decision - see the relevant comment.

  • Simplify the code by removing the unused dictionary keys stored in _Groupby._slices_by_key and instead using a list called _Groupby._groupby_indices

👍

  • Replace the use of OrderedDict by dict, since dicts are ordered since Python 3.7

❓ I think this point is redundant thanks to making the _Groupby._groupby_indices list?

  • Simplfiy the code of _Groupby.group by returning an empty list if _Groupby._groupby_coords was empty. Previously there was a special if statement to make it return None, but this behaviour was not used anywhere in the code.

👍 I think the upstream code protects against there being no coordinates, and the downstream code wouldn't be able to handle None if that was returned.

lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
lib/iris/analysis/__init__.py Outdated Show resolved Hide resolved
lib/iris/analysis/__init__.py Show resolved Hide resolved
@trexfeathers
Copy link
Contributor

@bouweandela any comment on my comments?

@bouweandela
Copy link
Member Author

Thanks for reviewing @trexfeathers! I ran out of time for today, will continue tomorrow.

@bouweandela
Copy link
Member Author

@trexfeathers This is ready for another round of reviewing. Could you have a look, please?

Copy link
Contributor

@trexfeathers trexfeathers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @bouweandela 🍻

lib/iris/analysis/__init__.py Show resolved Hide resolved
@trexfeathers trexfeathers merged commit 5585684 into SciTools:main Apr 12, 2023
@bouweandela bouweandela deleted the update-groupby branch April 12, 2023 12:50
tkknight added a commit to tkknight/iris that referenced this pull request Apr 13, 2023
* upstream/main: (59 commits)
  Updated environment lockfiles (SciTools#5211)
  update ci locks location (SciTools#5228)
  Fixes to _discontiguity_in_bounds (attempt 2) (SciTools#4975)
  Finalises Lazy Data documentation (SciTools#5137)
  Modernize and simplify iris.analysis._Groupby (SciTools#5015)
  clarity on whatsnew entry contributors (SciTools#5240)
  Handle derived coordinates correctly in `concatenate` (SciTools#5096)
  Use real array for data of of small netCDF variables. (SciTools#5229)
  Bump scitools/workflows from 2023.04.1 to 2023.04.2 (SciTools#5236)
  fixing whatsnew entry
  remove results creation commit from blame
  configure codecov
  adding a whatsnew entry
  Replacing numpy legacy printing with array2string and remaking results for dependent tests
  Adding a whatsnew entry for 5224 (SciTools#5234)
  Cf cell method (SciTools#5224)
  Bump scitools/workflows from 2023.03.3 to 2023.04.1 (SciTools#5231)
  [pre-commit.ci] pre-commit autoupdate (SciTools#5230)
  Bump scitools/workflows from 2023.03.2 to 2023.03.3 (SciTools#5227)
  raise dask min pin (SciTools#5225)
  ...
pp-mo pushed a commit to pp-mo/iris that referenced this pull request Apr 13, 2023
* Modernize and simplify _Groupby

* Rename variable to improve readability

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Add a whatsnew entry

* Add a type hint to _add_shared_coord

* Add a test for iris.analysis._Groupby.__repr__

---------

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>
tkknight added a commit to tkknight/iris that referenced this pull request Apr 18, 2023
* upstream/main: (29 commits)
  review actions
  update .git-blame-ignore-revs
  adopt codespell
  Adopt sphinx design (SciTools#5127)
  Bump scitools/workflows from 2023.04.2 to 2023.04.3 (SciTools#5253)
  refresh manual pypi publish instructions (SciTools#5252)
  Updated environment lockfiles (SciTools#5250)
  removed bugfix section
  Make bm_runner location agnostic and include debugging. (SciTools#5247)
  Restore latest Whats New files.
  SciTools#5220 typo github.repository_owner. (SciTools#5248)
  Whats new updates for v3.5.0rc0. (SciTools#5246)
  libnetcdf <4.9 pin (SciTools#5242)
  update cf standard units (SciTools#5244)
  Updated environment lockfiles (SciTools#5211)
  update ci locks location (SciTools#5228)
  Fixes to _discontiguity_in_bounds (attempt 2) (SciTools#4975)
  Finalises Lazy Data documentation (SciTools#5137)
  Modernize and simplify iris.analysis._Groupby (SciTools#5015)
  clarity on whatsnew entry contributors (SciTools#5240)
  ...
lbdreyer pushed a commit that referenced this pull request Apr 21, 2023
* Basic functional lazy saving.

* Simplify function signature which upsets Sphinx.

* Non-lazy saves return nothing.

* Now fixed to enable use with process/distributed scheduling.

* Remove dask.utils.SerializableLock, which I think was a mistake.

* Make DefferedSaveWrapper use _thread_safe_nc.

* Fixes for non-lazy save.

* Avoid saver error when no deferred writes.

* Reorganise locking code, ready for shareable locks.

* Remove optional usage of 'filelock' for lazy saves.

* Document dask-specific locking; implement differently for threads or distributed schedulers.

* Minor fix for unit-tests.

* Pin libnetcdf to avoid problems -- see #5187.

* Minor test fix.

* Move DeferredSaveWrapper into _thread_safe_nc; replicate the NetCDFDataProxy fix; use one lock per Saver; add extra up-scaled test

* Update lib/iris/fileformats/netcdf/saver.py

Co-authored-by: Bouwe Andela <bouweandela@gmail.com>

* Update lib/iris/fileformats/netcdf/_dask_locks.py

Co-authored-by: Bouwe Andela <bouweandela@gmail.com>

* Update lib/iris/fileformats/netcdf/saver.py

Co-authored-by: Bouwe Andela <bouweandela@gmail.com>

* Small rename + reformat.

* Remove Saver lazy option; all lazy saves are delayed; factor out fillvalue checks and make them delayable.

* Repurposed 'test__FillValueMaskCheckAndStoreTarget' to 'test__data_fillvalue_check', since old class is gone.

* Disable (temporary) saver debug printouts.

* Fix test problems; Saver automatically completes to preserve existing direct usage (which is public API).

* Fix docstring error.

* Fix spurious error in old saver test.

* Fix Saver docstring.

* More robust exit for NetCDFWriteProxy operation.

* Fix doctests by making the Saver example functional.

* Improve docstrings; unify terminology; simplify non-lazy save call.

* Moved netcdf cell-method handling into nc_load_rules.helpers, and various tests into more specific test folders.

* Fix lockfiles and Makefile process.

* Add unit tests for routine _fillvalue_report().

* Remove debug-only code.

* Added tests for what the save function does with the 'compute' keyword.

* Fix mock-specific problems, small tidy.

* Restructure hierarchy of tests.unit.fileformats.netcdf

* Tidy test docstrings.

* Correct test import.

* Avoid incorrect checking of byte data, and a numpy deprecation warning.

* Alter parameter names to make test reports clearer.

* Test basic behaviour of _lazy_stream_data; make 'Saver._delayed_writes' private.

* Add integration tests, and distributed dependency.

* Docstring fixes.

* Documentation section and whatsnew entry.

* Various fixes to whatsnew, docstrings and docs.

* Minor review changes, fix doctest.

* Arrange tests + results to organise by package-name alone.

* Review changes.

* Review changes.

* Enhance tests + debug.

* Support scheduler type 'single-threaded'; allow retries on delayed-save test.

* Improve test.

* Adding a whatsnew entry for 5224 (#5234)

* Adding a whatsnew entry explaining 5224

* Fixing link and format error

* Replacing numpy legacy printing with array2string and remaking results for dependent tests

* adding a whatsnew entry

* configure codecov

* remove results creation commit from blame

* fixing whatsnew entry

* Bump scitools/workflows from 2023.04.1 to 2023.04.2 (#5236)

Bumps [scitools/workflows](https://github.com/scitools/workflows) from 2023.04.1 to 2023.04.2.
- [Release notes](https://github.com/scitools/workflows/releases)
- [Commits](SciTools/workflows@2023.04.1...2023.04.2)

---
updated-dependencies:
- dependency-name: scitools/workflows
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Use real array for data of of small netCDF variables. (#5229)

* Small netCDF variable data is real.

* Various test fixes.

* More test fixing.

* Fix printout in Mesh documentation.

* Whatsnew + doctests fix.

* Tweak whatsnew.

* Handle derived coordinates correctly in `concatenate` (#5096)

* First working prototype of concatenate that handels derived coordinates correctly

* Added checks for derived coord metadata during concatenation

* Added tests

* Fixed defaults

* Added what's new entry

* Optimized test coverage

* clarity on whatsnew entry contributors (#5240)

* Modernize and simplify iris.analysis._Groupby (#5015)

* Modernize and simplify _Groupby

* Rename variable to improve readability

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Add a whatsnew entry

* Add a type hint to _add_shared_coord

* Add a test for iris.analysis._Groupby.__repr__

---------

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Finalises Lazy Data documentation (#5137)

* cube and io lazy data notes added

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added comments within analysis, as well as palette and iterate, and what's new

* fixed docstrings as requested in @trexfeathers review

* reverted cube.py for time being

* fixed flake8 issue

* Lazy data second batch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated lastest what'snew

* I almost hope this wasn't the fix, I'm such a moron

* adressed review changes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bill Little <bill.james.little@gmail.com>

* Fixes to _discontiguity_in_bounds (attempt 2) (#4975)

* update ci locks location (#5228)

* Updated environment lockfiles (#5211)

Co-authored-by: Lockfile bot <noreply@github.com>

* Increase retries.

* Change debug to show which elements failed.

* update cf standard units (#5244)

* update cf standard units

* added whatsnew entry

* Correct pull number

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

---------

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* libnetcdf <4.9 pin (#5242)

* Pin libnetcdf<4.9 and update lock files.

* What's New entry.

* libnetcdf not available on PyPI.

* Fix for Pandas v2.0.

* Fix for Pandas v2.0.

* Avoid possible same-file crossover between tests.

* Ensure all-different testfiles; load all vars lazy.

* Revert changes to testing framework.

* Remove repeated line from requirements/py*.yml (?merge error), and re-fix lockfiles.

* Revert some more debug changes.

* Reorganise test for better code clarity.

* Use public 'Dataset.isopen()' instead of '._isopen'.

* Create output files in unique temporary directories.

* Tests for fileformats.netcdf._dask_locks.

* Fix attribution names.

* Fixed new py311 lockfile.

* Fix typos spotted by codespell.

* Add distributed test dep for python 3.11

* Fix lockfile for python 3.11

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bouwe Andela <bouweandela@gmail.com>
Co-authored-by: Henry Wright <84939917+HGWright@users.noreply.github.com>
Co-authored-by: Henry Wright <henrywright@sky.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Manuel Schlund <32543114+schlunma@users.noreply.github.com>
Co-authored-by: Bill Little <bill.james.little@gmail.com>
Co-authored-by: Bouwe Andela <b.andela@esciencecenter.nl>
Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>
Co-authored-by: Elias <110238618+ESadek-MO@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: stephenworsley <49274989+stephenworsley@users.noreply.github.com>
Co-authored-by: scitools-ci[bot] <107775138+scitools-ci[bot]@users.noreply.github.com>
Co-authored-by: Lockfile bot <noreply@github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants