Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor significant portions of the codebase and refactor some land-atm handlers #213

Merged
merged 6 commits into from
Nov 10, 2023

Conversation

tomvothecoder
Copy link
Collaborator

@tomvothecoder tomvothecoder commented Sep 15, 2023

Overview

This PR was originally supposed only address #115, but I found the codebase difficult to work with so I decided to refactor it.

Regression Testing

TLDR: All of the datasets produced by this branch align with the master branch. The max relative differences between 5/109 datasets are insignificant and can be attributed to floating point rounding error produced by Xarray vs. CDAT, which is fine.


Setup:

  1. Generated branch datasets using scripts/branch-regression-testing/115-cdat-refactor-test/115-end-to-end-script.sh to compare against master branch datasets.
  2. Compared datasets between branches using scripts/branch-regression/115-comparison-notebook.ipynb

Results:

  • 104/109 are identical
  • 4/109 are close and meet threshold of rtol=1e-7 (cl clw, cli, pfull)
  • 1/109 is not close (mrso)
    • mrso is fine (max relative difference of 6.726486e-07)
      mrso
      -----
      MEAN
          dev: 998.3673095703125, master: 998.3673095703125
      SUM
          dev: 277705856.0, master: 277705856.0
      COMPARISON
      
      Not equal to tolerance rtol=1e-07, atol=1e-07
      
      Mismatched elements: 25556 / 777600 (3.29%)
      Max absolute difference: 0.00073242
      Max relative difference: 6.726486e-07
       x: array([[[nan, nan, nan, ..., nan, nan, nan],
              [nan, nan, nan, ..., nan, nan, nan],
              [nan, nan, nan, ..., nan, nan, nan],...
       y: array([[[nan, nan, nan, ..., nan, nan, nan],
              [nan, nan, nan, ..., nan, nan, nan],
              [nan, nan, nan, ..., nan, nan, nan],...

Summary of Changes

Core Changes

  1. Refactor lib.py (now deleted)
    • Closes Refactor lib.py utilities to reduce code duplication and excessive nesting of function calls #190
    • Replace run_parallel() and run_serial() with __main__.py -> E3SMtoCMIP._run_parallel() and
      E3SMtoCMIP._run_serial()
      • Updated var name filepaths to vars_to_filepaths for clarity (its a dict mapping var key to list of string paths)
      • Update args passed to handler method, remove unnecessary args
      • Add try and except statement for submitting pool jobs to maintain compatibility with MPAS variable handlers, which use different handler method arguments
    • Replace handle_variables(), get_dimension_data() and load_axis() with VarHandler.cmorize()
    • Delete handle_simple() -- will be re-implemented from scratch in Fix --simple mode not working #130
  2. Update handler.py
    • Update VarHandler.cmorize()
      • Refactored significantly -- extracted logic into smaller, maintainable private methods
      • Replaced data dictionary storing xr.DataArrays with ds (xr.Dataset object)
      • Distinguish CMOR usage with "cmor" string in function names and python variables
      • Add support for hybrid sigma variables: pfull, phalf
    • Add VarHandler._cmor_write()
  3. Update handlers.yaml
  4. Update _formulas.py
    • Update all function argument types and return types from np.ndarray to xr.DataArray
    • Add formula functions for pfull and phalf
    • Add convert_units() function, which handles 1-to-1 unit conversions -- replaces default.default_handlers.write_data()
  5. Refactor default.py (now deleted)
    • Replaced by VarHandler.cmorize() and _formulas.py.convert_units()

Clean Up Changes

  1. Remove cdms2 and cdutil from dependencies in dev.yml and ci.yml7
  2. Clean up legacy code in clisccp.py
  3. Add Makefile for easy access to commonly used commands (e.g., building and installing package)

@tomvothecoder tomvothecoder added the enhancement New feature or request label Sep 15, 2023
@tomvothecoder tomvothecoder self-assigned this Sep 15, 2023
@tomvothecoder tomvothecoder changed the title Add support for hybrid variables (pfull and phalf) in the VarHandler class Add support for phalf and pfull in the VarHandler class and refactor lib.py. Sep 20, 2023
@tomvothecoder tomvothecoder changed the title Add support for phalf and pfull in the VarHandler class and refactor lib.py. Add support for phalf and pfull in the VarHandler class and refactor lib.py Sep 20, 2023
@tomvothecoder tomvothecoder changed the title Add support for phalf and pfull in the VarHandler class and refactor lib.py Refactor lib.py functions and add support for pfull/phalf variable handlers Sep 21, 2023
@tomvothecoder tomvothecoder changed the title Refactor lib.py functions and add support for pfull/phalf variable handlers Refactor significant portions of the codebase and add support for pfull/phalf variable handlers Sep 21, 2023
@tomvothecoder tomvothecoder changed the title Refactor significant portions of the codebase and add support for pfull/phalf variable handlers Refactor significant portions of the codebase and refactor pfull, phalf, clcalipso, and clisccp variable handlers Sep 26, 2023
@tomvothecoder tomvothecoder changed the title Refactor significant portions of the codebase and refactor pfull, phalf, clcalipso, and clisccp variable handlers Refactor significant portions of the codebase and refactor time-invariant land-atm handlers Oct 30, 2023
@tomvothecoder tomvothecoder force-pushed the refactor/115-cdat branch 2 times, most recently from 19c8092 to 02fd9a4 Compare October 30, 2023 17:31
@tomvothecoder
Copy link
Collaborator Author

Moved the section below out of the description to make the description shorter.

Technical Debt in this Repo

I've noticed many Python anti-patterns and "not good" (bad) coding practices used in this repository. This has resulted in a lot of technical debt for new developers of this repo.

  • A lot of copy-pasting code -- e.g., legacy Python modules for variable handlers, same logic in handle_variables()/handle_simple()/default_handler()
  • Functions using a large number of arguments, especially keyword arguments with default values (example) (and no API docstrings or type annotations for any of these functions).
  • Using **kwargs to pass around function arguments (example) -- really fragile and hard to debug
  • Extremely large functions that should be broken up into smaller, maintainable functions
  • No unit testing excepts the ones I implemented :(
  • Functions and data structures are unnecessarily nested in one another (example, example)
  • Excessive use of dictionaries to store handler information, variable and axes information, etc. (example)

@tomvothecoder tomvothecoder marked this pull request as ready for review October 31, 2023 22:34
@tomvothecoder tomvothecoder changed the title Refactor significant portions of the codebase and refactor time-invariant land-atm handlers Refactor significant portions of the codebase and refactor some land-atm handlers Nov 1, 2023
1. Refactor `lib.py` (now deleted)
   - Closes #190
   - Replace `run_parallel()` and `run_serial()` with `__main__.py` -> `E3SMtoCMIP._run_parallel()` and
`E3SMtoCMIP._run_serial()`
     - Updated var name `filepaths` to `vars_to_filepaths` for clarity (its a dict mapping var key to list of string paths)
     - Update args passed to handler method, remove unnecessary args
     - Add `try` and `except` statement for submitting `pool` jobs to maintain compatibility with MPAS variable handlers, which use different handler method arguments
   - Replace `handle_variables()`, `get_dimension_data()` and `load_axis()` with `VarHandler.cmorize()`
   - Delete `handle_simple()` -- _will be re-implemented from scratch in #130_
3. Update `handler.py`
   - Update `VarHandler.cmorize()`
      - Refactored significantly -- extracted logic into smaller, maintainable private methods
      - Replaced `data` dictionary storing `xr.DataArrays`  with `ds` (`xr.Dataset` object)
	  - Distinguish CMOR usage with "cmor" string in function names and python variables
	  - Add support for hybrid sigma variables: `pfull`, `phalf`
   - Add `VarHandler._cmor_write()`
     - Add support for CMORizing fixed-time variables
       - Separate PR for #217
     - If time dimension exists, CMORize the variable with all time and time bound values with a single call to `cmor.write()` instead of looping over each time value index and CMORizing each slice -- **this should improve performance and removes the `tqdm` progress bar.**
4. Update `handlers.yaml`
    - Add `phalf` and `pfull` entries
      - Closes #115
      - Delete `phalf.py` and `pfull.py`
    - Add `clcalispo` entry
      - Closes #218
      - Delete `clcalipso.py`
5. Update `_formulas.py`
   - Update all function argument types and return types from `np.ndarray` to `xr.DataArray`
   - Add formula functions for `pfull` and `phalf`
   - Add `convert_units()` function, which handles 1-to-1 unit conversions -- replaces `default.default_handlers.write_data()`
6. Refactor `default.py` (now deleted)
   - Replaced by `VarHandler.cmorize()` and `_formulas.py.convert_units()`

7. Remove `cdms2` and `cdutil` from dependencies in `dev.yml` and `ci.yml`7
8.  Clean up legacy code in `clisccp.py`
    - Separate PR for #218
9. Add `Makefile` for easy access to commonly used commands (e.g., building and installing package)
@tomvothecoder
Copy link
Collaborator Author

Hi @chengzhuzhang, this PR is finally ready for review! I performed thorough regression testing and everything checks out.

The PR description includes the core changes that you might want to focus on. Thanks!

Comment on lines 438 to 440
def _reshape_single_time_bnd(self, ds):
# TODO: Reshape time bounds if it exists and has a length of 1: https://github.com/E3SM-Project/e3sm_to_cmip/blob/8c818fcb6d51cc0555f62b8058eee539b69a9579/e3sm_to_cmip/lib.py#L674-L678C42
pass
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this TODO just in case it is needed. I didn't run into any issues where time bounds have a length of 1 and need to be reshaped like the old CDAT code linked in the comment.

Copy link
Collaborator Author

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some last self-review

e3sm_to_cmip/cmor_handlers/handler.py Outdated Show resolved Hide resolved
conda-env/ci.yml Outdated Show resolved Hide resolved
@chengzhuzhang chengzhuzhang self-requested a review November 9, 2023 21:00
name = "ps"

# NOTE: This maintains the legacy name from pfull.py and phalf.py. I'm
# not sure why this is done and what the difference is with "ps".
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the note. I don't have a good idea of this either.

Copy link
Collaborator

@chengzhuzhang chengzhuzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomvothecoder The PR looks good to me. Thanks for bringing in these nice changes and carefully tested the changes. The code base is much cleaner now.

@tomvothecoder
Copy link
Collaborator Author

Thanks @chengzhuzhang! I will merge this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
2 participants