Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring of neon_site into tower_site and neon_site #2363

Merged
merged 13 commits into from
Mar 19, 2024

Conversation

TeaganKing
Copy link
Contributor

@TeaganKing TeaganKing commented Feb 13, 2024

Description of changes

This PR will create a TowerSite class which defines common functionalities that are in both NeonSite and Plumber2Site classes. With this structure, we will be able to implement a Plumber2Site class parallel to the NeonSite class that utilizes structure and functionalities in the parent classes.

Contributors other than yourself, if any:

@ekluzek
@adrifoster

CTSM Issues Fixed (include github issue #):

This will address part of #1487
Some work on #2405

Are answers expected to change (and if so in what way)?

No, this PR should be BFB.

Any User Interface Changes (namelist or namelist defaults changes)?

No

Note: I updated the branch I was using, so this is a continuation of the work done in #2344

@TeaganKing TeaganKing added enhancement new capability or improved behavior of existing capability PR status: work in progress bfb bit-for-bit labels Feb 13, 2024
@TeaganKing TeaganKing self-assigned this Feb 13, 2024
@samsrabin samsrabin changed the base branch from master to b4b-dev February 13, 2024 22:13
@samsrabin
Copy link
Collaborator

Just a heads up that I've moved this to point at the new b4b-dev branch instead of master. This should allow it to avoid backlog and get merged sooner once it's ready.

@TeaganKing
Copy link
Contributor Author

TeaganKing commented Feb 20, 2024

@ekluzek @adrifoster per our conversation last week, I'm adding you both as reviewers! Note that some changes may be made to pull out NEON-specific or PLUMBER-specific features once the PLUMBER class creation is underway.

@TeaganKing TeaganKing changed the title [WIP] Refactoring of neon_site into tower_site and neon_site Refactoring of neon_site into tower_site and neon_site Feb 20, 2024
@TeaganKing TeaganKing marked this pull request as ready for review February 22, 2024 19:13
Copy link
Collaborator

@ekluzek ekluzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool this is great. Nice start here!

I have a few questions we'll go over in a bit. I'd also like to know how the pylint does with this? And is:

cd python
make all

Clean?

Some of the methods are long and could maybe be broken up to be shorter and also to make them unit-testable. That could also be a future refactoring. I added some notes about that.

I also added some notes about things that look like they could be unit-testable. They might already be, but wanted to ask.

It'll be good to go over this in a bit...

bld/namelist_files/namelist_defaults_ctsm.xml Outdated Show resolved Hide resolved
cime_config/buildnml Show resolved Hide resolved
python/ctsm/site_and_regional/neon_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/neon_site.py Outdated Show resolved Hide resolved
python/ctsm/site_and_regional/neon_site.py Outdated Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
@TeaganKing
Copy link
Contributor Author

There are some tests for modifying user namelists in CTSM/python/ctsm/test/test_unit_neon_site.py.
Some additional discussion on unit tests' current implementation is in #2315.

I have addressed other comments that relate to what we wanted to do now.

Regarding 'make all', I'm hoping to test this but am currently out of scratch space and am thus running into errors... Fred is making room elsewhere for a simulation that's taking up the majority of that space, so hopefully I can test this soon.

@TeaganKing
Copy link
Contributor Author

TeaganKing commented Feb 26, 2024

Quick update, I'm currently running into some errors with test_sys_run_neon.TestSysRunNeon in make all but not with python3 ./run_ctsm_py_tests --sys. In Sam Rabin's comment mentioned here, changing directories out of the temp dir resolved this issue previously; however, that doesn't seem to be resolving the issue in this case. I am thinking this may be a result of building in tower_site instead of neon_site, but I still need to look into that in more detail to confirm.

The following pylint errors are also present:

************* Module ctsm.crop_calendars.process_ggcmi_shdates
ctsm/crop_calendars/process_ggcmi_shdates.py:450:4: C0206: Consider iterating with .items() (consider-using-dict-items)
ctsm/crop_calendars/process_ggcmi_shdates.py:501:8: C0206: Consider iterating with .items() (consider-using-dict-items)
************* Module ctsm.crop_calendars.grid_one_variable
ctsm/crop_calendars/grid_one_variable.py:25:16: R1735: Consider using {} instead of dict() (use-dict-literal)
************* Module ctsm.test.test_unit_singlept_data_surfdata
ctsm/test/test_unit_singlept_data_surfdata.py:489:8: E1137: "self.ds_test['PCT_URBAN']" does not support item assignment (unsupported-assignment-operation)
ctsm/test/test_unit_singlept_data_surfdata.py:1051:8: E1137: "self.ds_test['PCT_URBAN']" does not support item assignment (unsupported-assignment-operation)

@ekluzek
Copy link
Collaborator

ekluzek commented Feb 26, 2024

@TeaganKing do you need some help with this or can you continue to figure this out? Maybe we could meet say Friday sometime?

The changing out of temp-dir thing you mention is sometimes tricky because you have to do it on the test that's before the one where it breaks. So it might appear in the files you worked with, but it really originates in whatever test was done before yours. Hopefully, that makes sense enough to fix it, but otherwise let's meet later...

@TeaganKing
Copy link
Contributor Author

TeaganKing commented Feb 26, 2024

@ekluzek It does look like the temp dir that's causing errors was set up during test_sys_run_neon.test_one_site, which makes me think that the issue is actually during this step? I'll look into this more tomorrow and if I'm still stuck, I'll reach out for help on Friday. Thanks!

@TeaganKing
Copy link
Contributor Author

A few additional notes on the make all test error:

  • Running python run_ctsm_pytests -s worked, but make all failed.
  • Commenting out test_sys_run_neon.test_one_site results in all tests passing; so the error must be encountered during that particular test rather than as a result of other tests.
  • This error is not encountered when running neon at a particular site, eg with ./run_neon --neon-sites DEJU, the model builds and runs successfully.
  • Setting up real (permanent throughout the test) directories instead of temporary directories (that are removed) in test_sys_run_neon also did not resolve the issue.

Error details:
Occurs after this expected message:

---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---

Error:

======================================================================
ERROR: test_one_site (test.test_sys_run_neon.TestSysRunNeon)
This test specifies a site to run
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 704, in run_sub_or_cmd
    getattr(mod, subname)(*subargs)
  File "/glade/work/tking/temp/CTSM/cime/CIME/build_scripts/buildlib.mpi-serial", line 74, in buildlib
    run_bld_cmd_ensure_logging(cmd, logger)
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 2611, in run_bld_cmd_ensure_logging
    expect(stat == 0, filter_unicode(errput))
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 175, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: cat: Filepath: No such file or directory
/glade/derecho/scratch/tking/tmp/tmp7i5f_fyy/BART/Tools/Makefile:195: *** NETCDF not found: Define NETCDF_PATH or NETCDF_C_PATH and NETCDF_FORTRAN_PATH in config_machines.xml or cmake_macros.  Stop.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/work/tking/temp/CTSM/python/ctsm/test/test_sys_run_neon.py", line 61, in test_one_site
    main("")
  File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/run_neon.py", line 227, in main
    cesmroot, output_root, res, compset, user_mods_dirs, overwrite, setup_only
  File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/neon_site.py", line 51, in build_base_case
    case_path = super().build_base_case(cesmroot, output_root, res, compset, user_mods_dirs)
  File "/glade/work/tking/temp/CTSM/python/ctsm/site_and_regional/tower_site.py", line 154, in build_base_case
    build.case_build(case_path, case=case)
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1329, in case_build
    return run_and_log_case_status(functor, cb, caseroot=caseroot)
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 2477, in run_and_log_case_status
    rv = func()
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1320, in <lambda>
    dry_run,
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 1212, in _case_build_impl
    complist,
  File "/glade/work/tking/temp/CTSM/cime/CIME/build.py", line 840, in _build_libraries
    logfile=file_build,
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 716, in run_sub_or_cmd
    expect(False, "{} FAILED, cat {}".format(cmd, logfile))
  File "/glade/work/tking/temp/CTSM/cime/CIME/utils.py", line 175, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: /glade/work/tking/temp/CTSM/cime/CIME/build_scripts/buildlib.mpi-serial FAILED, cat /glade/derecho/scratch/tking/tmp/tmp7i5f_fyy/BART/bld/mpi-serial.bldlog.240227-101229

@ekluzek I will reach out to you for some help sorting out these details!

@TeaganKing
Copy link
Contributor Author

I wanted to note a few more thoughts on these test failures after meeting with @ekluzek earlier today.

  1. We noticed that the tests were succesful prior to some derecho updates / machine maintenance, so this could possibly be tied to the updates to a newer version of CIME.
  2. Case.build within the test case directory works fine; this shows that the case setup is done as expected, but for some reason the method of calling case build from within run_neon with the test setup seems to be failing.
  3. We considered that there could be an absolute path error; however, using the same absolute paths as in the test and running by hand is successful (eg, /glade/u/home/tking/work/temp/CTSM/tools/site_and_regional/run_neon --neon-sites BART --setup-only --output-root /glade/derecho/scratch/tking/tmp/temp ).
  4. Looking into Makefile,in the python directory it looks pretty straightforward and does not change directories or do anything that would affect a full path.
  5. Note that make all runs make utest and make stest (the latter being the relevant part to this issue). make stest fails.
  6. ./run_ctsm_py_tests -s succeeds and uses python3 but through the shebang line
  7. Running module load nco and then python3 ./run_ctsm_py_tests --sys is also successful (which is essentially what make stest does).

This issue is probably tied to details regarding how our testing is performed, and may require some additional brainstorming with folks who are more familiar with our testing suite.

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 5, 2024

I replicated the above behavior in my own sandbox. So it's reproducible. I don't know what might be different inside the Makefile environment than outside, which makes this strange.

A previous problem we've run into is that we are removing directory trees, but not making sure that isn't the current directory. So I tried making that update to other places, and it acts the same way. I do see something that does need to be fixed in the main code that I'll add a comment for. I'm doubtful that's the issue, but it is something good to fix.

@TeaganKing
Copy link
Contributor Author

TeaganKing commented Mar 5, 2024

I did also notice that with a fresh CTSM checkout, if I use the tower_site.py and other files that I've changed, this fails, but if I use the old files, 'make all' does not fail... So this probably is also a result of splitting up run_neon into multiple classes/files. @ekluzek when you say you tested this in your own sandbox, I assume that is with these code changes specifically rather than the current CTSM code?

Specifically moving build_base_case back into neon_site.py results in successful tests.

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 5, 2024

@TeaganKing yes exactly. It fails in my sandbox where I checked out your branch, but it passes in master or b4b-dev.

I don't think it's really about anything that you did, but something more subtle in the testing. What I'd like to do is to see if we can step through what's going on with a python debugger, or add more print statements to each thing going on to see how the Makefile environment is different from running directly. One guess I have is that Maybe there are different environment variable settings when run from inside the Makefile, than when run outside of make. Just checked that and the Makefile adds the following env settings:

> MAKEFLAGS=
> MAKELEVEL=1
> MAKE_TERMERR=/dev/pts/1
> MAKE_TERMOUT=/dev/pts/1
144a149
> MFLAGS=

Which doesn't look like it should matter...

Sometimes when new code is added the code can be fine, but it exposes an existing problem. This is rare, but when it happens it becomes really hard to track down.

@TeaganKing
Copy link
Contributor Author

In a discussion at the CTSM software meeting, we decided to remove the make stest and make utest components of the makefile due to redundancy with the python tests, so we will not resolve the make all issue mentioned above. The makefile changes will be done in a separate PR. The python tests were successful, so this should be good to merge in. @ekluzek I made some comments to your suggestions above.

@TeaganKing
Copy link
Contributor Author

Thanks Erik. Just to put everything in one place, I will describe all of the unresolved conversations here.

Testing-related issues which we decided to leave as is for now:
In neon_site.py, we could add an error check that the user_mods_dirs directory is valid.
TowerSite.str could be unit tested, although it's probably not essential at the moment.
TowerSite.get_batch_query and set_ref_case could be unit tested.

Testing outside the scope of this work:
We should generally add a utility for removing a directory tree since this problem seems to be repeatedly causing errors. This should be a new issue ticket that will probably also be relevant in other parts of CTSM, in addition to TowerSite.build_base_case and TowerSite.run_case.

Further Refactoring:
The TowerSite.build_base_case section that starts with with Case(case_path, read_only=False) as case: could be brought out into a separate method. run_case should also be broken down further in the future.

@ekluzek
Copy link
Collaborator

ekluzek commented Mar 6, 2024

Cool, thanks @TeaganKing. I want to get this into b4b-dev so it'll be on the next tag there that we move to main. Which means I need to get it on by tomorrow. So I'll just open those issues talked about above, and close the conversations and all that.

I also did some work on the branch to deal with the temporary directory issue, so I'll check that in. It didn't solve it, but should improve the future. So I might as well check that in here.

Thanks for the work on this. Looking forward to the next piece here.

@TeaganKing
Copy link
Contributor Author

Awesome, thank you Erik!

Copy link
Collaborator

@ekluzek ekluzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved all the conversations, so this can be merged in.

python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/neon_site.py Show resolved Hide resolved
python/ctsm/site_and_regional/tower_site.py Show resolved Hide resolved
 Conflicts:
	.git-blame-ignore-revs
	python/ctsm/site_and_regional/neon_site.py

Also moved changes on HEAD in neon_site to here..
        python/ctsm/site_and_regional/tower_site.py
… fix the conflict that ended up being committed, and make sure abort is added
@ekluzek
Copy link
Collaborator

ekluzek commented Mar 19, 2024

python testing is good and clm_pymods tests on Derecho pass. So merging in.

@ekluzek ekluzek merged commit d866132 into ESCOMP:b4b-dev Mar 19, 2024
2 checks passed
@ekluzek ekluzek deleted the towerclass branch March 19, 2024 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bfb bit-for-bit enhancement new capability or improved behavior of existing capability
Projects
Status: Done (non release/external)
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants