Unified rc12 testing: np.int error on Perlmutter #484

Closed
forsyth2 opened this issue Aug 17, 2023 · 26 comments
Labels
semver: bug Bug fix (will increment patch version)

Comments

@forsyth2
Collaborator

Running into the following on global_time_series on Perlmutter:

Traceback (most recent call last):
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 603, in <module>
    run(sys.argv)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 587, in run
    PLOT_DICT[plot_list[i]](ax, xlim, exps)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 295, in plot_max_moc
    plot(ax, xlim, exps, param_dict)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 374, in plot
    [year, var] = getmoc(exp["moc"])
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 29, in getmoc
    for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'inf'?

This np.int issue was resolved by https://github.com/E3SM-Project/zppy/pull/466/files#diff-a10da0e74d3b76ccfa4c61b3666ceb36b458e72e82820d85c357507070820badL29, so it shouldn't be an issue in Unified rc12...
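
For reference, a minimal sketch of the kind of replacement the error message recommends: the builtin `int` in place of the removed `np.int` alias. The helper name `year_range` and the sample values are hypothetical and are not taken from `coupled_global.py`; only the `range(int(...), int(...) + 1)` pattern mirrors the fixed line.

```python
import numpy as np


def year_range(time0):
    """Return the whole years spanned by a sorted array of decimal years.

    Uses the builtin int() where older code used the removed np.int alias.
    """
    # np.int was an alias for the builtin int, so int() truncates identically.
    first_year = int(time0[0])
    last_year = int(time0[-1])
    return list(range(first_year, last_year + 1))


# Hypothetical sample data: decimal years such as a time axis might hold.
time0 = np.array([1850.04, 1850.5, 1859.96, 1860.0])
print(year_range(time0))
```

Because `np.int` was only another name for `int`, the swap should not change results; `np.int64` or `np.int32` would only matter where a fixed precision is needed.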

@forsyth2 added the semver: bug and Testing labels, then removed the Testing label (Aug 17, 2023)
@forsyth2
Collaborator Author

/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_login/lib/python3.10/site-packages/zppy/templates/coupled_global.py:
for iyear in range(int(time0[0]), int(time0[-1]) + 1):

/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py:
for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):
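
A comparison like the one above can also be scripted. The following is a rough sketch, not part of zppy: the function name and the alias list are assumptions made for illustration. It reports each line of a source text that still uses one of the scalar aliases deprecated in NumPy 1.20, while leaving sized types such as `np.int64` alone.

```python
import re

# Assumed list of bare aliases to look for; adjust as needed.
REMOVED_ALIASES = ("int", "float", "bool", "object", "complex", "str")
ALIAS_PATTERN = re.compile(
    r"\b(?:np|numpy)\.(" + "|".join(REMOVED_ALIASES) + r")\b"
)


def find_removed_aliases(source_text):
    """Return (line_number, alias) pairs for each bare alias found."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for match in ALIAS_PATTERN.finditer(line):
            hits.append((line_number, match.group(1)))
    return hits


old_copy = "for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):"
new_copy = "for iyear in range(int(time0[0]), int(time0[-1]) + 1):"
print(find_removed_aliases(old_copy))
print(find_removed_aliases(new_copy))
```

The trailing word boundary is what keeps `np.int64`, `np.int_`, and `np.interp` from being reported.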

@forsyth2
Collaborator Author

Compare to Compy:

/compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py:
for iyear in range(int(time0[0]), int(time0[-1]) + 1):

@forsyth2
Collaborator Author

$ conda list zppy
# packages in environment at /global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_login:
#
# Name                    Version                   Build  Channel
zppy                      2.3.0rc5           pyh51c0ceb_0    conda-forge/label/zppy_dev

@chengzhuzhang
Collaborator

chengzhuzhang commented Aug 17, 2023

I'm looking at the source code in rc12.

/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/zppy/templates/coupled_global.py

The script is up-to-date.

It seems like we can rule out that:

  • zppy is executed in an old e3sm-unified rc
  • The coupled_global.py script in the run directory is not newly generated.

Then I'm out of ideas.

I guess it is worthwhile to try creating an empty result directory and re-testing the run.

@forsyth2
Collaborator Author

Thanks @chengzhuzhang. Yeah, I'll try re-running.

@xylar
Contributor

xylar commented Aug 18, 2023

@forsyth2, please keep me posted on this. I can't find anywhere in e3sm_diags or zppy that np.int is mentioned. This is the last issue I know of before we can release E3SM-Unified.

@forsyth2
Collaborator Author

I moved the post directory from the previous run and reran. Still running into the same issue.

$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
global_time_series_1850-1860.status:ERROR (5)
$ tail -n 17 global_time_series_1850-1860.o14019357 
Traceback (most recent call last):
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 603, in <module>
    run(sys.argv)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 587, in run
    PLOT_DICT[plot_list[i]](ax, xlim, exps)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 295, in plot_max_moc
    plot(ax, xlim, exps, param_dict)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 374, in plot
    [year, var] = getmoc(exp["moc"])
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 29, in getmoc
    for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'inf'?
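
The `grep -v "OK" *status` check can be approximated in Python as well. This is only a sketch under assumptions: it treats each `.status` file as holding a single short status string, and the file name `climo.status` in the demo is hypothetical.

```python
import tempfile
from pathlib import Path


def failed_tasks(scripts_dir):
    """Return {status_file_name: status_text} for every task not marked OK."""
    failures = {}
    for status_file in sorted(Path(scripts_dir).glob("*.status")):
        status = status_file.read_text().strip()
        # Mirrors `grep -v "OK"`: keep any status that does not contain OK.
        if "OK" not in status:
            failures[status_file.name] = status
    return failures


# Demo in a throwaway directory with one passing and one failing task.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "climo.status").write_text("OK\n")
    Path(tmp, "global_time_series_1850-1860.status").write_text("ERROR (5)\n")
    demo_failures = failed_tasks(tmp)

print(demo_failures)
```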

@xylar
Contributor

xylar commented Aug 18, 2023

@forsyth2, can you post your exact set of commands that is causing this error? Starting with cloning the zppy repo and creating a conda environment if that is part of the workflow? It feels to me like the testing process itself might be messing things up and bringing in the incorrect template from somewhere else.

@forsyth2
Collaborator Author

forsyth2 commented Aug 18, 2023

@xylar I'm following the directions at https://github.com/E3SM-Project/zppy/pull/440/files#diff-b84f28870c999d293db6137dfb571c070d7e3b9ccdb1b8263e32a3c3e45f8dd6 (docs/source/dev_guide/pre_release_testing.rst):

# Step 1
# Log into Perlmutter

# Step 2
$ cd /global/homes/f/forsyth/zppy # Repo already exists

# Step 3
$ git fetch upstream main
$ git checkout -b test_unified_rc12 upstream/main
$ git log # check the commits match https://github.com/E3SM-Project/zppy/commits/main
# Most recent: https://github.com/E3SM-Project/zppy/commit/49933f97cd80ab81c436a16bdf85fa9d114f9690 (bump to rc5)

# Step 4
$ cd ../e3sm_diags
$ git checkout main
$ git fetch upstream
$ git reset --hard upstream/main
$ git log # Should match https://github.com/E3SM-Project/e3sm_diags/commits/main
# Most recent: https://github.com/E3SM-Project/e3sm_diags/commit/6aaa7c27dcf65393dd07e258ef2e7da21a660e87 (bump to rc3)
$ mamba clean --all
$ mamba env create -f conda-env/dev.yml -n e3sm_diags_20230816
$ conda activate e3sm_diags_20230816
$ pip install .

# Step 5
$ cd ../zppy
$ emacs tests/integration/utils.py
# 1. Confirm last line includes `generate_cfgs(unified_testing=True)`
# Under `def get_perlmutter_expansions(config):`, set:
# 2. "diags_environment_commands": "source /global/homes/f/forsyth/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20230816"
# 3. "environment_commands_test": "source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.9.0rc12_pm-cpu.sh",

# Step 6
# Use Unified rc12 rather than zppy dev environment:
$ source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.9.0rc12_pm-cpu.sh

# Step 7
$ python -u -m unittest tests/test_*.py
# Unit tests pass

# Step 8 (this is the step I repeat to re-test)
# Follow directions at https://github.com/E3SM-Project/zppy/blob/main/tests/integration/generated/directions_pm-cpu.md:
$ rm -rf /global/cfs/cdirs/e3sm/www/forsyth/zppy_test_complete_run_www/v2.LR.historical_0201 # Or move somewhere else
$ rm -rf /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post # Or move somewhere else
$ python tests/integration/utils.py # only did the first time. Have a hard-coded time increase for MPAS-Analysis at the moment
$ zppy -c tests/integration/generated/test_complete_run_pm-cpu.cfg
# [wait for this to run]
$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
global_time_series_1850-1860.status:ERROR (5)
$ tail -n 17 global_time_series_1850-1860.o14019357 
Traceback (most recent call last):
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 603, in <module>
    run(sys.argv)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 587, in run
    PLOT_DICT[plot_list[i]](ax, xlim, exps)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 295, in plot_max_moc
    plot(ax, xlim, exps, param_dict)
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 374, in plot
    [year, var] = getmoc(exp["moc"])
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 29, in getmoc
    for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'inf'?

@chengzhuzhang
Collaborator

chengzhuzhang commented Aug 19, 2023

Thanks @xylar for suggesting that @forsyth2 share the testing steps. I'm trying to locate the central piece, which is the configuration file tests/integration/generated/test_complete_run_pm-cpu.cfg (link) on the branch test_unified_rc12.

It has environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.0rc9_chrysalis.sh" in the [default] section. I'm wondering if this is why coupled_global.py had numpy.int, which was only in older rcs of zppy.
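
A check like this can be automated. The sketch below is an illustration only: it scans cfg text with regular expressions rather than using zppy's own config parser, and the function names and the `expected_tag` argument are assumptions. It lists every `environment_commands` entry by section and flags those that do not mention the expected release candidate.

```python
import re

SECTION = re.compile(r"^\s*(\[+)\s*(.*?)\s*\]+\s*$")
ENV_LINE = re.compile(r'^\s*environment_commands\s*=\s*"(.*)"\s*$')


def environment_commands_by_section(cfg_text):
    """Return (section_name, command) pairs in file order."""
    section = None
    found = []
    for line in cfg_text.splitlines():
        section_match = SECTION.match(line)
        if section_match:
            section = section_match.group(2)
            continue
        env_match = ENV_LINE.match(line)
        if env_match:
            found.append((section, env_match.group(1)))
    return found


def stale_entries(cfg_text, expected_tag):
    """Return the entries whose command does not mention expected_tag."""
    return [
        (section, command)
        for section, command in environment_commands_by_section(cfg_text)
        if expected_tag not in command
    ]


sample_cfg = """[default]
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.0rc9_chrysalis.sh"
[global_time_series]
"""
print(stale_entries(sample_cfg, "1.9.0rc12"))
```

An entry that deliberately activates a separate environment would also be flagged, so the result is a list to review rather than a list of errors.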

@chengzhuzhang
Collaborator

The test_unified_rc12 branch might be outdated, but in tests/integration/utils.py only def get_compy_expansions(config) has the latest diags and unified environments

@xylar
Contributor

xylar commented Aug 19, 2023

@forsyth2,

I tried out your steps but I don't have permission to access:

/global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201

so I get:

ncclimo: ERROR specified input directory "/global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201/archive/atm/hist" does not exist

and similar.

I don't think this relates to the problem, but I don't understand the need for step 4. Don't you want to be testing e3sm_diags from E3SM-Unified 1.9.0rc12, too, not from e3sm_diags/main? In this particular situation I think they're the same, but there could be differences between the latest e3sm_diags RC and its main, and there could also be a difference between its conda package and installing it from source with pip. We want to test with the production environment wherever possible.

I will keep digging...

@xylar
Contributor

xylar commented Aug 19, 2023

@chengzhuzhang, I think @forsyth2's instructions above involve modifying tests/integration/utils.py, which will then modify tests/integration/generated/test_complete_run_pm-cpu.cfg. I think the issue is that different branches on different machines step on each other's toes so the branch on the main zppy repo isn't necessarily the one @forsyth2 is using for testing on Perlmutter.

@xylar
Contributor

xylar commented Aug 19, 2023

@forsyth2, can you run a simple, manual test that just involves a simple config file for zppy and no zppy or e3sm_diags repo? For example, can you log in to a fresh terminal, copy your tests/integration/generated/test_complete_run_pm-cpu.cfg somewhere else and use it to run zppy just using

source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.9.0rc12_pm-cpu.sh

for e3sm_diags and any other environment you need?

Update: Hmm, I don't actually see any reference to the e3sm_diags environment in test_complete_run_pm-cpu.cfg, so that must get used somewhere else?

@xylar
Contributor

xylar commented Aug 19, 2023

@forsyth2, what do you see when you run:

> grep templateDir global_time_series_1850-1860.settings
  'templateDir': '/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_login/lib/python3.10/site-packages/zppy/templates',

from the zppy_test_bundles_output/v2.LR.historical_0201/post/scripts directory? As you can see, in my run the templates come from the expected place, where zppy is installed in e3sm_unified_1.9.0rc12_login. I can also see that there are no np.int() calls in:

/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/zppy/templates/coupled_global.py
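
The same templateDir check can be expressed as a small script. This sketch assumes the settings file contains a line shaped like the grep output above; the helper names are hypothetical, and the second path in the demo is a made-up example of a location outside the conda environments.

```python
import ast
from pathlib import PurePosixPath


def read_template_dir(settings_text):
    """Extract the value from a line like  'templateDir': '/some/path',"""
    for line in settings_text.splitlines():
        stripped = line.strip().rstrip(",")
        if stripped.startswith("'templateDir'"):
            _, _, value = stripped.partition(":")
            # The value is a quoted Python string literal.
            return ast.literal_eval(value.strip())
    return None


def is_inside(path, prefix):
    """True if path lies under prefix (pure path comparison, no disk access)."""
    return PurePosixPath(path).is_relative_to(PurePosixPath(prefix))


settings = (
    "  'templateDir': '/global/common/software/e3sm/anaconda_envs/base/envs/"
    "e3sm_unified_1.9.0rc12_login/lib/python3.10/site-packages/zppy/templates',"
)
envs_root = "/global/common/software/e3sm/anaconda_envs/base/envs"

template_dir = read_template_dir(settings)
print(is_inside(template_dir, envs_root))
# Hypothetical path outside the environments, for contrast.
print(is_inside("/home/someone/.local/lib/python3.10/site-packages/zppy/templates", envs_root))
```

`PurePosixPath.is_relative_to` requires Python 3.9 or later, which the Python 3.10 environments in this thread satisfy.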

@chengzhuzhang
Collaborator

I tried to test this zppy run on perlmutter but don't have permission for input data: --input=/global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/archive/atm/hist

@chengzhuzhang
Collaborator

Posting for @forsyth2 (he's having computer issues this morning)

  • yes, the testing branches differ between machines so what’s on GitHub might not match up
  • the reason for the E3SM Diags environment is for the one test that specifically tests “environment_commands”. The rest of the sub tasks use the latest release

@xylar
Contributor

xylar commented Aug 21, 2023

Thanks, that sounds reasonable.

Please let me know if any progress gets made on tracking down where the np.int() is sneaking in from.

@forsyth2
Collaborator Author

forsyth2 commented Aug 21, 2023

I'm trying to locate the central piece, which is the configuration file tests/integration/generated/test_complete_run_pm-cpu.cfg (link) on the branch test_unified_rc12.

  1. The branch test_unified_rc12 is for Testing fixes for Compy #486 (testing code specific to Compy). I named the test-fixes branches on all the machines test_unified_rc12, so I'm going to need to rename them if I want a pull request for each up at the same time. I made a comment (https://github.com/E3SM-Project/zppy/pull/440/files#r1298974933) to update the release-testing docs to specify that the branch name should include the machine.

The test_unified_rc12 branch might be outdated, but in tests/integration/utils.py only def get_compy_expansions(config) has the latest diags and unified environments

  2. Yes, since test_unified_rc12 really should have been named test_unified_rc12_compy, only Compy's parameter dictionary was updated on that branch.

I don't have permission to access

  3. I just ran chmod -R o+r /global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201, so you should have read access now.

I don't understand the need for step 4

  4. Step 4 is specifically for the subtask that checks environment_commands is working properly. That subtask is [[ atm_monthly_180x360_aave_environment_commands ]], which includes environment_commands = "#expand diags_environment_commands#". See template file: https://github.com/E3SM-Project/zppy/blob/main/tests/integration/template_complete_run.cfg#L96. The other E3SM subtasks all use whatever is set for environment_commands for the rest of the test, which in this case is the "environment_commands_test": "source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.9.0rc12_pm-cpu.sh" specified in tests/integration/utils.py.
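
To illustrate how a `#expand key#` placeholder could be filled from a per-machine dictionary, here is a minimal sketch. It is not the code in tests/integration/utils.py; the function name and the regular expression are assumptions, and only the placeholder syntax and the dictionary key come from the discussion above.

```python
import re

# Matches placeholders of the form  #expand some_key#
EXPAND = re.compile(r"#expand\s+(\w+)#")


def expand_template(template_text, expansions):
    """Replace each '#expand key#' with expansions[key].

    An unknown key raises KeyError rather than being silently left in place.
    """
    return EXPAND.sub(lambda match: expansions[match.group(1)], template_text)


template = 'environment_commands = "#expand diags_environment_commands#"'
expansions = {
    "diags_environment_commands": (
        "source /global/homes/f/forsyth/miniconda3/etc/profile.d/conda.sh; "
        "conda activate e3sm_diags_20230816"
    ),
}
expanded = expand_template(template, expansions)
print(expanded)
```

Using a function as the replacement avoids any special handling of backslashes that a plain replacement string would get from `re.sub`.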

I think the issue is that different branches on different machines step on each other's toes so the branch on the main zppy repo isn't necessarily the one @forsyth2 is using for testing on Perlmutter.

  5. Yes -- see points 1 and 2 above. It's not so much that the changes are mutually exclusive as it is that on Chrysalis I make the changes Chrysalis needs, on Compy I make the changes Compy needs, and on Perlmutter I make the changes Perlmutter needs. Really, it could all be one branch but it would get annoying to keep it updated amongst three machines.

can you run a simple, manual test that just involves a simple config file for zppy

  6. Yes, I'll try running a simplified cfg on Perlmutter.

I don't actually see any reference to the e3sm_diags environment in test_complete_run_pm-cpu.cfg, so that must get used somewhere else?

  7. From the version on main: https://github.com/E3SM-Project/zppy/blob/main/tests/integration/generated/test_complete_run_pm-cpu.cfg#L95:
  [[ atm_monthly_180x360_aave_environment_commands ]]
  environment_commands = "source /global/homes/f/forsyth/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20230728"

what do you see when you run

# Complete-run test directory
$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ grep templateDir global_time_series_1850-1860.settings
  'templateDir': '/global/homes/f/forsyth/.local/lib/python3.10/site-packages/zppy/templates',

# Bundles test directory
$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts
$ grep templateDir global_time_series_1850-1860.settings
  'templateDir': '/global/homes/f/forsyth/.local/lib/python3.10/site-packages/zppy/templates',

$ grep -n "np.int" /global/homes/f/forsyth/.local/lib/python3.10/site-packages/zppy/templates/coupled_global.py 
29:        for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):

Indeed, this does not match yours.... I'm not quite sure how this is happening.

don't have permission for input data

  9. See point 3 above. You should be able to access it now.

Please let me know if any progress gets made on tracking down where the np.int() is sneaking in from.

  10. See point 8 (the templateDir output above).

@forsyth2
Collaborator Author

I just ran chmod -R o+r /global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201, so you should have read access now.

I also ran chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/ (#485 (comment)).

Yes, I'll try running a simplified cfg on Perlmutter.

$ cd /global/homes/f/forsyth/zppy_test
$ source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.9.0rc12_pm-cpu.sh
$ zppy -c test_np_int.cfg
$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_np_int_output/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
global_time_series_1850-1860.status:ERROR (5)
$ tail global_time_series_1850-1860.o14204061 
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_np_int_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 374, in plot
    [year, var] = getmoc(exp["moc"])
  File "/global/cfs/cdirs/e3sm/forsyth/zppy_test_np_int_output/v2.LR.historical_0201/post/scripts/global_time_series_1850-1860_dir/coupled_global.py", line 29, in getmoc
    for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'inf'?
$ grep "templateDir" global_time_series_1850-1860.settings
  'templateDir': '/global/homes/f/forsyth/.local/lib/python3.10/site-packages/zppy/templates',
$ grep -n "np.int" /global/homes/f/forsyth/.local/lib/python3.10/site-packages/zppy/templates/coupled_global.py 
29:        for iyear in range(np.int(time0[0]), np.int(time0[-1]) + 1):
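
One way to see which copy of a package an interpreter would pick up, without running the whole workflow, is to ask the import system directly. This is a hedged sketch: the helper names are hypothetical, and it rests on the assumption that the templates directory sits inside the installed zppy package, as the paths above suggest.

```python
import importlib.util
import sys
from pathlib import Path


def module_origin(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


def comes_from_environment(module_name, env_prefix=sys.prefix):
    """True if the module's file lies under the active environment prefix."""
    origin = module_origin(module_name)
    if origin is None:
        return False
    return Path(origin).resolve().is_relative_to(Path(env_prefix).resolve())


# Prints None when zppy is not installed for this interpreter.
print(module_origin("zppy"))
print(comes_from_environment("zppy"))
```

If the reported origin is under a home directory rather than the activated environment, a stray copy is shadowing the environment's package.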

@chengzhuzhang
Collaborator

chengzhuzhang commented Aug 22, 2023

From my zppy run on Perlmutter:

  1. The global_time_series task completed successfully. I can't reproduce the np.int error. At this point, @forsyth2, I think the problem might be induced by your local environment...
  2. I can confirm that the e3sm_diags runs completed within 2 hours:
real    51m28.987s
user    269m16.914s
sys     22m36.063s

It is indeed odd that only Chrysalis has longer than usual e3sm_diags run time.

@xylar
Contributor

xylar commented Aug 22, 2023

@forsyth2, the fact that you are getting python packages from:

/global/homes/f/forsyth/.local/lib/python3.10/site-packages/

suggests that your environment on Perlmutter has become misconfigured somehow. Your packages from .local are taking priority over those from conda environments. A first step would be to move your .local to something like old-local and try again.
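
To check for this kind of shadowing, the `site` and `sys` modules expose the relevant state. The sketch below is illustrative: `user_site_shadows` is a hypothetical helper, and the example `sys.path` ordering is an assumption based on CPython typically placing the per-user site directory ahead of the environment's site-packages when the user site is enabled.

```python
import site
import sys


def user_site_shadows(sys_path, user_site, env_site_packages):
    """True if user_site appears in sys_path before env_site_packages."""
    if user_site not in sys_path:
        return False
    if env_site_packages not in sys_path:
        return True
    return sys_path.index(user_site) < sys_path.index(env_site_packages)


# What the current interpreter reports about its own user site.
print("user site enabled:", site.ENABLE_USER_SITE)
print("user site dir:", site.getusersitepackages())
print("on sys.path:", site.getusersitepackages() in sys.path)

# Paths from this thread, arranged in an assumed search order.
user_site = "/global/homes/f/forsyth/.local/lib/python3.10/site-packages"
env_site = (
    "/global/common/software/e3sm/anaconda_envs/base/envs/"
    "e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages"
)
print(user_site_shadows(["", user_site, env_site], user_site, env_site))
```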

But it would be worth figuring out what workflow of yours is installing packages there in the first place. Whatever is causing that to happen, it does not seem to be a workflow that is compatible with conda environments. It would be worth checking in your .bashrc, .bash_profile, .bashrc.ext, .bash_profile.ext, etc. to see if there is also something that is making this happen, like loading the module for NERSC's anaconda package or some sort of alternative activation besides what Mambaforge would provide.

My first guess about what would cause this is using NERSC's Anaconda module to create your own conda environments that get installed in your .local. Even so, that doesn't explain why they are not in their own environment but are just in .local/lib directly, which I don't think is the usual behavior for Anaconda (but I don't use it myself so I'm not sure).

Another possibility is that you are creating environments using another tool like virtualenv or pip that I'm not very familiar with. I know that pre-commit installs things in your home directory somewhere and doesn't use conda (or mamba) to do it. But that wouldn't explain how zppy got installed there!

In any case, I really want to get to the bottom of this because it has slowed down the E3SM-Unified release by approximately a week.

@forsyth2
Collaborator Author

Re: points 1,2 above:

I pushed draft branches for Chrysalis (#487) and Perlmutter (#488) too.

A first step would be to move your .local

$ mv /global/homes/f/forsyth/.local /global/homes/f/forsyth/.old-local
$ source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.9.0rc12_pm-cpu.sh
$ cd /global/homes/f/forsyth/zppy_test
$ zppy -c test_np_int.cfg
# Only reruns the global time series tasks, since the others succeeded.
# Currently waiting on a compute node
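
As a possible alternative to moving the directory, Python can be told to ignore the per-user site directory through the `PYTHONNOUSERSITE` environment variable. The sketch below only demonstrates the effect on a child interpreter; whether jobs launched by zppy would inherit the variable is an assumption that has not been tested here, and `run_without_user_site` is a hypothetical helper.

```python
import os
import subprocess
import sys


def run_without_user_site(code):
    """Run a Python snippet in a child interpreter with the user site disabled."""
    env = dict(os.environ, PYTHONNOUSERSITE="1")
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child reports the flag that PYTHONNOUSERSITE switches on.
print(run_without_user_site("import sys; print(sys.flags.no_user_site)"))
```

With the flag set, packages under ~/.local/lib/pythonX.Y/site-packages are not added to `sys.path`, so the environment's own copies are the ones imported.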

@forsyth2
Collaborator Author

$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_np_int_output/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No failures

@xylar Thank you, that fix appears to work!!

@forsyth2
Collaborator Author

But it would be worth figuring out what workflow of yours is installing packages there in the first place

I created #490 to look into this and how zppy is picking this up.

@forsyth2
Collaborator Author

Marking this as resolved, since removing the .local directory is a decent work-around until #490 can be looked into more thoroughly.
