
Add lat_lon_land from e3sm_diags for land diagnostics #548

Merged (3 commits, Mar 1, 2024)

Conversation

chengzhuzhang
Collaborator

Resolves #518.

@chengzhuzhang
Collaborator Author

chengzhuzhang commented Feb 13, 2024

The results include three types of e3sm_diags runs:

  • model vs obs (atm)
  • model vs model (atm)
  • model vs model (land)

plus ILAMB.

Link to the viewers: https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.zhang40/tests/E3SMv3_dev/land_diags_try8/20231209.v3.LR.piControl-spinup.chrysalis/

Update: The configuration file that works so far:

[default]

input = /lcrc/group/e3sm2/ac.golaz/E3SMv3/20231209.v3.LR.piControl-spinup.chrysalis
output = /lcrc/group/e3sm/ac.zhang40/tests/20231209.v3.LR.piControl-spinup.chrysalis_land_diags_try7
case = 20231209.v3.LR.piControl-spinup.chrysalis
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.zhang40/tests/E3SMv3_dev/land_diags
partition = compute 
environment_commands = "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh"
debug = True

[climo]
active = True
walltime = "00:30:00"
years = "0051:0100:50",

    [[ atm_monthly_180x360_aave ]]
    frequency = "monthly"
    input_files = "eam.h0"
    input_subdir = archive/atm/hist
    mapping_file = /lcrc/group/e3sm/diagnostics/maps/map_ne30pg2_to_cmip6_180x360_aave.20200201.nc
    vars = ""

    [[ land_monthly_climo ]]
    frequency = "monthly"
    input_files = "elm.h0"
    input_subdir = archive/lnd/hist
    vars = ""

[ts]
active = True
walltime = "00:30:00"
years = "0051:0100:50",

    [[ atm_monthly_180x360_aave ]]
    frequency = "monthly"
    input_files = "eam.h0"
    input_subdir = "archive/atm/hist"
    mapping_file = /lcrc/group/e3sm/diagnostics/maps/map_ne30pg2_to_cmip6_180x360_aave.20200201.nc
    ts_fmt = "cmip"
#  
    [[ atm_monthly_glb ]]
    # Note global average won't work for 3D variables.
    frequency = "monthly"
    input_files = "eam.h0"
    input_subdir = "archive/atm/hist"
    mapping_file = "glb"

    [[ land_monthly ]]
    extra_vars = "landfrac"
    frequency = "monthly"
    input_files = "elm.h0"
    input_subdir = "archive/lnd/hist"
    mapping_file = /lcrc/group/e3sm/diagnostics/maps/map_r05_to_cmip6_180x360_aave.20231110.nc
    vars = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILICE,SOILLIQ,SOILWATER_10CM,TSA,TSOI,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR"
    ts_fmt = "cmip"
#  
    [[ land_monthly_glb ]]
    frequency = "monthly"
    input_files = "eam.h0"
    input_subdir = "archive/atm/hist"
    mapping_file = "glb"

[ilamb]
active = True
nodes = 8
walltime = "2:00:00"
partition = compute 
short_name = '20231209.v3.LR.piControl-spinup.chrysalis'
#grids = 'native'
grids = '180x360_aave'
ts_num_years = 50
years = "0051:0100:50"

  [[ land_monthly ]]

[e3sm_diags]
active = True
grid = '180x360_aave'
ref_final_yr = 2014
ref_start_yr = 1985
sets = "lat_lon",
short_name = '20231012.v3alpha04_trigrid_bgc.piControl.chrysalis'
ts_num_years = 50
walltime = "00:60:00"
years = "0051:0100:50",

    [[ atm_monthly_180x360_aave ]]
    partition = "compute"
    qos = "regular"
    sets = "lat_lon",

    [[ atm_monthly_180x360_aave_mvm ]]
    climo_subsection = "atm_monthly_180x360_aave"
    diff_title = "Difference"
    partition = "compute"
    qos = "regular"
    ref_final_yr = 1851
    ref_name = "v2.LR.historical_0201"
    ref_start_yr = 1850
    ref_years = "1850-1851",
    reference_data_path = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/atm/180x360_aave/clim"
    run_type = "model_vs_model"
    sets = "lat_lon",
    short_ref_name = "v2.LR.historical_0201"
    swap_test_ref = False
    tag = "model_vs_model"
    ts_num_years_ref = 2
    ts_subsection = "atm_monthly_180x360_aave"
    years = "0051:0100:50",

    [[ lnd_monthly_mvm_lnd ]]
    # Test model-vs-model using the same files as the reference
    grid = 'native'
    climo_subsection = "land_monthly_climo"
    diff_title = "Difference"
    partition = "compute"
    qos = "regular"
    ref_name = "20231209.v3.LR.piControl-spinup.chrysalis"
    ref_start_yr = 0051
    ref_final_yr = 0100
    ref_years = "0051-0100",
    reference_data_path = "/lcrc/group/e3sm/ac.zhang40/tests/20231209.v3.LR.piControl-spinup.chrysalis_land_diags/post/lnd/native/clim"
    run_type = "model_vs_model"
    sets = "lat_lon_land",
    short_ref_name = "same simulation"
    swap_test_ref = False
    tag = "model_vs_model"
    ts_num_years_ref = 50                                                    
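The repeated years = "0051:0100:50" entries in this cfg follow zppy's "start:end:step" convention: process the years from start through end in chunks of step years. A minimal sketch of that expansion (my own illustration, not zppy's actual parser, which also accepts lists of multiple specs):

```python
def expand_years(spec: str) -> list[tuple[int, int]]:
    """Expand a zppy-style "start:end:step" spec into (first, last) chunks.

    Only full chunks are produced, which is why a spec like "1850:1854:2"
    corresponds to the 1850-1851 and 1852-1853 jobs seen later in this thread.
    """
    start, end, step = (int(tok) for tok in spec.split(":"))
    chunks = []
    for first in range(start, end + 1, step):
        last = first + step - 1
        if last > end:  # drop a trailing partial chunk
            break
        chunks.append((first, last))
    return chunks

print(expand_years("0051:0100:50"))  # [(51, 100)]
print(expand_years("1850:1854:2"))   # [(1850, 1851), (1852, 1853)]
```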

@forsyth2
Collaborator

@chengzhuzhang Thanks for figuring out the necessary changes to add lat_lon_land. I checked that the unit tests passed and updated the integration tests to reflect the cfg you posted above. I'm currently running the integration tests. I'll add a commit for any fixes and then merge this PR. (And then we can close #534.)

@forsyth2
Collaborator

I'm trying to debug the complete_run failures. I'm completely baffled that ts_land_monthly_1850-1851-0002 fails, even after setting it to match exactly what's in the current template cfg. Relevant cfg sections below:

Working as of #424 (just merged):

[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = ""
input = "/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_complete_run.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-424/v2.LR.historical_0201"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/test-424"

[ts]
active = True
walltime = "00:30:00"
years = "1850:1854:2",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = "FSH,RH2M"
  ts_fmt = "cmip"

Failing in my complete_run test of this PR:

[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = ""
input = "/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_complete_run.py                                                                                                                    
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548/v2.LR.historical_0201"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/pr-548"

[ts]
active = True
walltime = "00:30:00"
years = "1850:1854:2",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = "FSH,RH2M"
  ts_fmt = "cmip"

These are identical aside from the output/www path. Yet, testing this PR gives me:

tail -n 16 ts_land_monthly_1850-1851-0002.o471575
2024-02-16 00:31:28,562 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-16 00:31:28,562 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-16 00:31:28,562_562:WARNING:derive_handlers:No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-16 00:31:28,562 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-16 00:31:28,562 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-16 00:31:28,562_562:INFO:_get_handlers:--------------------------------------
2024-02-16 00:31:28,562 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-16 00:31:28,562 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-16 00:31:28,562_562:INFO:_get_handlers:| Derived CMIP6 Variable Handlers
2024-02-16 00:31:28,562 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-16 00:31:28,562 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-16 00:31:28,562_562:INFO:_get_handlers:--------------------------------------
2024-02-16 00:31:28,562 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-16 00:31:28,562 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-16 00:31:28,562_562:ERROR:_get_handlers:No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
srun: error: chr-0511: task 0: Exited with exit code 1

For reference, the following are the relevant sections from the working cfg in #548 (comment):

[default]

input = /lcrc/group/e3sm2/ac.golaz/E3SMv3/20231209.v3.LR.piControl-spinup.chrysalis
output = /lcrc/group/e3sm/ac.zhang40/tests/20231209.v3.LR.piControl-spinup.chrysalis_land_diags_try7
case = 20231209.v3.LR.piControl-spinup.chrysalis
www = /lcrc/group/e3sm/public_html/diagnostic_output/ac.zhang40/tests/E3SMv3_dev/land_diags
partition = compute 
environment_commands = "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh"
debug = True

[climo]
active = True
walltime = "00:30:00"
years = "0051:0100:50",

[ts]
active = True
walltime = "00:30:00"
years = "0051:0100:50",

    [[ land_monthly ]]
    extra_vars = "landfrac"
    frequency = "monthly"
    input_files = "elm.h0"
    input_subdir = "archive/lnd/hist"
    mapping_file = /lcrc/group/e3sm/diagnostics/maps/map_r05_to_cmip6_180x360_aave.20231110.nc
    vars = "FSH,RH2M,LAISHA,LAISUN,QINTR,QOVER,QRUNOFF,QSOIL,QVEGE,QVEGT,SOILICE,SOILLIQ,SOILWATER_10CM,TSA,TSOI,H2OSNO,TOTLITC,CWDC,SOIL1C,SOIL2C,SOIL3C,SOIL4C,WOOD_HARVESTC,TOTVEGC,NBP,GPP,AR,HR"
    ts_fmt = "cmip"

I tried running with all those variables earlier, but ran into a number of regridding failures.

@chengzhuzhang
Collaborator Author

chengzhuzhang commented Feb 16, 2024

> @chengzhuzhang Thanks for figuring out the necessary changes to add lat_lon_land. I checked that the unit tests passed and updated the integration tests to reflect the cfg you posted above. I'm currently running the integration tests. I'll add a commit for any fixes and then merge this PR. (And then we can close #534.)

@forsyth2 could you clarify whether you updated the complete_run test cfg? If so, what changed? It would be easier to troubleshoot if you commit the code change.

@chengzhuzhang
Collaborator Author

> I'm trying to debug the complete_run failures. I'm completely baffled that ts_land_monthly_1850-1851-0002 fails, even after setting it to match exactly what's in the current template cfg. [...] I tried running with all those variables earlier, but ran into a number of regridding failures.

The regridding failures could be from using the wrong mapping file. In my example, I used map_r05_to_cmip6_180x360_aave.20231110.nc for the tri-grid v3 output, but for the v2 output you have been testing, the ne30pg2_to_cmip6 map is needed.

@forsyth2
Collaborator

@chengzhuzhang

> could you clarify whether you updated the complete_run test cfg?

Yes, I couldn't use your cfg exactly because you used different input (notably v3 rather than v2).

> If so, what changed? It would be easier to troubleshoot if you commit the code change.

See updates to tests/integration/template_complete_run.cfg in 61f99c3 and 998c87d.

I'm going to try rebasing the code onto the latest main and see if anything changes.

> The regridding failures could be from using the wrong mapping file. In my example, I used map_r05_to_cmip6_180x360_aave.20231110.nc for the tri-grid v3 output, but for the v2 output you have been testing, the ne30pg2_to_cmip6 map is needed.

Yes, I noticed the mapping file change, but found that v2 does indeed need the original mapping file.

@forsyth2
Collaborator

> I'm going to try rebasing the code

No, it still fails:

tail -n 16 ts_land_monthly_1850-1851-0002.o471956 
2024-02-16 19:35:56,741 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-16 19:35:56,741 [WARNING]: utils.py(derive_handlers:218) >> No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-16 19:35:56,741_741:WARNING:derive_handlers:No handlers could be derived for the variables: ['mrsos', 'mrso', 'mrfso', 'mrros', 'mrro', 'prveg', 'evspsblveg', 'evspsblsoi', 'tran', 'tsl', 'lai', 'cLitter', 'cProduct', 'cSoilFast', 'cSoilMedium', 'cSoilSlow', 'fFire', 'fHarvest', 'cVeg', 'nbp', 'gpp', 'ra', 'rh']. Make sure the input E3SM datasets have the variables needed derivation.
2024-02-16 19:35:56,741 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-16 19:35:56,741 [INFO]: __main__.py(_get_handlers:220) >> --------------------------------------
2024-02-16 19:35:56,741_741:INFO:_get_handlers:--------------------------------------
2024-02-16 19:35:56,741 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-16 19:35:56,741 [INFO]: __main__.py(_get_handlers:221) >> | Derived CMIP6 Variable Handlers
2024-02-16 19:35:56,741_741:INFO:_get_handlers:| Derived CMIP6 Variable Handlers
2024-02-16 19:35:56,741 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-16 19:35:56,741 [INFO]: __main__.py(_get_handlers:222) >> --------------------------------------
2024-02-16 19:35:56,741_741:INFO:_get_handlers:--------------------------------------
2024-02-16 19:35:56,741 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-16 19:35:56,741 [ERROR]: __main__.py(_get_handlers:230) >> No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
2024-02-16 19:35:56,741_741:ERROR:_get_handlers:No CMIP6 variable handlers were derived from the variables found in using the E3SM input datasets.
srun: error: chr-0499: task 0: Exited with exit code 1

There's nothing in this PR that would cause that task to fail. It just worked when testing #424. I don't understand how it could possibly be a flaky test (https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html)...

@chengzhuzhang
Collaborator Author

I will see if I can reproduce it this afternoon.

@chengzhuzhang
Collaborator Author

@forsyth2 I checked the test for PR #424: /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-424/v2.LR.historical_0201/post/lnd/180x360_aave/ts/monthly/2yr/ includes the variables LAISUN, LAISHA, FSH, RH2M, while in the new test only FSH and RH2M are in the corresponding folder. FSH and RH2M don't have e3sm_to_cmip handlers to generate CMIP-like variables (reference: https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/925500501/Lmon+variable+conversion+table), so errors are expected. I guess the change of the variable list somehow caused the error.
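The diagnosis above boils down to a simple consistency check: a CMIP land variable can only be derived when every raw ELM variable it is built from is present in the time-series output. The NEEDS mapping below is a hypothetical excerpt for illustration (the authoritative mapping is the Lmon conversion table linked above, and the real logic lives in e3sm_to_cmip's handlers):

```python
# Hypothetical excerpt of ELM -> CMIP derivation requirements; not
# e3sm_to_cmip's actual tables.
NEEDS = {
    "lai": {"LAISHA", "LAISUN"},
    "gpp": {"GPP"},
    "nbp": {"NBP"},
}

def split_derivable(requested, available):
    """Partition requested CMIP variables into derivable vs. not, given the
    raw variables present in the ts output (the `vars = ...` list)."""
    avail = set(available)
    derivable = [v for v in requested if NEEDS.get(v, set()) and NEEDS[v] <= avail]
    missing = [v for v in requested if v not in derivable]
    return derivable, missing

# The failing cfg only produced FSH and RH2M, so no handler can be derived,
# matching the "No handlers could be derived" warnings in the log above:
print(split_derivable(["lai", "gpp", "nbp"], ["FSH", "RH2M"]))
# The PR #424 test also had LAISUN and LAISHA, so 'lai' works:
print(split_derivable(["lai"], ["LAISUN", "LAISHA", "FSH", "RH2M"]))
```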

@forsyth2
Collaborator

@chengzhuzhang Thanks for checking. I did a thorough review of what's going on in #549. For the purposes of testing this, I will re-run jobs until I get dependencies succeeding, and then make sure the integration tests pass. Testing will have this added complication until #549 can be fully resolved...

@forsyth2
Collaborator

forsyth2 commented Feb 20, 2024

Test status so far (running on a rebased commit I haven't pushed yet):

Ran zppy -c tests/integration/generated/test_complete_run_chrysalis.cfg

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851.status:ERROR (9)
ilamb_1850-1851.status:WAITING 473989
ilamb_1852-1853.status:WAITING 473990
ts_land_monthly_1850-1851-0002.status:ERROR (5)
$ cat e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851.o473985 

===== RUN E3SM DIAGS =====

Traceback (most recent call last):
  File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/scripts/tmp.473985.Uc0E/e3sm.py", line 35, in <module>
    runner.run_diags(params)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_diags/run.py", line 79, in run_diags
    params = self.get_run_parameters(parameters, use_cfg)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_diags/run.py", line 131, in get_run_parameters
    self.parser.check_values_of_params(run_params)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_diags/parser/core_parser.py", line 50, in check_values_of_params
    p.check_values()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_diags/parameter/core_parameter.py", line 213, in check_values
    raise RuntimeError(msg)
RuntimeError: You need to specify reference_data_path in the parameters file or via the command line using --reference_data_path
srun: error: chr-0290: task 0: Exited with exit code 1

real	0m17.785s
user	0m0.007s
sys	0m0.006s
$ tail -n 20 ts_land_monthly_1850-1851-0002.o473969 
2024-02-20 18:56:50,585 [INFO]: handler.py(cmorize:247) >> lai: creating CMOR variable with CMOR axis objects.
2024-02-20 18:56:50,585_585:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 912, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
100%|██████████| 1/1 [00:00<00:00,  1.47it/s]
2024-02-20 18:56:50,823 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-20 18:56:50,823 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-20 18:56:50,823_823:INFO:_run_parallel:0 of 1 handlers complete
2024-02-20 18:56:50,824 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-20 18:56:50,824 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-20 18:56:50,824_824:ERROR:_run_parallel:lai failed to complete
2024-02-20 18:56:50,824 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-20 18:56:50,824 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-20 18:56:50,824_824:ERROR:_run_parallel:0 of 1 handlers complete
'LAISUN'
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

=>

  • Need to specify reference_data_path in the parameters file for e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851. But that is already specified...
[[ lnd_monthly_mvm_lnd ]]
...
reference_data_path = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/lnd/native/clim"
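The RuntimeError in the traceback above comes from e3sm_diags' parameter validation (check_values in core_parameter.py). A toy paraphrase of that guard, to show why a model-vs-model run aborts before any plotting when reference_data_path never reaches it (my sketch, not the actual e3sm_diags code):

```python
class ToyParams:
    """Toy stand-in paraphrasing e3sm_diags' CoreParameter.check_values."""

    def __init__(self, run_type, reference_data_path=None):
        self.run_type = run_type
        self.reference_data_path = reference_data_path

    def check_values(self):
        # Abort early if no reference data path was supplied.
        if not self.reference_data_path:
            raise RuntimeError(
                "You need to specify reference_data_path in the parameters "
                "file or via the command line using --reference_data_path"
            )

# The cfg above does set reference_data_path for [[ lnd_monthly_mvm_lnd ]],
# so hitting this error suggests the value was lost before the generated
# e3sm.py run script was built, not that the cfg line is missing.
ToyParams("model_vs_model", "/some/clim/path").check_values()  # passes
```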

@forsyth2
Collaborator

Re-running

$ tail -n 20 ts_land_monthly_1850-1851-0002.o474039 
2024-02-20 22:04:45,724 [INFO]: handler.py(cmorize:247) >> lai: creating CMOR variable with CMOR axis objects.
2024-02-20 22:04:45,724_724:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 912, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
100%|██████████| 1/1 [00:00<00:00,  1.79it/s]
2024-02-20 22:04:45,965 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-20 22:04:45,965 [INFO]: __main__.py(_run_parallel:930) >> 0 of 1 handlers complete
2024-02-20 22:04:45,965_965:INFO:_run_parallel:0 of 1 handlers complete
2024-02-20 22:04:45,965 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-20 22:04:45,965 [ERROR]: __main__.py(_run_parallel:934) >> lai failed to complete
2024-02-20 22:04:45,965_965:ERROR:_run_parallel:lai failed to complete
2024-02-20 22:04:45,965 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-20 22:04:45,965 [ERROR]: __main__.py(_run_parallel:935) >> 0 of 1 handlers complete
2024-02-20 22:04:45,965_965:ERROR:_run_parallel:0 of 1 handlers complete
NetCDF: Not a valid ID
mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory

Further reruns

I tried re-running several times, a few of them with vars = "LAISHA,LAISUN" changed back to the original vars = "FSH,RH2M" for [[ land_monthly ]].

Ultimately, though, running sbatch ts_land_monthly_1850-1851-0002.bash directly (i.e., not through zppy) seemed to work.

ILAMB is able to run once that dependency is taken care of. Still running into an E3SM Diags error though:

$ cat e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1850-1851_vs_1850-1851.o474199 
cp: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/lnd/native/clim/2yr/v2.LR.historical_0201_*_1850??_1851??_climo.nc': No such file or directory
$ ls /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/pr-548-rebased-20240220/v2.LR.historical_0201/post/lnd/
180x360_aave
# No "native" directory

@chengzhuzhang
Collaborator Author

@forsyth2 what is the e3sm_diags error? The reason the native grid works for v3 is that the tri-grid configuration outputs land data on a regular lat-lon grid. For the v2 data you have been testing, the land output needs to be regridded, so native-grid data won't work as input for e3sm_diags.

@forsyth2
Collaborator

> For the v2 data you have been testing, the land output needs to be regridded, so native-grid data won't work as input for e3sm_diags.

Ah, so should I be using the 180x360 grid, as before?

Follow-up: should zppy still be testing on v2, or should we transition the tests to start using v3?

@chengzhuzhang
Collaborator Author

> > For the v2 data you have been testing, the land output needs to be regridded, so native-grid data won't work as input for e3sm_diags.
>
> Ah, so should I be using the 180x360 grid, as before?
>
> Follow-up: should zppy still be testing on v2, or should we transition the tests to start using v3?

Yes, for now we should continue to use the 180x360 grid when testing v2. But we should transition to v3 datasets, or add a new configuration file.
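The mapping-file rule from this exchange can be captured in a small lookup. The two file names are the maps quoted earlier in the thread; the version keys and helper function are my own convenience, not a zppy API:

```python
MAP_DIR = "/lcrc/group/e3sm/diagnostics/maps"

# v2: land history is on the same ne30pg2 grid as the atmosphere;
# v3 tri-grid: land history is on the r05 grid.
LAND_MAPS = {
    "v2": "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc",
    "v3": "map_r05_to_cmip6_180x360_aave.20231110.nc",
}

def land_mapping_file(model_version: str) -> str:
    """Return the mapping file to put in the [ts]/[climo] land subsections."""
    return f"{MAP_DIR}/{LAND_MAPS[model_version]}"

print(land_mapping_file("v2"))
```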

forsyth2 mentioned this pull request on Feb 21, 2024.
@forsyth2
Collaborator

> But we should transition to v3 datasets, or add a new configuration file.

#552

@forsyth2
Collaborator

Now that #549 has been resolved, I'm running the tests on this PR so that I can update the expected images to include lat_lon_land (and see whether any issues remain).

@forsyth2
Collaborator

Interestingly, I'm still seeing the concurrent/futures/_base.py error in the bundles test, even with the latest e3sm_to_cmip master. The complete_run test ran fine, though.

@forsyth2
Collaborator

forsyth2 commented Feb 23, 2024

Never mind, I still had vars = "FSH,RH2M" in the bundles test instead of vars = "LAISHA,LAISUN". Re-running.

Also, I didn't see concurrent/futures/_base.py in the log file when I tried running the ts_land task with sbatch. So, it's possible concurrent/futures/_base.py appeared somewhere else in the bundle output but wasn't error-causing...
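For context, the variable list lives in the zppy configuration's time-series section. A minimal fragment, in the style of the configuration shown above, might look like the following (the section names and keys other than `vars` are illustrative and may differ from the actual bundles test configuration):

```ini
[ts]
active = True

    [[ land_monthly ]]
    frequency = "monthly"
    input_files = "elm.h0"
    input_subdir = archive/lnd/hist
    # Land LAI variables used by the bundles test
    # (replacing the earlier "FSH,RH2M"):
    vars = "LAISHA,LAISUN"
```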

@forsyth2
Collaborator

Actually the error does seem to still happen:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/pr-548-20240223v2/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
bundle1.status:ERROR
ts_land_monthly_1852-1853-0002.status:ERROR (5)
$ sbatch ts_land_monthly_1852-1853-0002.bash 
$ grep -v "OK" *status
bundle1.status:ERROR
# Running sbatch directly worked.
$ grep -in "concurrent/futures" bundle1.o476037 
1022:  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240223/lib/python3.11/concurrent/futures/_base.py", line 456, in result
1025:  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240223/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
1534:  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240223/lib/python3.11/concurrent/futures/_base.py", line 456, in result
1537:  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240223/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result

@forsyth2 forsyth2 force-pushed the add_lat_lon_land_e3sm_diags branch 2 times, most recently from f7ed028 to 77fe3fb Compare February 26, 2024 23:51
@chengzhuzhang
Collaborator Author

Yes, it does look like the e3sm_to_cmip concurrency issue is still happening, with the standalone e3sm_to_cmip environment generated last Friday:

  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240223/lib/python3.11/site-packages/e3sm_to_cmip/__main__.py", line 931, in _run_parallel
    out = res.result()
          ^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240223/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_to_cmip_20240223/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
^M 50%|█████     | 1/2 [00:02<00:02,  2.96s/it]2024-02-23 23:58:46,731 [INFO]: __main__.py(_run_parallel:940) >> Finished tas, 2/2 jobs complete
2024-02-23 23:58:46,731 [INFO]: __main__.py(_run_parallel:940) >> Finished tas, 2/2 jobs complete
2024-02-23 23:58:46,731_731:INFO:_run_parallel:Finished tas, 2/2 jobs complete
^M100%|██████████| 2/2 [00:02<00:00,  1.48s/it]
2024-02-23 23:58:46,763 [INFO]: __main__.py(_run_parallel:949) >> 1 of 2 handlers complete
2024-02-23 23:58:46,763 [INFO]: __main__.py(_run_parallel:949) >> 1 of 2 handlers complete
2024-02-23 23:58:46,763_763:INFO:_run_parallel:1 of 2 handlers complete
2024-02-23 23:58:46,763 [ERROR]: __main__.py(_run_parallel:953) >> pr failed to complete
2024-02-23 23:58:46,763 [ERROR]: __main__.py(_run_parallel:953) >> pr failed to complete
2024-02-23 23:58:46,763_763:ERROR:_run_parallel:pr failed to complete
2024-02-23 23:58:46,763 [ERROR]: __main__.py(_run_parallel:954) >> 1 of 2 handlers complete
2024-02-23 23:58:46,763 [ERROR]: __main__.py(_run_parallel:954) >> 1 of 2 handlers complete
2024-02-23 23:58:46,763_763:ERROR:_run_parallel:1 of 2 handlers complete
NetCDF: Not a valid ID
==============================================
Elapsed time: 52 seconds

@chengzhuzhang
Collaborator Author

Searching for the error NetCDF: Not a valid ID, I landed on a similar issue and a potential solution.
To summarise in this thread: it looks like a work-around in netcdf4-python to deal with netcdf-c not being thread safe was removed in 1.6.1. The solution (for now) is to [make sure your cluster only uses 1 thread per worker](https://forum.access-hive.org.au/t/netcdf-not-a-valid-id-errors/389/14).

@tomvothecoder In this case, it might be worth trying parallel=False in xarray.open_mfdataset to bypass this issue.
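The suggested workaround can be sketched as follows. This is an illustrative, self-contained example (the tiny files and variable name `tas` are stand-ins, not actual E3SM output): with `parallel=False`, xarray opens the files serially instead of through concurrent dask tasks, so netcdf-c is never entered from multiple threads at once.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Build two tiny NetCDF files to stand in for a time series split
# across files (illustrative data only).
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    ds = xr.Dataset(
        {"tas": ("time", np.full(3, 280.0 + i))},
        coords={"time": np.arange(i * 3, i * 3 + 3)},
    )
    path = os.path.join(tmpdir, f"tas_{i}.nc")
    ds.to_netcdf(path)
    paths.append(path)

# parallel=False opens the files serially, avoiding concurrent calls
# into netcdf-c (which is not thread safe), at the cost of slower
# multi-file opens.
merged = xr.open_mfdataset(paths, combine="by_coords", parallel=False)
```

The forum thread's alternative, keeping parallel opens but restricting each dask worker to a single thread (e.g. `dask.distributed.Client(threads_per_worker=1)`), should also avoid the race, since netcdf-c calls then never overlap within a worker.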

@tomvothecoder
Collaborator

Searching for the error NetCDF: Not a valid ID, I landed on a similar issue and a potential solution. To summarise in this thread: it looks like a work-around in netcdf4-python to deal with netcdf-c not being thread safe was removed in 1.6.1. The solution (for now) is to [make sure your cluster only uses 1 thread per worker](https://forum.access-hive.org.au/t/netcdf-not-a-valid-id-errors/389/14).

@tomvothecoder In this case, it might be worth trying parallel=False in xarray.open_mfdataset to bypass this issue.

Thanks for finding this helpful thread! Jason might have been right that the filesystem on the server does not behave well with Xarray's parallel access at random times.

I'll make a new release of e3sm_to_cmip with parallel=False to hopefully work around this error for good.

@tomvothecoder
Collaborator

I'll make a new release of e3sm_to_cmip with parallel=False to hopefully work around this error for good.

e3sm_to_cmip v1.11.2rc2 will be available on conda-forge shortly.

Feedstock PR: conda-forge/e3sm_to_cmip-feedstock#35

@chengzhuzhang
Collaborator Author

@forsyth2 did you have a chance to test the new e3sm_to_cmip version?

@forsyth2
Collaborator

forsyth2 commented Mar 1, 2024

@chengzhuzhang I began testing yesterday and will have a full update later today. So far, so good though.

@forsyth2 forsyth2 force-pushed the add_lat_lon_land_e3sm_diags branch from 8e5b1af to bd70de6 Compare March 1, 2024 20:00
Collaborator

@forsyth2 forsyth2 left a comment


@chengzhuzhang Tests now passing. @tomvothecoder Looks like the latest e3sm_to_cmip fully resolves the concurrency error.

@forsyth2 forsyth2 merged commit c3f463d into main Mar 1, 2024
4 checks passed
@forsyth2 forsyth2 deleted the add_lat_lon_land_e3sm_diags branch March 1, 2024 20:04
@forsyth2 forsyth2 mentioned this pull request Mar 1, 2024
@forsyth2 forsyth2 mentioned this pull request Oct 16, 2024
13 tasks
Successfully merging this pull request may close these issues.

[Feature]: Add new E3SM Diags set -- lat_lon_land