NorESM2.5 crash seemingly related to PIO_STRIDE setting #593

Open
matsbn opened this issue Nov 21, 2024 · 7 comments

@matsbn
Contributor

matsbn commented Nov 21, 2024

Describe the bug

  • NorESM version: noresm2_5_alpha7
  • HPC platform: betzy
  • Compiler (if applicable): intel
  • Compset (if applicable): NOIIAJRA
  • Resolution (if applicable): TL319_tn14
  • Error message (if applicable):
    123: [b3165:425392] *** An error occurred in MPI_Gather
    123: [b3165:425392] *** reported by process [23226195771392,123]
    123: [b3165:425392] *** on communicator MPI COMMUNICATOR 49 SPLIT FROM 44
    123: [b3165:425392] *** MPI_ERR_TRUNCATE: message truncated
    123: [b3165:425392] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    123: [b3165:425392] *** and potentially your MPI job)
    0: slurmstepd: error: *** STEP 1033510.0 ON b3164 CANCELLED AT 2024-11-19T15:13:33 ***

To Reproduce
Steps to reproduce the behavior:

  1. ./xmlchange BLOM_VCOORD=cntiso_hybrid,BLOM_TURBULENT_CLOSURE=
  2. ./xmlchange NTASKS_OCN=354

Expected behavior
The model stops unexpectedly during what I believe is an I/O operation. I have found that reducing PIO_STRIDE in env_run.xml from $MAX_MPITASKS_PER_NODE (128 on Betzy) to 8 solves the problem. A hypothesis is that $MAX_MPITASKS_PER_NODE is larger than the CICE processor count in this configuration, leading to the stop during I/O. The setting of PIO_STRIDE to 8 is rather arbitrary; the functioning and performance of other PIO_STRIDE settings have not been explored.
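
As a rough illustration of the hypothesis, the sketch below (Python, purely illustrative and not actual PIO/CIME logic) estimates how many I/O tasks each component would get for this PE layout, assuming the I/O task count is derived as roughly NTASKS divided by PIO_STRIDE when PIO_NUMTASKS is left at its default; the real rounding/clamping in PIO and CIME may differ.

# Illustrative only: estimate I/O tasks per component for this PE layout,
# assuming the count is roughly NTASKS // PIO_STRIDE (an assumption, not
# the exact PIO/CIME behaviour).
ntasks = {"ATM": 128, "CPL": 128, "OCN": 354, "WAV": 128, "GLC": 128,
          "ICE": 96, "ROF": 16, "LND": 128, "ESP": 1}

for stride in (128, 8):  # 128 = $MAX_MPITASKS_PER_NODE on Betzy, 8 = the workaround
    print(f"PIO_STRIDE = {stride}")
    for comp, n in sorted(ntasks.items()):
        io_tasks = n // stride  # zero for any component with NTASKS < stride
        print(f"  {comp}: {n} tasks -> {io_tasks} I/O task(s)")

Under that naive arithmetic, ICE (96 tasks), ROF (16 tasks) and ESP (1 task) end up with no I/O tasks at a stride of 128, which would be consistent with a stride larger than the component task count being the trigger.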

matsbn added the bug (Something isn't working) label Nov 21, 2024
matsbn added this to the NorESM2.5 milestone Nov 21, 2024
@gold2718

@matsbn, do you have a pelayout handy for a failing case?

@matsbn
Contributor Author

matsbn commented Nov 21, 2024

The env_mach_pes.xml is pasted in below (I could not attach xml files), or did you actually mean the pelayout file in the case folder?

<?xml version="1.0"?>
<file id="env_mach_pes.xml" version="2.0">
  <header>
    These variables CANNOT be modified once case_setup has been
    invoked without first invoking case_setup -reset

    NTASKS: the total number of MPI tasks, a negative value indicates nodes rather than tasks.
    NTHRDS: the number of OpenMP threads per MPI task.
    ROOTPE: the global mpi task of the component root task, if negative, indicates nodes rather than tasks.
    PSTRID: the stride of MPI tasks across the global set of pes (for now set to 1)
    NINST : the number of component instances (will be spread evenly across NTASKS)

    for example, for NTASKS = 8, NTHRDS = 2, ROOTPE = 32, NINST  = 2
    the MPI tasks would be placed starting on global pe 32 and each pe would be threaded 2-ways
    These tasks will be divided amongst both instances (4 tasks each).

    Note: PEs that support threading never have an MPI task associated
    with them for performance reasons.  As a result, NTASKS and ROOTPE
    are relatively independent of NTHRDS and they determine
    the layout of mpi processors between components.  NTHRDS is used
    to determine how those mpi tasks should be placed across the machine.

    The following values should not be set by the user since they'll be
    overwritten by scripts: TOTALPES, NTASKS_PER_INST
    </header>
  <comment>none</comment>
  <group id="mach_pes">
    <entry id="ESMF_AWARE_THREADING" value="FALSE">
      <type>logical</type>
      <valid_values>TRUE,FALSE</valid_values>
      <desc>TRUE indicates that the ESMF Aware threading method is used</desc>
    </entry>
    <entry id="ALLOCATE_SPARE_NODES" value="FALSE">
      <type>logical</type>
      <valid_values>TRUE,FALSE</valid_values>
      <desc>Allocate some spare nodes to handle node failures. The system will pick a reasonable number</desc>
    </entry>
    <entry id="FORCE_SPARE_NODES" value="-999">
      <type>integer</type>
      <desc>Force this exact number of spare nodes to be allocated</desc>
    </entry>
    <entry id="NTASKS">
      <type>integer</type>
      <values>
        <value compclass="ATM">128</value>
        <value compclass="CPL">128</value>
        <value compclass="OCN">354</value>
        <value compclass="WAV">128</value>
        <value compclass="GLC">128</value>
        <value compclass="ICE">96</value>
        <value compclass="ROF">16</value>
        <value compclass="LND">128</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>number of tasks for each component</desc>
    </entry>
    <entry id="NTASKS_PER_INST">
      <type>integer</type>
      <values>
        <value compclass="ATM">128</value>
        <value compclass="OCN">354</value>
        <value compclass="WAV">128</value>
        <value compclass="GLC">128</value>
        <value compclass="ICE">96</value>
        <value compclass="ROF">16</value>
        <value compclass="LND">128</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>Number of tasks per instance for each component. DO NOT EDIT: Set automatically by case.setup based on NTASKS, NINST and MULTI_DRIVER</desc>
    </entry>
    <entry id="NTHRDS">
      <type>integer</type>
      <values>
        <value compclass="ATM">1</value>
        <value compclass="CPL">1</value>
        <value compclass="OCN">1</value>
        <value compclass="WAV">1</value>
        <value compclass="GLC">1</value>
        <value compclass="ICE">1</value>
        <value compclass="ROF">1</value>
        <value compclass="LND">1</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>number of threads for each task in each component</desc>
    </entry>
    <entry id="ROOTPE">
      <type>integer</type>
      <values>
        <value compclass="ATM">0</value>
        <value compclass="CPL">0</value>
        <value compclass="OCN">128</value>
        <value compclass="WAV">0</value>
        <value compclass="GLC">0</value>
        <value compclass="ICE">32</value>
        <value compclass="ROF">16</value>
        <value compclass="LND">0</value>
        <value compclass="ESP">0</value>
      </values>
      <desc>ROOTPE (mpi task in MPI_COMM_WORLD) for each component</desc>
    </entry>
    <entry id="MULTI_DRIVER" value="TRUE">
      <type>logical</type>
      <valid_values>TRUE</valid_values>
      <desc>MULTI_DRIVER mode provides a separate driver/coupler component for each
    ensemble member.  All components must have an equal number of members.
    Multidriver is always true for nuopc, variable is left for compatibility with the mct driver</desc>
    </entry>
    <entry id="NINST" value="1">
      <type>integer</type>
      <desc>Number of instances of the model.
    </desc>
    </entry>
    <entry id="PSTRID">
      <type>integer</type>
      <values>
        <value compclass="ATM">1</value>
        <value compclass="CPL">1</value>
        <value compclass="OCN">1</value>
        <value compclass="WAV">1</value>
        <value compclass="GLC">1</value>
        <value compclass="ICE">1</value>
        <value compclass="ROF">1</value>
        <value compclass="LND">1</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>The mpi global processors stride associated with the mpi tasks for the a component</desc>
    </entry>
    <entry id="NGPUS_PER_NODE" value="0">
      <type>integer</type>
      <desc> Number of GPUs per node used for simulation </desc>
    </entry>
  </group>
  <group id="mach_pes_last">
    <entry id="COST_PES" value="512">
      <type>integer</type>
      <desc>pes or cores used relative to MAX_MPITASKS_PER_NODE for accounting (0 means TOTALPES is valid)</desc>
    </entry>
    <entry id="TOTALPES" value="482">
      <type>integer</type>
      <desc>total number of MPI tasks (setup automatically - DO NOT EDIT)</desc>
    </entry>
    <entry id="MAX_TASKS_PER_NODE" value="128">
      <type>integer</type>
      <desc>maximum number of tasks/ threads allowed per node </desc>
    </entry>
    <entry id="MAX_MPITASKS_PER_NODE" value="128">
      <type>integer</type>
      <desc>pes or cores per node for accounting purposes </desc>
    </entry>
    <entry id="MAX_CPUTASKS_PER_GPU_NODE" value="0">
      <type>integer</type>
      <desc> Number of CPU cores per GPU node used for simulation </desc>
    </entry>
    <entry id="MAX_GPUS_PER_NODE" value="0">
      <type>integer</type>
      <desc>maximum number of GPUs allowed per node </desc>
    </entry>
    <entry id="COSTPES_PER_NODE" value="$MAX_MPITASKS_PER_NODE">
      <type>integer</type>
      <desc>pes or cores per node for accounting purposes </desc>
    </entry>
  </group>
  <group id="run_pio">
    <entry id="PIO_ASYNCIO_NTASKS" value="0">
      <type>integer</type>
      <desc>Task count for asyncronous IO, only valid if PIO_ASYNC_INTERFACE is True</desc>
    </entry>
    <entry id="PIO_ASYNCIO_STRIDE" value="0">
      <type>integer</type>
      <desc>Stride of tasks for asyncronous IO, only valid if PIO_ASYNC_INTERFACE is True</desc>
    </entry>
    <entry id="PIO_ASYNCIO_ROOTPE" value="1">
      <type>integer</type>
      <desc>RootPE of tasks for asyncronous IO, only valid if PIO_ASYNC_INTERFACE is True</desc>
    </entry>
  </group>
</file>

@gold2718

Sorry, I meant the output from the ./pelayout command; I wasn't clear.
However, the information I need is in the file contents you posted, so thanks!

@mvertens

I've just checked the latest PE layout on derecho, and PIO_STRIDE is 128 ($MAX_MPITASKS_PER_NODE) for every component:

PIO_STRIDE: ['CPL:$MAX_MPITASKS_PER_NODE', 'ATM:$MAX_MPITASKS_PER_NODE', 'LND:$MAX_MPITASKS_PER_NODE', 'ICE:$MAX_MPITASKS_PER_NODE', 'OCN:$MAX_MPITASKS_PER_NODE', 'ROF:$MAX_MPITASKS_PER_NODE', 'GLC:$MAX_MPITASKS_PER_NODE', 'WAV:$MAX_MPITASKS_PER_NODE', 'ESP:$MAX_MPITASKS_PER_NODE']

(Screenshot of the derecho PE layout attached, 2024-11-22.)

@mvertens

@jedwards4b - do you have any input on the case where the number of tasks for a given component is less than PIO_STRIDE? This applies to the statement from @matsbn:
"A hypothesis is that $MAX_MPITASKS_PER_NODE is larger than the CICE processor count in this configuration, leading to the stop during I/O. The setting of PIO_STRIDE to 8 is rather arbitrary and the functioning and performance of other PIO_STRIDE settings have not been explored."

@jedwards4b

jedwards4b commented Nov 22, 2024

There should be code that prevents this situation from occurring. ICE_PIO_STRIDE needs to be <= ICE_NTASKS.

@mvertens

@jedwards4b - thanks for confirming this. I don't think there is code in CIME to prevent this.
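
For reference, a minimal sketch of the kind of guard being discussed, assuming a simple clamp of each component's PIO_STRIDE to its task count (Python; the function name and the example values are hypothetical illustrations, not existing CIME code):

# Hypothetical sketch, not CIME code: clamp any per-component PIO_STRIDE that
# exceeds that component's task count, as suggested above.
def clamp_pio_strides(ntasks, strides):
    clamped = {}
    for comp, stride in strides.items():
        n = ntasks.get(comp, 0)
        # keep the stride, but never let it exceed the component's task count
        clamped[comp] = min(stride, n) if n > 0 else stride
    return clamped

ntasks  = {"ICE": 96, "ROF": 16, "OCN": 354}
strides = {"ICE": 128, "ROF": 128, "OCN": 128}
print(clamp_pio_strides(ntasks, strides))  # {'ICE': 96, 'ROF': 16, 'OCN': 128}

A real check would presumably also need to decide whether to warn or adjust silently, and whether PIO_NUMTASKS should be recomputed after the clamp.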
