NorESM2.5 crash seemingly related to PIO_STRIDE setting #593

Open
matsbn opened this issue Nov 21, 2024 · 7 comments

@matsbn
Contributor

matsbn commented Nov 21, 2024

Describe the bug

  • NorESM version: noresm2_5_alpha7
  • HPC platform: betzy
  • Compiler (if applicable): intel
  • Compset (if applicable): NOIIAJRA
  • Resolution (if applicable): TL319_tn14
  • Error message (if applicable):
    123: [b3165:425392] *** An error occurred in MPI_Gather
    123: [b3165:425392] *** reported by process [23226195771392,123]
    123: [b3165:425392] *** on communicator MPI COMMUNICATOR 49 SPLIT FROM 44
    123: [b3165:425392] *** MPI_ERR_TRUNCATE: message truncated
    123: [b3165:425392] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    123: [b3165:425392] *** and potentially your MPI job)
    0: slurmstepd: error: *** STEP 1033510.0 ON b3164 CANCELLED AT 2024-11-19T15:13:33 ***

To Reproduce
Steps to reproduce the behavior:

  1. ./xmlchange BLOM_VCOORD=cntiso_hybrid,BLOM_TURBULENT_CLOSURE=
  2. ./xmlchange NTASKS_OCN=354

Expected behavior
The model stops unexpectedly during what I believe is an I/O operation. I have found that reducing PIO_STRIDE in env_run.xml from $MAX_MPITASKS_PER_NODE (128 on Betzy) to 8 solves the problem. A hypothesis is that $MAX_MPITASKS_PER_NODE is larger than the CICE processor count in this configuration, leading to the stop during I/O. The setting of PIO_STRIDE to 8 is rather arbitrary; the functioning and performance of other PIO_STRIDE settings have not been explored.
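
As a rough illustration of the hypothesis, the sketch below (Python, purely illustrative and not actual PIO/CIME logic) estimates how many I/O tasks each component would get for this PE layout, assuming the I/O task count is derived as roughly NTASKS divided by PIO_STRIDE when PIO_NUMTASKS is left at its default; the real rounding/clamping in PIO and CIME may differ.

# Illustrative only: estimate I/O tasks per component for this PE layout,
# assuming the count is roughly NTASKS // PIO_STRIDE (an assumption, not
# the exact PIO/CIME behaviour).
ntasks = {"ATM": 128, "CPL": 128, "OCN": 354, "WAV": 128, "GLC": 128,
          "ICE": 96, "ROF": 16, "LND": 128, "ESP": 1}

for stride in (128, 8):  # 128 = $MAX_MPITASKS_PER_NODE on Betzy, 8 = the workaround
    print(f"PIO_STRIDE = {stride}")
    for comp, n in sorted(ntasks.items()):
        io_tasks = n // stride  # zero for any component with NTASKS < stride
        print(f"  {comp}: {n} tasks -> {io_tasks} I/O task(s)")

Under that naive arithmetic, ICE (96 tasks), ROF (16 tasks) and ESP (1 task) end up with no I/O tasks at a stride of 128, which would be consistent with a stride larger than the component task count being the trigger.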

matsbn added the bug (Something isn't working) label Nov 21, 2024
matsbn added this to the NorESM2.5 milestone Nov 21, 2024
@gold2718

@matsbn, do you have a pelayout handy for a failing case?

@matsbn
Contributor Author

matsbn commented Nov 21, 2024

The env_mach_pes.xml is pasted in below (I could not attach xml files), or did you actually mean the pelayout file in the case folder?

<?xml version="1.0"?>
<file id="env_mach_pes.xml" version="2.0">
  <header>
    These variables CANNOT be modified once case_setup has been
    invoked without first invoking case_setup -reset

    NTASKS: the total number of MPI tasks, a negative value indicates nodes rather than tasks.
    NTHRDS: the number of OpenMP threads per MPI task.
    ROOTPE: the global mpi task of the component root task, if negative, indicates nodes rather than tasks.
    PSTRID: the stride of MPI tasks across the global set of pes (for now set to 1)
    NINST : the number of component instances (will be spread evenly across NTASKS)

    for example, for NTASKS = 8, NTHRDS = 2, ROOTPE = 32, NINST  = 2
    the MPI tasks would be placed starting on global pe 32 and each pe would be threaded 2-ways
    These tasks will be divided amongst both instances (4 tasks each).

    Note: PEs that support threading never have an MPI task associated
    with them for performance reasons.  As a result, NTASKS and ROOTPE
    are relatively independent of NTHRDS and they determine
    the layout of mpi processors between components.  NTHRDS is used
    to determine how those mpi tasks should be placed across the machine.

    The following values should not be set by the user since they'll be
    overwritten by scripts: TOTALPES, NTASKS_PER_INST
    </header>
  <comment>none</comment>
  <group id="mach_pes">
    <entry id="ESMF_AWARE_THREADING" value="FALSE">
      <type>logical</type>
      <valid_values>TRUE,FALSE</valid_values>
      <desc>TRUE indicates that the ESMF Aware threading method is used</desc>
    </entry>
    <entry id="ALLOCATE_SPARE_NODES" value="FALSE">
      <type>logical</type>
      <valid_values>TRUE,FALSE</valid_values>
      <desc>Allocate some spare nodes to handle node failures. The system will pick a reasonable number</desc>
    </entry>
    <entry id="FORCE_SPARE_NODES" value="-999">
      <type>integer</type>
      <desc>Force this exact number of spare nodes to be allocated</desc>
    </entry>
    <entry id="NTASKS">
      <type>integer</type>
      <values>
        <value compclass="ATM">128</value>
        <value compclass="CPL">128</value>
        <value compclass="OCN">354</value>
        <value compclass="WAV">128</value>
        <value compclass="GLC">128</value>
        <value compclass="ICE">96</value>
        <value compclass="ROF">16</value>
        <value compclass="LND">128</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>number of tasks for each component</desc>
    </entry>
    <entry id="NTASKS_PER_INST">
      <type>integer</type>
      <values>
        <value compclass="ATM">128</value>
        <value compclass="OCN">354</value>
        <value compclass="WAV">128</value>
        <value compclass="GLC">128</value>
        <value compclass="ICE">96</value>
        <value compclass="ROF">16</value>
        <value compclass="LND">128</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>Number of tasks per instance for each component. DO NOT EDIT: Set automatically by case.setup based on NTASKS, NINST and MULTI_DRIVER</desc>
    </entry>
    <entry id="NTHRDS">
      <type>integer</type>
      <values>
        <value compclass="ATM">1</value>
        <value compclass="CPL">1</value>
        <value compclass="OCN">1</value>
        <value compclass="WAV">1</value>
        <value compclass="GLC">1</value>
        <value compclass="ICE">1</value>
        <value compclass="ROF">1</value>
        <value compclass="LND">1</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>number of threads for each task in each component</desc>
    </entry>
    <entry id="ROOTPE">
      <type>integer</type>
      <values>
        <value compclass="ATM">0</value>
        <value compclass="CPL">0</value>
        <value compclass="OCN">128</value>
        <value compclass="WAV">0</value>
        <value compclass="GLC">0</value>
        <value compclass="ICE">32</value>
        <value compclass="ROF">16</value>
        <value compclass="LND">0</value>
        <value compclass="ESP">0</value>
      </values>
      <desc>ROOTPE (mpi task in MPI_COMM_WORLD) for each component</desc>
    </entry>
    <entry id="MULTI_DRIVER" value="TRUE">
      <type>logical</type>
      <valid_values>TRUE</valid_values>
      <desc>MULTI_DRIVER mode provides a separate driver/coupler component for each
    ensemble member.  All components must have an equal number of members.
    Multidriver is always true for nuopc, variable is left for compatibility with the mct driver</desc>
    </entry>
    <entry id="NINST" value="1">
      <type>integer</type>
      <desc>Number of instances of the model.
    </desc>
    </entry>
    <entry id="PSTRID">
      <type>integer</type>
      <values>
        <value compclass="ATM">1</value>
        <value compclass="CPL">1</value>
        <value compclass="OCN">1</value>
        <value compclass="WAV">1</value>
        <value compclass="GLC">1</value>
        <value compclass="ICE">1</value>
        <value compclass="ROF">1</value>
        <value compclass="LND">1</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>The mpi global processors stride associated with the mpi tasks for the a component</desc>
    </entry>
    <entry id="NGPUS_PER_NODE" value="0">
      <type>integer</type>
      <desc> Number of GPUs per node used for simulation </desc>
    </entry>
  </group>
  <group id="mach_pes_last">
    <entry id="COST_PES" value="512">
      <type>integer</type>
      <desc>pes or cores used relative to MAX_MPITASKS_PER_NODE for accounting (0 means TOTALPES is valid)</desc>
    </entry>
    <entry id="TOTALPES" value="482">
      <type>integer</type>
      <desc>total number of MPI tasks (setup automatically - DO NOT EDIT)</desc>
    </entry>
    <entry id="MAX_TASKS_PER_NODE" value="128">
      <type>integer</type>
      <desc>maximum number of tasks/ threads allowed per node </desc>
    </entry>
    <entry id="MAX_MPITASKS_PER_NODE" value="128">
      <type>integer</type>
      <desc>pes or cores per node for accounting purposes </desc>
    </entry>
    <entry id="MAX_CPUTASKS_PER_GPU_NODE" value="0">
      <type>integer</type>
      <desc> Number of CPU cores per GPU node used for simulation </desc>
    </entry>
    <entry id="MAX_GPUS_PER_NODE" value="0">
      <type>integer</type>
      <desc>maximum number of GPUs allowed per node </desc>
    </entry>
    <entry id="COSTPES_PER_NODE" value="$MAX_MPITASKS_PER_NODE">
      <type>integer</type>
      <desc>pes or cores per node for accounting purposes </desc>
    </entry>
  </group>
  <group id="run_pio">
    <entry id="PIO_ASYNCIO_NTASKS" value="0">
      <type>integer</type>
      <desc>Task count for asyncronous IO, only valid if PIO_ASYNC_INTERFACE is True</desc>
    </entry>
    <entry id="PIO_ASYNCIO_STRIDE" value="0">
      <type>integer</type>
      <desc>Stride of tasks for asyncronous IO, only valid if PIO_ASYNC_INTERFACE is True</desc>
    </entry>
    <entry id="PIO_ASYNCIO_ROOTPE" value="1">
      <type>integer</type>
      <desc>RootPE of tasks for asyncronous IO, only valid if PIO_ASYNC_INTERFACE is True</desc>
    </entry>
  </group>
</file>

@gold2718

Sorry, I meant the output from the ./pelayout command; I wasn't clear.
However, the information I need is in the file contents you posted, so thanks!

@mvertens

I've just checked the latest PE layout on derecho, and PIO_STRIDE is 128 ($MAX_MPITASKS_PER_NODE) for every component:

PIO_STRIDE: ['CPL:$MAX_MPITASKS_PER_NODE', 'ATM:$MAX_MPITASKS_PER_NODE', 'LND:$MAX_MPITASKS_PER_NODE', 'ICE:$MAX_MPITASKS_PER_NODE', 'OCN:$MAX_MPITASKS_PER_NODE', 'ROF:$MAX_MPITASKS_PER_NODE', 'GLC:$MAX_MPITASKS_PER_NODE', 'WAV:$MAX_MPITASKS_PER_NODE', 'ESP:$MAX_MPITASKS_PER_NODE']

(Screenshot of the derecho PE layout attached, 2024-11-22.)

@mvertens

@jedwards4b - do you have any input on the case where the number of tasks for a given component is less than PIO_STRIDE? This applies to the statement from @matsbn:
"A hypothesis is that $MAX_MPITASKS_PER_NODE is larger than the CICE processor count in this configuration, leading to the stop during I/O. The setting of PIO_STRIDE to 8 is rather arbitrary and the functioning and performance of other PIO_STRIDE settings have not been explored."

@jedwards4b

jedwards4b commented Nov 22, 2024

There should be code that prevents this situation from occurring. ICE_PIO_STRIDE needs to be <= ICE_NTASKS.

@mvertens

@jedwards4b - thanks for confirming this. I don't think there is code in CIME to prevent this.
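
For reference, a minimal sketch of the kind of guard being discussed, assuming a simple clamp of each component's PIO_STRIDE to its task count (Python; the function name and the example values are hypothetical illustrations, not existing CIME code):

# Hypothetical sketch, not CIME code: clamp any per-component PIO_STRIDE that
# exceeds that component's task count, as suggested above.
def clamp_pio_strides(ntasks, strides):
    clamped = {}
    for comp, stride in strides.items():
        n = ntasks.get(comp, 0)
        # keep the stride, but never let it exceed the component's task count
        clamped[comp] = min(stride, n) if n > 0 else stride
    return clamped

ntasks  = {"ICE": 96, "ROF": 16, "OCN": 354}
strides = {"ICE": 128, "ROF": 128, "OCN": 128}
print(clamp_pio_strides(ntasks, strides))  # {'ICE': 96, 'ROF': 16, 'OCN': 128}

A real check would presumably also need to decide whether to warn or adjust silently, and whether PIO_NUMTASKS should be recomputed after the clamp.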
