MPI Failure when Running with 16+ Cores #550

@clarkpede

I've recently run into a problem with periodic geometry when running a RANS problem on 16 or more cores (256+ MPI tasks). While initializing the Jacobian structure for the turbulence model, the run fails in one of two ways, depending on the core count.

The first failure aborts the job with the following error message:

Fatal error in MPI_Sendrecv: Message truncated, error stack:
MPI_Sendrecv(249).................: MPI_Sendrecv(sbuf=0x2ee74f0, scount=10, MPI_DOUBLE, dest=19, stag=0, rbuf=0x2ee68e0, rcount=385, MPI_
MPIDI_CH3U_Receive_data_found(144): Message from rank 25 and tag 0 truncated; 3200 bytes received but buffer size is 3080
aborting job
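
For reference, the byte counts in that message can be converted back into element counts. The sketch below is only arithmetic on the numbers quoted in the error, assuming `MPI_DOUBLE` is 8 bytes on these machines; the variable and function names are my own and are not SU2 code.

```python
# Numbers copied from the MPI error message.
BYTES_PER_DOUBLE = 8    # assumption: MPI_DOUBLE is an 8-byte double

received_bytes = 3200   # "3200 bytes received"
buffer_bytes = 3080     # "buffer size is 3080"
posted_rcount = 385     # rcount=385 in the MPI_Sendrecv call


def bytes_to_doubles(nbytes):
    """Convert a byte count into a whole number of doubles."""
    count, remainder = divmod(nbytes, BYTES_PER_DOUBLE)
    if remainder != 0:
        raise ValueError(f"{nbytes} bytes is not a whole number of doubles")
    return count


sent_count = bytes_to_doubles(received_bytes)   # what rank 25 actually sent
buffer_count = bytes_to_doubles(buffer_bytes)   # what the receiver had room for
excess = sent_count - buffer_count

print(f"sender sent {sent_count} doubles, receiver posted {buffer_count}")
print(f"message is {excess} doubles too long for the receive buffer")
```

The receive buffer size matches the posted `rcount=385`, so the receiver's count is internally consistent; it is the message from rank 25 that is longer than the receiver expected.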

In the second case, the solver simply hangs indefinitely at the "Initialize Jacobian structure (SA model)" step. My guess is that an MPI send or receive is left unmatched.

I have not seen these problems at lower core counts (2-4 cores with 2-32 MPI tasks).

The errors seem to be tied to the way the periodic send/receives are set up. If I change the periodic boundaries to far-field boundaries, the error vanishes.
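
One way to narrow this down would be to dump each rank's planned periodic sends and receives and cross-check them offline. The sketch below is a hypothetical illustration of such a check, not SU2 code: the plan format, the function name `check_exchange_plan`, and the rank numbers and counts in the example are all made up for the demonstration. It flags the two symptoms described above: a message longer than the posted receive (truncation) and a send or receive with no partner (a possible hang).

```python
def check_exchange_plan(sends, recvs):
    """Cross-check a point-to-point exchange plan.

    sends: {(src, dst): count}      elements src intends to send to dst
    recvs: {(dst, src): max_count}  elements dst has room for from src
    Returns a list of (problem, src, dst) tuples.
    """
    problems = []
    for (src, dst), count in sorted(sends.items()):
        posted = recvs.get((dst, src))
        if posted is None:
            # Nobody is waiting for this message.
            problems.append(("unmatched send", src, dst))
        elif count > posted:
            # MPI allows a shorter message, but not a longer one.
            problems.append(("truncated", src, dst))
    for (dst, src) in sorted(recvs):
        if (src, dst) not in sends:
            # This receive would wait forever.
            problems.append(("unmatched receive", src, dst))
    return problems


# Made-up three-rank plan containing one instance of each problem.
sends = {(0, 1): 400, (1, 0): 385, (2, 0): 10}
recvs = {(1, 0): 385, (0, 1): 385, (1, 2): 10}

problems = check_exchange_plan(sends, recvs)
for kind, src, dst in problems:
    print(f"{kind}: rank {src} -> rank {dst}")
```

If the periodic setup is the cause, a check along these lines on the real send/receive lists should show which rank pairs disagree at the higher core counts.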

I've also done a lot of work to rule out possible causes:

  • I've generated the meshes using both SU2_MSH and the su2perio Fortran tool.
  • I've run this on two different supercomputers, with different MPI builds.
  • I've tested multiple different meshes with different resolutions.
  • I've tried changing the RANS model and steady/unsteady options.
  • I've even used a different solver (our hybrid solver) that's completely independent of the RANS solver classes. Same error.
  • The problem occurs whether or not the run starts from a restart file.

I've attached a minimal example that you can use to reproduce this yourself. It should be self-explanatory.

MPI_Failure_Example.tar.gz
