MPI backend of SST: scaling issue on ORNL Crusher #3439
Comments
Hi Franz. I'll look into the MPI issue, but it may have to wait for @vicentebolea, who's taking some vacation this month. However, I can weigh in on the UCX issues. Yes, the cray-ucx module is broken. The PR I just merged, #3437, lets you build with UCX on Crusher despite that, by adding "-DPC_UCX_FOUND=IGNORE -DUCX_DIR=/opt/cray/pe/cray-ucx/2.7.0-1/ucx/" to your CMake line. Unfortunately, even after that the UCX data plane doesn't seem to work there, despite building and linking properly. It looks like SST UCX needs at least version 1.9.0 (earlier versions don't have the required ucp request interface and so don't compile), but maybe it has to be newer still. I've found that it works on a cluster with UCX 1.11. Regardless, I'll put in an OLCF ticket for the UCX pkgconfig problem in the hope that it might be better on Frontier, if not on Crusher. |
Ah, then I'll try the UCX backend once more with the workaround, but I suppose that it won't work for me either. I also did not get UCX to run on my local machine, so it seems to be one of those backends that only likes some systems. It would be interesting to hear from Vicente once he's back from vacation whether he has been able to scale SST to a greater portion of the system yet, and whether there are any tricks. Otherwise, I fear that this needs to go to ORNL support? |
@sameehj Are you aware of a specific minimum version that might be required for the UCX data plane? We've inferred that it needs to be 1.9.0 or better simply because prior versions don't have the request API and won't compile. However, we're trying the dataplane on Crusher which has 1.9.0 (poorly installed, but extant) and we're not seeing completions for RDMA read operations. SstVerbose=5 output for a simple test is here: |
Hi @franzpoeschel, I was able to scale to the 100s of nodes with the MPI Dataplane. I am not sure what the issue with your setup could be. One thing that I notice is that the |
Concerning UCX on Crusher, Greg was apparently able to get it to run, but he had to explicitly unload the cray-mpich module. What I noticed about the MPI data plane was that data transfer seemed surprisingly slow on Crusher. I only loaded very low amounts of data in the reader, but loading data still sometimes took more than 4 seconds, while the same setup with UCX on other systems stays below half a second. Maybe there is something wrong with the MPI environment that I use? |
How many writers and readers were there in your setup per node? I can try specifying queuelimit=0 and see if this changes things. |
@franzpoeschel I'm not sure we've got performance numbers comparing the MPI dataplane to a working RDMA dataplane on an HPC machine. I'll try to see if I can run some things on Crusher so we can evaluate... |
@eisenhauer, sorry, just saw this. No, I don't know of a specific minimum version being required, but I have mainly tested with 1.11. Can you share the logs with UCX_LOG_LEVEL=data, please? |
Hi Sameeh. So in the intervening time we have managed to get the UCX dataplane working on Crusher. We had to explicitly unload the cray-mpich module and then load cray-mpich-ucx. I have not yet done real performance runs, but at least things work (and this is with UCX 1.9.0). |
I think I might have found the issue. Looking into the SstVerbose log again, I had missed the fact that the MPI data plane was not even found... |
Scale tests ran with around 100 writers and 100 readers. |
No luck, unfortunately. It seems that I first ran into this issue while the MPI data plane was loading correctly; I have now removed the
This means that each node hosts one writer and one reader? I now tested setting |
Note that
As for the persistence of the issue, can you make sure that both the client and the servers are using the MPI DP? |
There are two such ifdefs. After removing these ifdefs, the log says that both ends are using the MPI dataplane:
Reader side:
Full writer log: pic.err.txt
The exact same setup runs fine with the MPI dataplane at a lower node count. My suspicion is that this is not really an ADIOS2 issue, but rather an issue with the scalability of the MPI_Open_port/MPI_Comm_accept functionality on Crusher (a minimal sketch of this connection pattern follows below). |
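For reference, the functionality referred to above is MPI's dynamic process connection API. Below is a minimal, self-contained sketch of that pattern (an illustration only, not ADIOS2's actual SST code), with one server process opening a port and one client process connecting to it:
/* Illustration of the MPI_Open_port / MPI_Comm_accept / MPI_Comm_connect
 * handshake that connection-based data planes rely on. Run one instance
 * with the argument "server", then another with "client <port-string>". */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    if (argc > 1 && strcmp(argv[1], "server") == 0)
    {
        /* Server side: publish a port and wait for one client to connect. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("server port: %s\n", port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Comm_disconnect(&inter);
        MPI_Close_port(port);
    }
    else if (argc > 2 && strcmp(argv[1], "client") == 0)
    {
        /* Client side: connect to the port string passed on the command line. */
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Comm_disconnect(&inter);
    }
    MPI_Finalize();
    return 0;
}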
This is correct, I accidentally introduced this regression in #3407 |
The scaling issue seemingly depends not only on the number of MPI tasks, but also on the number of loaded chunks. The PIConGPU simulation that I use writes 32 ADIOS2 variables, with each rank writing one chunk. In my tests so far, the reader requested all chunks written by ranks on the same node (i.e. 32 variables * 8 writers on the same node). As I am only interested in particle data, I restricted the loading procedure to only 20 of the 32 variables, and now there is no hangup at 16 nodes. I will test whether this setup hangs at a higher node count. |
The new setup hangs at 128 nodes on Crusher, still running fine at 64 nodes. This might help me come up with an ADIOS2-only minimal example to reproduce the issue. |
Hello @sameehj, I have generated this data now. Streaming with SST-UCX works without issue within a single-node job. In order to run a multi-node job, specifying |
Hmm, rather odd. I took a look at your files. Interesting; I suspect two things:
I don't think your UCX logs are visible in the two-node case; we should be able to see the detailed UCX logs. Are you sure you specified UCX_LOG_LEVEL=data? Best regards, |
Thank you for the help, @sameehj
I guess that without these transports available, using UCX on that system is of no use? Or is there a transport among these that has any merit trying? I think that UCX has selected tcp so far. I checked whether I specified UCX_LOG_LEVEL=data in both setups; it seems that I did. It writes some output to the |
I think that I'm starting to narrow down the issue with the MPI transport. The problem is twofold. Say that n is the number of writers and m the number of readers:
Both issues cause hangups, the first at 16 nodes (n=16*8, m=16), the second at 128 nodes (n=128*8, m=128). I have now implemented workarounds that funnel the data through rank 0 (MPI_Gather -> rank 0 -> rank 0 -> MPI_Bcast); a sketch of this pattern follows below. I'll try to adapt my reproducer to this. |
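For illustration, here is a minimal sketch of that funnel pattern (this is not the actual PIConGPU/ADIOS2 workaround code; the function names and the inter-communicator inter between the writer and reader applications are assumptions, with inter obtained elsewhere, e.g. via the MPI_Comm_connect/MPI_Comm_accept handshake shown earlier). Only one message crosses the writer/reader boundary per step; all n->1 and 1->m traffic stays inside the respective intra-communicators:
#include <mpi.h>
#include <stdlib.h>
/* Writer side: n ranks each contribute one value; only writer rank 0
 * talks to the reader application. */
void writer_side(MPI_Comm writers, MPI_Comm inter, int my_value)
{
    int rank, size;
    MPI_Comm_rank(writers, &rank);
    MPI_Comm_size(writers, &size);
    int *all = (rank == 0) ? malloc(size * sizeof(int)) : NULL;
    MPI_Gather(&my_value, 1, MPI_INT, all, 1, MPI_INT, 0, writers); /* n -> 1 */
    if (rank == 0)
    {
        /* single point-to-point message to reader rank 0 across the
         * inter-communicator */
        MPI_Send(all, size, MPI_INT, 0, 0, inter);
        free(all);
    }
}
/* Reader side: rank 0 receives once, then fans the data out locally. */
void reader_side(MPI_Comm readers, MPI_Comm inter, int n_writers, int *out)
{
    int rank;
    MPI_Comm_rank(readers, &rank);
    if (rank == 0)
        MPI_Recv(out, n_writers, MPI_INT, 0, 0, inter, MPI_STATUS_IGNORE);
    MPI_Bcast(out, n_writers, MPI_INT, 0, readers); /* 1 -> m */
}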
I haven't tested the UCX dataplane with TCP/IP, only with an RDMA fabric. I'm not aware of these hangs; this needs some careful investigation and attention. It's nice to see that you have a workaround at the moment. Best of luck. |
This reproducer triggers both issues mentioned above on the Crusher system. With the MPI transport, n->1 and 1->m communication patterns are not scalable on the system. I don't know whether they should be, and neither do I know whether this points to other scaling issues that might come up at full Frontier scale. If I remember correctly, neither communication pattern was an issue with libfabric on Summit. The reproducer very closely resembles the IO patterns of PIConGPU. It uses (the metadata of) a BP4 dataset written by PIConGPU on 128 nodes as the basis for creating an SST stream. A second C++ code reads the stream, triggering both issues. The ZIP file contains:
|
We have received info from OLCF support that the UCX module on Crusher is not really supposed to be used. Since SST also has a direct TCP backend, I don't know whether it would really be worth trying to debug this further now. |
Franz, many thanks for providing example source code to replicate this
issue. I will look into it and get back to you.
Vicente
|
@franzpoeschel I think I might know what the issue could be; however, I am having trouble running your reproducer on Crusher, and not only your reproducer, but any sample application using MPI client/server routines. I wonder if you could confirm that your reproducer still works. |
I tried running SST+MPI on Frontier just now. I get an assertion error when trying to run this from inside
|
Yep, even the minimal example fails:
#include <mpi.h>
#include <stdlib.h>
#if !defined(MPICH)
#error "MPICH is the only supported library"
#endif
int main()
{
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
MPI_Finalize();
}
--->
> ./mpi_minimal
Assertion failed in file ../src/mpid/ch4/netmod/ofi/ofi_spawn.c at line 753: 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x7fca4c6499ab]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1fedbf4) [0x7fca4c083bf4]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x22a40d8) [0x7fca4c33a0d8]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x2027ef9) [0x7fca4c0bdef9]
/opt/cray/pe/lib64/libmpi_cray.so.12(MPI_Open_port+0x269) [0x7fca4bc2d839]
./mpi_minimal() [0x2018c8]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fca4965129d]
./mpi_minimal() [0x2017ea]
MPICH ERROR [Rank 0] [job id ] [Thu Apr 20 09:03:20 2023] [login03] - Abort(1): Internal error
|
I see, it's good to see that it is not only me. I have also tried a
previous version of the cray-mpich module, but it's still the same issue. I
will try loading older cray-mpich versions, but we definitely need to
report this. I wonder if we have the same issue on Frontier.
|
It was Frontier where I saw this. I want to try a few things tomorrow, and if those don't help, we definitely need to report it. |
I just sent a report; you are in CC. |
Got it. Thanks!
|
OLCF Support just responded; it seems that single-node jobs don't initialize networking by default, making
|
The workaround specified by HPE (see my post above) does not help; with this configuration, the job does not even start:
However, the hint from the email that "by default, the launcher and MPI will not do anything to set up networking when a job-step will only run on a single host" made me try a two-node job for SST-MPI streaming where both subjobs ran on both nodes. This makes
Unfortunately, streaming still does not work. The reader crashes as soon as it tries reading any data:
This being a closed-source custom implementation, I can't research the error further, but we'll need to get back to support... |
I have sent a report. |
Hi Franz,
many thanks for testing and reporting this new error. I will resume the
bugfixing after this is resolved or a workaround is found.
Vicente
|
The MPI transport seems to be working again, tested by using the SST hello world examples:
I needed to use at least 2 MPI ranks per job since the network does not get properly initialized otherwise, and I used 4 different nodes since otherwise running asynchronous jobs turns into a hell of Slurm workarounds. Writer output:
Reader output:
The point that at least 2 ranks per task are needed can be shown with a minimal example:
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#if !defined(MPICH)
#error "MPICH is the only supported library"
#endif
int main()
{
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
printf("MPICH %d.%d\n", MPI_VERSION, MPI_SUBVERSION);
MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
MPI_Finalize();
}
|
@franzpoeschel many thanks for notifying me and letting me know the limitations of the workaround. Yesterday, after my return from holidays, I was able to run the MPI DP on Crusher. I will be working on fixing this scalability issue. |
@franzpoeschel I have noticed that if you try to run two srun jobs without salloc, MPI_Comm_connect fails. This means that any application running ADIOS2 will either have to run inside an salloc allocation or be contained in a single srun invocation. |
Describe the bug
I am trying to use the MPI backend of SST in order to couple a PIConGPU simulation with an asynchronous data sink (currently a synthetic application that just loads data and then throws it away). Both the simulation and the data sink are parallel MPI applications.
There are 8 instances of PIConGPU running on each node (one per GPU) and additionally one CPU-only instance per node of the data sink. The data sink loads data from the simulation instances running on the same node, in order to ensure a scalable communication pattern.
I use weak scaling to scale the setup; it runs without trouble on 1, 8, and 12 nodes, with no obvious performance difference. On 16 nodes, the setup hangs and is eventually killed at the job's time limit. The last message printed by SstVerbose on the reading end is MpiReadReplyHandler: Connecting to MPI Server. The writing end seems to be blocking on the QueueFullPolicy=Block condition. The reader receives all metadata without issue, so the trouble seems to be in the data plane.
I attach the stderr logs of both sides (quite big files, unfortunately, since they contain output from all parallel instances); they include the SstVerbose log:
reader.err.txt
writer.err.txt
To me, this sounds more like an issue with the Cray MPI implementation only supporting a certain number of open MPI ports, and not like an ADIOS2 problem? Maybe it's also related to the MPI_Finalize problem described here: https://github.com/ornladios/ADIOS2/blob/master/docs/user_guide/source/advanced/ecp_hardware.rst
Are similar issues known? Have there been scaling tests on Crusher and have they been successful? Is there some parameter that I need to set?
To Reproduce
Complex setup, hard to reproduce. I don't know if a synthetic setup would show the same issue.
The engine parameters are
on the writing end:
on the reading end:
Expected behavior
Continued scaling as from 1 to 12 nodes.
Desktop (please complete the following information):
ADIOS2 git tag 1428da5 (current master)
Additional context
I was going to try the new UCX backend as an alternative, but the cray-ucx module on Crusher has apparently not been installed correctly (build paths leak into the pkgconfig), so I need to wait for system support to fix that. I assume that UCX-based SST has not been tried on the system yet?
Following up