MPI backend of SST: scaling issue on ORNL Crusher #3439

Closed
franzpoeschel opened this issue Jan 18, 2023 · 39 comments · Fixed by #3588
@franzpoeschel
Contributor

Describe the bug
I am trying to use the MPI backend of SST to couple a PIConGPU simulation with an asynchronous data sink (currently a synthetic application that just loads data and then throws it away). Both the simulation and the data sink are parallel MPI applications.

There are 8 instances of PIConGPU running on each node (one per GPU) and additionally one CPU-only instance of the data sink per node. The data sink loads data from the simulation instances running on the same node, in order to ensure a scalable communication pattern.

I use weak scaling to grow the setup; it runs without trouble on 1, 8, and 12 nodes, with no obvious performance difference. On 16 nodes, the setup hangs and is eventually killed when the job's time limit is reached. The last message printed by SstVerbose on the reading end is MpiReadReplyHandler: Connecting to MPI Server. The writing end appears to block on the QueueFullPolicy=Block condition. The reader receives all metadata without issue, so the trouble seems to be in the data plane.

I attach the stderr logs of both sides (unfortunately quite big files, since they contain output from all parallel instances); they include the SstVerbose log:
reader.err.txt
writer.err.txt

To me, this sounds more like an issue with the Cray MPI implementation supporting only a limited number of open MPI ports, rather than an ADIOS2 problem. Maybe it's also related to the MPI_Finalize problem described here: https://github.com/ornladios/ADIOS2/blob/master/docs/user_guide/source/advanced/ecp_hardware.rst

Are similar issues known? Have there been scaling tests on Crusher and have they been successful? Is there some parameter that I need to set?

To Reproduce
Complex setup, hard to reproduce. I don't know if a synthetic setup would show the same issue.
The engine parameters are

on the writing end:

 "QueueLimit": "1"
"DataTransport": "mpi"
"InitialBufferSize": "4Gb" // should be irrelevant due to MarshalMethod
"Profile": "Off"
"Threads": "7"
"MarshalMethod": "BP5"

on the reading end:

"DataTransport": "mpi"
"Profile": "Off"
"OpenTimeoutSecs": "6000"
"SpeculativePreloadMode": "OFF"

Expected behavior
Continued scaling as from 1 to 12 nodes.

Desktop (please complete the following information):

> module list

Currently Loaded Modules:
  1) craype-x86-trento    4) perftools-base/22.06.0                  7) tmux/3.2a   10) craype/2.7.16          13) PrgEnv-cray/8.2.0  16) craype-accel-amd-gfx90a  19) cmake/3.21.3          22) zlib/1.2.11    25) freetype/2.11.0
  2) libfabric/1.15.0.0   5) xpmem/2.4.4-2.3_11.2__gff0e1d9.shasta   8) gdb/10.2    11) cray-dsmml/0.2.2       14) xalt/1.3.0         17) rocm/5.1.0               20) boost/1.79.0-cxx17    23) git/2.35.1
  3) craype-network-ofi   6) cray-pmi/6.1.3                          9) cce/14.0.2  12) cray-libsci/22.06.1.3  15) DefApps/default    18) cray-mpich/8.1.21        21) cray-python/3.9.12.1  24) libpng/1.6.37

> export CXX=hipcc
> export CXXFLAGS="$CXXFLAGS -I${MPICH_DIR}/include"
> export LDFLAGS="$LDFLAGS -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa"
> export CFLAGS="$CXXFLAGS -I${MPICH_DIR}/include"
> cmake ..
-- The C compiler identification is Clang 14.0.6
-- The CXX compiler identification is Clang 14.0.0
-- Cray Programming Environment 2.7.16 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.16/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/rocm-5.1.0/bin/hipcc - skipped

ADIOS2 git tag 1428da5 (current master)

Additional context
I was going to try the new UCX backend as an alternative, but the cray-ucx module on Crusher has apparently not been installed correctly (build paths leak into the pkgconfig files), so I need to wait for the system support to fix that. I assume that UCX-based SST has not been tried on the system yet?

Following up

@eisenhauer
Member

Hi Franz. I'll look into the MPI issue, but it may have to wait for @vicentebolea, who's taking some vacation this month. However, I can weigh in on the UCX issues. Yes, the cray-ucx module is broken. The PR I just merged, #3437, lets you build with UCX on Crusher despite that, by adding "-DPC_UCX_FOUND=IGNORE -DUCX_DIR=/opt/cray/pe/cray-ucx/2.7.0-1/ucx/" to your CMake line. Unfortunately, the UCX data plane doesn't seem to work there even after that, despite building and linking properly. It looks like SST UCX needs at least version 1.9.0 (earlier versions don't have the required ucp request interface and so don't compile), but maybe it has to be even newer. I've found that it works on a cluster with UCX 1.11. Regardless, I'll put in an OLCF ticket for the UCX pkgconfig problem in the hope that it might be better on Frontier, if not on Crusher.
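
For reference, a hedged sketch of such a configure invocation; only the two UCX flags come from the comment above, any other options you normally pass stay unchanged:

cmake .. \
  -DPC_UCX_FOUND=IGNORE \
  -DUCX_DIR=/opt/cray/pe/cray-ucx/2.7.0-1/ucx/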

@franzpoeschel
Contributor Author

Ah, then I'll try the UCX backend once more with the workaround, but I suppose that it won't work for me either. I also did not get UCX to run on my local machine, so it seems to be one of those backends that only likes some systems.

It would be interesting to hear from Vicente once he's back from vacation if he was able to scale SST to a greater portion of the system yet, and if there are any tricks. Otherwise, I fear that this needs to go to the ORNL support?

@eisenhauer
Member

eisenhauer commented Jan 18, 2023

@sameehj Are you aware of a specific minimum version that might be required for the UCX data plane? We've inferred that it needs to be 1.9.0 or better simply because prior versions don't have the request API and won't compile. However, we're trying the dataplane on Crusher which has 1.9.0 (poorly installed, but extant) and we're not seeing completions for RDMA read operations. SstVerbose=5 output for a simple test is here:
ucx_output.txt

@vicentebolea
Collaborator

It would be interesting to hear from Vicente once he's back from vacation if he was able to scale SST to a greater portion of the system yet, and if there are any tricks.

Hi @franzpoeschel, I was able to scale to hundreds of nodes with the MPI DataPlane. I am not sure what the issue with your setup could be. One thing that I notice is the QueueLimit=1: the scale tests used a higher value, 50 or so. QueueLimit=1 might have triggered a deadlock that was missed during the scale tests. I'll have a look into that.

@vicentebolea vicentebolea self-assigned this Jan 23, 2023
@franzpoeschel
Contributor Author

Concerning UCX on Crusher, Greg was apparently able to get it to run, but he had to explicitly unload the cray-mpich module for that. I can confirm that this also resulted in a working UCX backend for me (apart from many warnings on the terminal). Unfortunately, I can't seem to get PIConGPU to run in combination with the UCX-based MPICH on Crusher (an error outside ADIOS2), so I can't do a production test right now.

What I noticed about the MPI data plane was that data transfer seemed surprisingly slow on Crusher. I only loaded very small amounts of data in the reader, but loading sometimes still took more than 4 seconds, while the same setup with UCX on other systems stays below half a second. Maybe there is something wrong with the MPI environment that I use?

@franzpoeschel
Contributor Author

Hi @franzpoeschel, I was able to scale to hundreds of nodes with the MPI DataPlane. I am not sure what the issue with your setup could be. One thing that I notice is the QueueLimit=1: the scale tests used a higher value, 50 or so. QueueLimit=1 might have triggered a deadlock that was missed during the scale tests. I'll have a look into that.

How many writers and readers were there in your setup per node? I can try specifying QueueLimit=0 and see if that changes things.

@eisenhauer
Member

@franzpoeschel I'm not sure we've got performance numbers comparing the MPI dataplane to a working RDMA dataplane on an HPC machine. I'll try to see if I can run some things on Crusher so we can evaluate...

@sameehj
Contributor

sameehj commented Jan 23, 2023

@sameehj Are you aware of a specific minimum version that might be required for the UCX data plane? We've inferred that it needs to be 1.9.0 or better simply because prior versions don't have the request API and won't compile. However, we're trying the dataplane on Crusher which has 1.9.0 (poorly installed, but extant) and we're not seeing completions for RDMA read operations. SstVerbose=5 output for a simple test is here: ucx_output.txt

@eisenhauer, Sorry just saw this. No I don't know of any limitation on a specific version needed. But I have mainly tested with 1.11. Can you share the logs with UCX_LOG_LEVEL=data please?

@eisenhauer
Member

Hi Sameeh. So in the intervening time we have managed to get the UCX dataplane working on Crusher. We had to explicitly unload the cray-mpich module and then load cray-mpich-ucx. I have not yet done real performance runs, but at least things work (and this is with UCX 1.9.0).

@franzpoeschel
Contributor Author

I think I might have found the issue. Looking into the SstVerbose log again, I had missed the fact that the MPI data plane was not even found.
It seems that this commit 34e6160 removed the set(ADIOS2_SST_HAVE_MPI TRUE) line from the cmake/DetectOptions.cmake file, so ADIOS2 does not consider the MPI data plane. I updated the ADIOS2 version for UCX support, and probably since then I have never actually used the MPI data plane. I'll check whether this fixes it.

@vicentebolea
Collaborator

How many writers and readers were there in your setup per node?

Scale tests used around 100 writers and 100 readers.

@franzpoeschel
Contributor Author

I think I might have found the issue.

No luck, unfortunately. It seems that I first ran into this issue back when the MPI data plane was loading correctly. I have now removed the #ifdef HAVE_SST_MPI lines, and the MPI data plane is definitely loading according to the log, but the issue persists.

Scale tests used around 100 writers and 100 readers.

This means that each node hosts one writer and one reader?
This might actually be too small to trigger the issue. My test runs fine on 12 nodes, that is 96 writers (8 per node) and 12 readers (1 per node).
I see the hangup happening on 16 nodes, i.e. 128 writers and 16 readers.

I now tested setting QueueLimit=0. The consequence is that the simulation runs to completion (I am running a rather small simulation, so the extra memory needed by the queue is no issue), but the reader still hangs at the first step.

@vicentebolea
Collaborator

It seems that this commit 34e6160 removed the set(ADIOS2_SST_HAVE_MPI TRUE) line from the cmake/DetectOptions.cmake file, so ADIOS2 does not consider the MPI data plane

Note that ADIOS2_SST_HAVE_MPI is not used anymore. However, that does not mean that the MPI data plane is disabled; what happens is that:

  • The MPI dataplane will be disabled if the following test build/run fails:
        #include <mpi.h>
        #include <stdlib.h>

        #if !defined(MPICH)
        #error "MPICH is the only supported library"
        #endif

        int main()
        {
          MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
          MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
          MPI_Finalize();
        }

As for the persistence of the issue, can you make sure that both the client and the servers are using the MPI DP?
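
(For context, a hedged sketch, not the actual DetectOptions.cmake code, of how such a probe is typically wired up with CMake's CheckCSourceRuns module; the result variable name is illustrative.)

find_package(MPI REQUIRED COMPONENTS C)
include(CheckCSourceRuns)

set(CMAKE_REQUIRED_LIBRARIES MPI::MPI_C)
check_c_source_runs([=[
#include <mpi.h>
#include <stdlib.h>

#if !defined(MPICH)
#error "MPICH is the only supported library"
#endif

int main()
{
  MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
  MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
  MPI_Finalize();
}
]=] HAVE_MPI_CLIENT_SERVER)
unset(CMAKE_REQUIRED_LIBRARIES)

# Enable the MPI data plane only when the probe both compiles and runs.
if(HAVE_MPI_CLIENT_SERVER)
  message(STATUS "SST: enabling the MPI data plane")
endif()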

@franzpoeschel
Contributor Author

franzpoeschel commented Jan 25, 2023

Note that ADIOS2_SST_HAVE_MPI is not used anymore. However, that does not mean that the MPI data plane is disabled; what happens is that:

There are two #ifdef SST_HAVE_MPI blocks in source/adios2/toolkit/sst/dp/dp.c that don't get activated in my builds any more, and I find no place in the build system where that macro would be activated.

After removing these ifdefs, the log says that both ends are using the MPI dataplane:
Writer side:

Opening Stream "openPMD/simData"
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable)
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Stream "openPMD/simData" waiting for 1 readers
MpiInitWriter initialized addr=0x1b44070

Reader side:

Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable)
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Sending Reader Activate messages to writer

Full writer log pic.err.txt
Full reader log pipe.err.txt

The exact same setup is running fine with the MPI dataplane at a lower node count.

My suspicion is that this is not really an ADIOS2 issue, but rather an issue with the scalability of the MPI_Open_port/MPI_Comm_accept functionality on Crusher.

@vicentebolea
Collaborator

There are two #ifdef SST_HAVE_MPI blocks in source/adios2/toolkit/sst/dp/dp.c that don't get activated in my builds any more, and I find no place in the build system where that macro would be activated.

This is correct; I accidentally introduced this regression in #3407.

@franzpoeschel
Contributor Author

The scaling issue seemingly depends not only on the number of MPI tasks, but also on the number of loaded chunks. The PIConGPU simulation that I use writes 32 ADIOS2 variables, each rank writing one chunk. In my tests so far, the reader requested all chunks written by ranks on the same node (i.e. 32 variables * 8 writers on the same node).
(Additionally, at the beginning of the simulation there is one variable written only by rank 0, but read by everyone.)

As I am only interested in particle data, I restricted the loading procedures to only 20 of the 32 variables, and now there is no hangup at 16 nodes. I will test whether this setup hangs at a higher node count.

@franzpoeschel
Contributor Author

The new setup hangs at 128 nodes on Crusher, still running fine at 64 nodes. This might help me come up with an ADIOS2-only minimal example to reproduce the issue.

@franzpoeschel
Contributor Author

@sameehj Are you aware of a specific minimum version that might be required for the UCX data plane? We've inferred that it needs to be 1.9.0 or better simply because prior versions don't have the request API and won't compile. However, we're trying the dataplane on Crusher which has 1.9.0 (poorly installed, but extant) and we're not seeing completions for RDMA read operations. SstVerbose=5 output for a simple test is here: ucx_output.txt

@eisenhauer, Sorry just saw this. No I don't know of any limitation on a specific version needed. But I have mainly tested with 1.11. Can you share the logs with UCX_LOG_LEVEL=data please?

Hello @sameehj, I have generated this data now. Streaming with SST-UCX works without issue within a single-node job. In order to run a multi-node job, specifying UCX_TLS=all is necessary, otherwise even MPI_Init will fail (using UCX requires using the UCX-backed MPI implementation. As Greg says, extant but somewhat broken).
However, SST-UCX hangs on two nodes even with this setting.
The zipfile below has the stdout and stderr of an SST writer and an SST reader, executed successfully on one node and without success on two nodes. The stderr log shows the SstVerbose log, the stdout log has the UCX log.

output.zip

@sameehj
Contributor

sameehj commented Feb 16, 2023

@sameehj Are you aware of a specific minimum version that might be required for the UCX data plane? We've inferred that it needs to be 1.9.0 or better simply because prior versions don't have the request API and won't compile. However, we're trying the dataplane on Crusher which has 1.9.0 (poorly installed, but extant) and we're not seeing completions for RDMA read operations. SstVerbose=5 output for a simple test is here: ucx_output.txt

@eisenhauer, Sorry just saw this. No I don't know of any limitation on a specific version needed. But I have mainly tested with 1.11. Can you share the logs with UCX_LOG_LEVEL=data please?

Hello @sameehj, I have generated this data now. Streaming with SST-UCX works without issue within a single-node job. In order to run a multi-node job, specifying UCX_TLS=all is necessary, otherwise even MPI_Init will fail (using UCX requires using the UCX-backed MPI implementation. As Greg says, extant but somewhat broken). However, SST-UCX hangs on two nodes even with this setting. The zipfile below has the stdout and stderr of an SST writer and an SST reader, executed successfully on one node and without success on two nodes. The stderr log shows the SstVerbose log, the stdout log has the UCX log.

output.zip

Hmm, rather odd; I took a look at your files.

Interesting, I suspect two things:

  1. SpeculativePreload is set to Auto; can you set it to off?
  2. I set UCX_TLS to the specific RDMA transports I want to use, e.g. UCX_TLS=ud_mlx,rc_mlx.

I don't think your UCX logs are visible in the two-node case; we should be able to see the detailed UCX logs. Are you sure you specified UCX_LOG_LEVEL=data?

best regards,
Sameeh

@franzpoeschel
Contributor Author

Thank you for the help, @sameehj
The SpeculativePreloadMode is interpreted only by the reader, where it is turned off. I tried specifying it for the writer, too, but it did not help.
So far, I specified UCX_TLS=all. Trying to use UCX_TLS=rc,ud leads to the following error:

ucp_context.c:731  UCX  WARN  transports 'rc','ud' are not available, please use one or more of: cma, mm, posix, self, shm, sm, sysv, tcp, xpmem

I guess that without these transports available, using UCX on that system is of little use? Or is there a transport among these that would be worth trying? I think that UCX has selected tcp so far.

I checked whether I specified UCX_LOG_LEVEL=data in both setups; it seems that I did. It writes some output to the .out files, but not much.

@franzpoeschel
Contributor Author

franzpoeschel commented Feb 22, 2023

I think that I'm starting to narrow down the issue with the MPI transport. The problem is twofold. Say that n is the count of writers, m the count of readers:

  • n -> 1 communication patterns. We have datasets to which every rank contributes a few items and that are loaded only by rank 0 of the reader. The end of the log is:
     0: MpiReadReplyHandler: Read recv from rank=126,condition=1538,size=8
     0: MpiReadReplyHandler: Read recv from rank=126,condition=1666,size=8
     0: MpiReadReplyHandler: Read recv from rank=126,condition=1794,size=8
     0: MpiReadReplyHandler: Read recv from rank=126,condition=1922,size=8
     0: MpiReadReplyHandler: Read recv from rank=126,condition=2050,size=8
     0: MpiReadReplyHandler: Read recv from rank=126,condition=2178,size=8
     0: MpiReadReplyHandler: Read recv from rank=126,condition=2306,size=8
     0: MpiReadReplyHandler: Read recv from rank=126,condition=2434,size=8
     0: MpiReadReplyHandler: Read recv from rank=127,condition=434,size=8
     0: MpiReadReplyHandler: Connecting to MPI Server
    
    (The "from rank"s are counting up from 0, all log messages are from reading rank 0)
  • 1 -> m communication patterns. We have some datasets that are needed on every reading rank, but that are written in their entirety by rank 0. The end of the log is
    26: MpiReadReplyHandler: Connecting to MPI Server
    26: Memory read to rank 0 with condition 1 andlength 11264 has completed
    110: MpiReadReplyHandler: Read recv from rank=0,condition=1,size=11264
    110: MpiReadReplyHandler: Connecting to MPI Server
    110: Memory read to rank 0 with condition 1 andlength 11264 has completed
    102: MpiReadReplyHandler: Read recv from rank=0,condition=1,size=11264
    102: MpiReadReplyHandler: Connecting to MPI Server
    102: Memory read to rank 0 with condition 1 andlength 11264 has completed
    30: MpiReadReplyHandler: Read recv from rank=0,condition=1,size=11264
    
    The "has completed" message for rank 30 never appears.

Both issues cause hangups, the first at 16 nodes (n = 16*8 = 128, m = 16), the second at 128 nodes (n = 128*8 = 1024, m = 128). I have now implemented workarounds that route these small datasets as MPI_Gather -> rank 0 -> rank 0 -> MPI_Bcast (sketched below).
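
A minimal sketch of that workaround, assuming plain MPI and an illustrative uint64_t payload; only the collective part is shown, with the rank 0 -> rank 0 exchange through SST left as comments:

#include <mpi.h>
#include <stdint.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Writer side (n -> 1): instead of letting the reader pull one tiny chunk
    // from every writer rank, gather the per-rank contributions on rank 0 ...
    uint64_t myContribution = 42 + rank; // e.g. a per-rank particle-patch entry
    std::vector<uint64_t> gathered(rank == 0 ? size : 0);
    MPI_Gather(&myContribution, 1, MPI_UINT64_T, gathered.data(), 1, MPI_UINT64_T,
               0, MPI_COMM_WORLD);
    // ... and let only writer rank 0 Put() the aggregated dataset into the SST
    // stream, so the data plane sees a single rank-0 -> rank-0 transfer.

    // Reader side (1 -> m): only reader rank 0 Get()s the small dataset from
    // the stream, then broadcasts it to all reading ranks.
    uint64_t small = 0;
    if (rank == 0)
    {
        small = gathered.empty() ? 0 : gathered[0]; // stand-in for the SST Get()
    }
    MPI_Bcast(&small, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}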

I'll try to adapt my reproducer to this.

@sameehj
Contributor

sameehj commented Feb 23, 2023

I haven't tested the UCX dataplane with TCP/IP, only with an RDMA fabric. I'm not aware of these hangs; this needs some careful investigation and attention. It's nice to see that you have a workaround for the moment. Best of luck.

@franzpoeschel
Contributor Author

reproducer.zip

This reproducer triggers both issues mentioned above on the Crusher system: n->1 and 1->m communication patterns do not scale on the system when using the MPI transport. I don't know whether they should, and neither do I know whether this points to other scaling issues that might come up at full Frontier scale. If I remember correctly, neither communication pattern was an issue with libfabric on Summit.

The reproducer very closely resembles the IO patterns of PIConGPU. It uses (the metadata of) a BP4 dataset written by PIConGPU on 128 nodes as the basis for creating an SST stream. A second C++ code reads the stream, triggering both issues.
The actual bulk of the reading works well enough; the scaling issue is with a handful of smaller datasets that are exchanged in n->1 or 1->m patterns.

The ZIP file contains:

  1. The metadata part of the BP output of PIConGPU
  2. A synthetic writer and reader that resembles the IO workflow of PIConGPU, and a CMakeLists.txt for compiling it
  3. A README.md that explains how to set this up and how the issues can be observed
  4. A submit.sh batch script
> bpls simData_00000.bp
  float     /data/0/fields/B/x                                      {256, 2048, 768}
  float     /data/0/fields/B/y                                      {256, 2048, 768}
  float     /data/0/fields/B/z                                      {256, 2048, 768}
  float     /data/0/fields/E/x                                      {256, 2048, 768}
  float     /data/0/fields/E/y                                      {256, 2048, 768}
  float     /data/0/fields/E/z                                      {256, 2048, 768}
  float     /data/0/fields/e_all_chargeDensity                      {256, 2048, 768}
  float     /data/0/fields/e_all_energyDensity                      {256, 2048, 768}
  float     /data/0/fields/i_all_chargeDensity                      {256, 2048, 768}
  float     /data/0/fields/i_all_energyDensity                      {256, 2048, 768}
  uint64_t  /data/0/fields/picongpu_idProvider/nextId               {8, 16, 8}
  uint64_t  /data/0/fields/picongpu_idProvider/startId              {8, 16, 8}
  float     /data/0/particles/e/momentum/x                          {10066329600}
  float     /data/0/particles/e/momentum/y                          {10066329600}
  float     /data/0/particles/e/momentum/z                          {10066329600}
  uint64_t  /data/0/particles/e/particlePatches/extent/x            {1024}
  uint64_t  /data/0/particles/e/particlePatches/extent/y            {1024}
  uint64_t  /data/0/particles/e/particlePatches/extent/z            {1024}
  uint64_t  /data/0/particles/e/particlePatches/numParticles        {1024}
  uint64_t  /data/0/particles/e/particlePatches/numParticlesOffset  {1024}
  uint64_t  /data/0/particles/e/particlePatches/offset/x            {1024}
  uint64_t  /data/0/particles/e/particlePatches/offset/y            {1024}
  uint64_t  /data/0/particles/e/particlePatches/offset/z            {1024}
  float     /data/0/particles/e/position/x                          {10066329600}
  float     /data/0/particles/e/position/y                          {10066329600}
  float     /data/0/particles/e/position/z                          {10066329600}
  int32_t   /data/0/particles/e/positionOffset/x                    {10066329600}
  int32_t   /data/0/particles/e/positionOffset/y                    {10066329600}
  int32_t   /data/0/particles/e/positionOffset/z                    {10066329600}
  float     /data/0/particles/e/weighting                           {10066329600}
  float     /data/0/particles/i/momentum/x                          {10066329600}
  float     /data/0/particles/i/momentum/y                          {10066329600}
  float     /data/0/particles/i/momentum/z                          {10066329600}
  uint64_t  /data/0/particles/i/particlePatches/extent/x            {1024}
  uint64_t  /data/0/particles/i/particlePatches/extent/y            {1024}
  uint64_t  /data/0/particles/i/particlePatches/extent/z            {1024}
  uint64_t  /data/0/particles/i/particlePatches/numParticles        {1024}
  uint64_t  /data/0/particles/i/particlePatches/numParticlesOffset  {1024}
  uint64_t  /data/0/particles/i/particlePatches/offset/x            {1024}
  uint64_t  /data/0/particles/i/particlePatches/offset/y            {1024}
  uint64_t  /data/0/particles/i/particlePatches/offset/z            {1024}
  float     /data/0/particles/i/position/x                          {10066329600}
  float     /data/0/particles/i/position/y                          {10066329600}
  float     /data/0/particles/i/position/z                          {10066329600}
  int32_t   /data/0/particles/i/positionOffset/x                    {10066329600}
  int32_t   /data/0/particles/i/positionOffset/y                    {10066329600}
  int32_t   /data/0/particles/i/positionOffset/z                    {10066329600}
  float     /data/0/particles/i/weighting                           {10066329600}

@franzpoeschel
Contributor Author

I haven't tested the UCX dataplane with TCP/IP, only with an RDMA fabric. I'm not aware of these hangs; this needs some careful investigation and attention. It's nice to see that you have a workaround for the moment. Best of luck.

We have received info from OLCF support that the UCX module on Crusher is not really supposed to be used. Since SST also has a direct TCP backend, I don't know if it is really worth trying to debug this further right now.

@vicentebolea
Collaborator

vicentebolea commented Feb 23, 2023 via email

@vicentebolea
Collaborator

@franzpoeschel I think I might know what the issue could be; however, I am having trouble running your reproducer on Crusher, and not only your reproducer but any sample application using MPI client/server routines. I wonder if you could confirm that your reproducer still works.

@franzpoeschel
Contributor Author

I tried running SST+MPI on Frontier just now. I get an assertion error from inside MPI_Open_port when trying to run this, so it looks like that is indeed broken on the system.

Assertion failed in file ../src/mpid/ch4/netmod/ofi/ofi_spawn.c at line 753: 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x7f73d22249ab]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1fedbf4) [0x7f73d1c5ebf4]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x22a40d8) [0x7f73d1f150d8]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x2027ef9) [0x7f73d1c98ef9]
/opt/cray/pe/lib64/libmpi_cray.so.12(MPI_Open_port+0x269) [0x7f73d1808839]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core.so.2(+0x7b1f15) [0x7f73c597ef15]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core.so.2(WriterParticipateInReaderOpen+0x206) [0x7f73c596bf86]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core.so.2(SstWriterOpen+0x23e) [0x7f73c596cc5e]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core.so.2(adios2::core::engine::SstWriter::SstWriter(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)+0xd7) [0x7f73c58ed647]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core.so.2(std::shared_ptr<adios2::core::Engine> adios2::core::IO::MakeEngine<adios2::core::engine::SstWriter>(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)+0x66) [0x7f73c54b12b6]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core_mpi.so.2(std::_Function_handler<std::shared_ptr<adios2::core::Engine> (adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm), std::shared_ptr<adios2::core::Engine> (*)(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)>::_M_invoke(std::_Any_data const&, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode&&, adios2::helper::Comm&&)+0x39) [0x7f73c59e8609]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core.so.2(adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)+0x843) [0x7f73c5486333]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_core.so.2(adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode)+0x50) [0x7f73c5487ad0]
/ccs/home/fpoeschel/frontier_env/local/lib64/libadios2_cxx11.so.2(adios2::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode)+0xdf) [0x7f73d3fb8b2f]

@franzpoeschel
Contributor Author

Yep, even the minimal example fails:

#include <mpi.h>
#include <stdlib.h>

#if !defined(MPICH)
#error "MPICH is the only supported library"
#endif

int main()
{
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
    MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
    MPI_Finalize();
}

--->

> ./mpi_minimal 
Assertion failed in file ../src/mpid/ch4/netmod/ofi/ofi_spawn.c at line 753: 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x7fca4c6499ab]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1fedbf4) [0x7fca4c083bf4]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x22a40d8) [0x7fca4c33a0d8]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x2027ef9) [0x7fca4c0bdef9]
/opt/cray/pe/lib64/libmpi_cray.so.12(MPI_Open_port+0x269) [0x7fca4bc2d839]
./mpi_minimal() [0x2018c8]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fca4965129d]
./mpi_minimal() [0x2017ea]
MPICH ERROR [Rank 0] [job id ] [Thu Apr 20 09:03:20 2023] [login03] - Abort(1): Internal error

@vicentebolea
Collaborator

vicentebolea commented Apr 20, 2023 via email

@franzpoeschel
Contributor Author

It was Frontier where I saw this. I want to try some things tomorrow, and if those don't help, we definitely need to report it.

@franzpoeschel
Contributor Author

I just sent a report; you are in CC.

@vicentebolea
Collaborator

vicentebolea commented Apr 21, 2023 via email

@franzpoeschel
Contributor Author

OLCF support just responded; it seems that single-node jobs don't initialize networking by default, making MPI_Open_port() fail in trivial jobs. I can try the workaround tomorrow:

export MPICH_SINGLE_HOST_ENABLED=0

srun --network=single_node_vni,job_vni ...

@franzpoeschel
Contributor Author

The workaround specified by HPE (see my post above) does not help; with this configuration, the job does not even start:

 srun: error: Unable to create step for job 1316301: Error configuring interconnect

However, the hint from the email that "by default, the launcher and MPI will not do anything to set up networking when a job-step will only run on a single host" made me try a two-node job for SST-MPI streaming where both sub-jobs ran on both nodes. This makes MPI_Open_port() work successfully. So I guess we're now in the weird situation where multi-node jobs work, but single-node jobs don't.
(Note that the check_c_source_runs line that configures ADIOS2_SST_HAVE_MPI inside cmake/DetectOptions.cmake needs to be patched for SST-MPI to even be compiled.)

Unfortunately, streaming still does not work. The reader crashes as soon as it tries to read any data:

Reader (rank 0) requesting to read remote memory for TimeStep 0 from Rank 0, StreamWPR =0x1940950, Offset=0, Length=224
ReadRemoteMemory: Send to server, Link.CohortSize=16
Waiting for completion of memory read to rank 0, condition 4,timestep=0, is_local=0
MpiReadReplyHandler: Read recv from rank=0,condition=4,size=224
MPICH ERROR [Rank 0] [job id 1316302.0] [Tue May  9 09:09:48 2023] [frontier08135] - Abort(607203983) (rank 0 in comm 16): Fatal error in PMPI_Comm_connect: Other MPI error, error stack:
PMPI_Comm_connect(125).........: MPI_Comm_connect(port="tag#0$connentry#01425824$", MPI_INFO_NULL, root=0, MPI_COMM_SELF, newcomm=0x7ff79cbfeab8) failed
MPID_Comm_connect(202).........:
MPIDI_OFI_mpi_comm_connect(654):
dynproc_exchange_map(558)......:
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)

Since this is a closed-source custom implementation, I can't investigate the error any further, but we'll need to get back to the OLCF support...

@franzpoeschel
Contributor Author

I have sent a report.

@vicentebolea
Collaborator

vicentebolea commented May 9, 2023 via email

@franzpoeschel
Contributor Author

franzpoeschel commented Jun 21, 2023

The MPI transport seems to be working again, tested by using the SST hello world examples:

$ salloc -N4 -n4 -c1 --ntasks-per-node=1 -A <project id> -t 10:00 --network=single_node_vni,job_vni
$ srun -n 2 -N 2 --network=single_node_vni,job_vni bin/hello_sstWriter_mpi > writer.out 2>&1 & 
$ srun -n 2 -N 2 --network=single_node_vni,job_vni bin/hello_sstReader_mpi > reader.out 2>&1 &
$ wait

I needed to use at least 2 MPI ranks per job since the network does not get properly initialized otherwise, and I used 4 different nodes since otherwise running asynchronous jobs turns into a hell of Slurm workarounds.

Writer output:

Sst set to use sockets as a Control Transport
Sst set to use sockets as a Control Transport
RDMA Dataplane could not find an RDMA-compatible fabric.
RDMA Dataplane evaluating viability, returning priority -1
Prefered dataplane name is "mpi"
Considering DataPlane "evpath" for possible use, priority is 1
Considering DataPlane "rdma" for possible use, priority is -1
Considering DataPlane "mpi" for possible use, priority is 100
Selecting DataPlane "mpi" (preferred) for use
RDMA Dataplane unloading
MpiInitWriter initialized addr=0x44c990
RDMA Dataplane could not find an RDMA-compatible fabric.
RDMA Dataplane evaluating viability, returning priority -1
RDMA Dataplane unloading
MpiInitWriter initialized addr=0x44c8b0
Stream "helloSst" waiting for 1 readers
Opening Stream "helloSst"
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP5
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Stream "helloSst" waiting for 1 readers
Beginning writer-side reader open protocol
MPI dataplane WriterPerReader to be initialized
Beginning writer-side reader open protocol
My oldest timestep was 0, global oldest timestep was 0
MPI dataplane WriterPerReader to be initialized
Finish writer-side reader open protocol for reader 0x44d4a0, reader ready response pending
(PID 116bc, TID 7fffd3dfcf80) Waiting for Reader ready on WSR 0x44d4a0.
My oldest timestep was 0, global oldest timestep was 0
Finish writer-side reader open protocol for reader 0x44d3c0, reader ready response pending
Reader Activate message received for Stream 0x44d4a0.  Setting state to Established.
Parent stream reader count is now 1.
Reader ready on WSR 0x44d4a0, Stream established, Starting 0 LastProvided 0.
Finish opening Stream "helloSst"
Finish opening Stream "helloSst"
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Sent timestep 0 to reader cohort 0
ADDING timestep 0 to sent list for reader cohort 0, READER 0x44d3c0, reference count is now 2
SubRef : Writer-side Timestep 0 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
SstWriterClose, Sending Close at Timestep 0, one to each reader
Working on reader cohort 0
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Sent timestep 0 to reader cohort 0
ADDING timestep 0 to sent list for reader cohort 0, READER 0x44d4a0, reference count is now 2
Sending a message to reader 0 (0x3fce30)
SubRef : Writer-side Timestep 0 now has reference count 1, expired 0, precious 0
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
SstWriterClose, Sending Close at Timestep 0, one to each reader
Working on reader cohort 0
Sending a message to reader 0 (0x3fce30)
Reader 0 status Established has last released 4294967295, last sent 0
QueueMaintenance, smallest last released = -1, count = 1
Removing dead entries
QueueMaintenance complete
MpiReadRequestHandler:read request from reader=1,ts=0,off=0,len=40
MpiReadRequestHandler: Replying reader=1 with MPI port name=tag#0$connentry#0024E324$
Registering writer close handler for peer 1, CONNECTION 0x7ff79c000bf0
MpiReadRequestHandler:read request from reader=0,ts=0,off=0,len=40
MpiReadRequestHandler: Replying reader=0 with MPI port name=tag#0$connentry#0024E224$
MpiReadRequestHandler: Accepted client, Link.CohortSize=2
MpiReadRequestHandler: Accepted client, Link.CohortSize=2
Waiting for timesteps to be released in WriterClose
IN TS WAIT, ENTRIES are Timestep 0 (exp 0, Prec 0, Ref 1), Count now 1
The timesteps still queued are: 0 
Reader Count is 1
Reader [0] status is Established
Received a release timestep message for timestep 0 from reader cohort 0
Got the lock in release timestep
Doing dereference sent
Reader sent timestep list 0x456490, trying to release 0
Reader considering sent timestep 0,trying to release 0
SubRef : Writer-side Timestep 0 now has reference count 0, expired 0, precious 0
Doing QueueMaint
Reader 0 status Established has last released 0, last sent 0
QueueMaintenance, smallest last released = 0, count = 1
Writer tagging timestep 0 as expired
Releasing timestep 0
Removing dead entries
Remove queue Entries removing Timestep 0 (exp 1, Prec 0, Ref 0), Count now 0
Release List, TS 0
Updating reader 0 last released to 0
Release List, and set ref count of timestep 0
Reader 0 status Established has last released 0, last sent 0
QueueMaintenance, smallest last released = 0, count = 1
Writer tagging timestep 0 as expired
Releasing timestep 0
Removing dead entries
Remove queue Entries removing Timestep 0 (exp 1, Prec 0, Ref 0), Count now 0
QueueMaintenance complete
All timesteps are released in WriterClose
QueueMaintenance complete
Releasing the lock in release timestep
Destroying stream 0x408950, name helloSst

Stream "helloSst" (0x408a10) summary info:
Reference count now zero, Destroying process SST info cache
	Duration (secs) = 0.100517
	Timesteps Created = 1
	Timesteps Delivered = 1

All timesteps are released in WriterClose
Freeing LastCallList
SstStreamDestroy successful, returning
Reader Close message received for stream 0x44d4a0.  Setting state to PeerClosed and releasing timesteps.
In PeerFailCloseWSReader, releasing sent timesteps
Dereferencing all timesteps sent to reader 0x44d4a0
DONE DEREFERENCING
Moving Reader stream 0x44d4a0 to status PeerClosed
Reader 0 status PeerClosed has last released 0, last sent 0
QueueMaintenance, smallest last released = LONG_MAX, count = 0
Removing dead entries
QueueMaintenance complete
Writer-side Rank received a connection-close event after close, not unexpected
Reader 0 status PeerClosed has last released 0, last sent 0
QueueMaintenance, smallest last released = LONG_MAX, count = 0
Removing dead entries
QueueMaintenance complete
Writer-side Rank received a connection-close event after close, not unexpected
Reader 0 status PeerClosed has last released 0, last sent 0
QueueMaintenance, smallest last released = LONG_MAX, count = 0
Removing dead entries
QueueMaintenance complete
Destroying stream 0x408a10, name helloSst
Reference count now zero, Destroying process SST info cache
Freeing LastCallList
SstStreamDestroy successful, returning

Reader output:

Sst set to use sockets as a Control Transport
Sst set to use sockets as a Control Transport
Looking for writer contact in file helloSst.sst, with timeout 60 secs
ADIOS2 SST Engine waiting for contact information file helloSst to be created
Waiting for writer DPResponse message in SstReadOpen("helloSst")
finished wait writer DPresponse message in read_open, WRITER is using "mpi" DataPlane
RDMA Dataplane could not find an RDMA-compatible fabric.
RDMA Dataplane evaluating viability, returning priority -1
RDMA Dataplane unloading
RDMA Dataplane could not find an RDMA-compatible fabric.
RDMA Dataplane evaluating viability, returning priority -1
Prefered dataplane name is "mpi"
Considering DataPlane "evpath" for possible use, priority is 1
Considering DataPlane "rdma" for possible use, priority is -1
Considering DataPlane "mpi" for possible use, priority is 100
Selecting DataPlane "mpi" (preferred) for use
RDMA Dataplane unloading
MPI dataplane reader initialized, reader rank 1
MPI dataplane reader initialized, reader rank 0
Sending Reader Activate messages to writer
Finish opening Stream "helloSst", starting with Step number 0
Incoming variable is of size 20
Reader rank 1 reading 10 floats starting at element 10
Waiting for writer response message in SstReadOpen("helloSst")
SstAdvanceStep returning Success on timestep 0
finished wait writer response message in read_open
Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Reader (rank 1) requesting to read remote memory for TimeStep 0 from Rank 1, StreamWPR =0x44d8d0, Offset=0, Length=40
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsAllToAll
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Registering reader close handler for peer 1 CONNECTION 0x3fb880
ReadRemoteMemory: Send to server, Link.CohortSize=2
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP5
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=mpi
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Waiting for completion of memory read to rank 1, condition 1,timestep=0, is_local=0
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer is using Minimum Connection Communication pattern (min)
MpiReadReplyHandler: Read recv from rank=1,condition=1,size=40
Incoming variable is of size 20
Reader rank 0 reading 10 floats starting at element 0
Sending Reader Activate messages to writer
MpiReadReplyHandler: Connecting to MPI Server
Memory read to rank 1 with condition 1 andlength 40 has completed
Finish opening Stream "helloSst", starting with Step number 0
Sending ReleaseTimestep message for timestep 0, one to each writer
Wait for next metadata after last timestep -1
Waiting for metadata for a Timestep later than TS -1
(PID 12fab, TID 7fffd3dfcf80) Stream status is Established
Received a Timestep metadata message for timestep 0, signaling condition
Received a writer close message. Timestep 0 was the final timestep.
Examining metadata for Timestep 0
Returning metadata for Timestep 0
Setting TSmsg to Rootentry value
SstAdvanceStep returning Success on timestep 0
Reader (rank 0) requesting to read remote memory for TimeStep 0 from Rank 0, StreamWPR =0x4527d0, Offset=0, Length=40
ReadRemoteMemory: Send to server, Link.CohortSize=2
Waiting for completion of memory read to rank 0, condition 4,timestep=0, is_local=0
MpiReadReplyHandler: Read recv from rank=0,condition=4,size=40
MpiReadReplyHandler: Connecting to MPI Server
Memory read to rank 0 with condition 4 andlength 40 has completed
Sending ReleaseTimestep message for timestep 0, one to each writer

Stream "helloSst" (0x3fce30) summary info:
	Duration (secs) = 0.001724
	Timestep Metadata Received = 1
	Timesteps Consumed = 1
	MetadataBytesReceived = 176 (176 bytes)
	DataBytesReceived = 80 (80 bytes)
	PreloadBytesReceived = 0 (0 bytes)
	PreloadTimestepsReceived = 0
	AverageReadRankFanIn = 1.0

Reader-side close handler invoked
Reader-side Rank received a connection-close event during normal operations, but might be part of shutdown  Don't change stream status.
The close was for connection to writer peer 1, notifying DP
received notification that writer peer 1 has failed, failing any pending requests
Destroying stream 0x3fcd70, name helloSst
Destroying stream 0x3fce30, name helloSst
Reference count now zero, Destroying process SST info cache
Reference count now zero, Destroying process SST info cache
Freeing LastCallList
SstStreamDestroy successful, returning
Read vector: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
Freeing LastCallList
SstStreamDestroy successful, returning
Read vector: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 

The point that at least 2 ranks per job step are needed can be shown with a minimal example:

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

#if !defined(MPICH)
#error "MPICH is the only supported library"
#endif

int main()
{
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
    printf("MPICH %d.%d\n", MPI_VERSION, MPI_SUBVERSION);
    MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
    MPI_Finalize();
}
$ salloc -N2 -n2 -c1 --ntasks-per-node=2 -ACSC380 -t 2:00:00 --network=single_node_vni,job_vni

$ srun -n 1 ./mpi_minimal
srun: warning: can't run 1 processes on 2 nodes, setting nnodes to 1
Assertion failed in file ../src/mpid/ch4/netmod/ofi/ofi_spawn.c at line 753: 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x7fffed4079ab]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1fedbf4) [0x7fffece41bf4]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x22a40d8) [0x7fffed0f80d8]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x2027ef9) [0x7fffece7bef9]
/opt/cray/pe/lib64/libmpi_cray.so.12(MPI_Open_port+0x269) [0x7fffec9eb839]
/autofs/nccs-svm1_home1/fpoeschel/mpi-connect/build/./mpi_minimal() [0x201b6a]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fffe9d3a29d]
/autofs/nccs-svm1_home1/fpoeschel/mpi-connect/build/./mpi_minimal() [0x201a8a]
MPICH 3.1
MPICH ERROR [Rank 0] [job id 1356893.0] [Wed Jun 21 08:42:45 2023] [frontier10243] - Abort(1): Internal error

srun: error: frontier10243: task 0: Exited with exit code 1
srun: Terminating StepId=1356893.0

$ srun -n 2 ./mpi_minimal
MPICH 3.1
MPICH 3.1

@vicentebolea
Collaborator

@franzpoeschel many thanks for notifying me and letting me know the limitations of the workaround. Yesterday, after my return from holidays, I was able to run the MPI DP on Crusher. I will be working on fixing this scalability issue.

@vicentebolea
Collaborator

@franzpoeschel I have noticed that if you try to run two srun jobs without salloc, MPI_Comm_connect fails. This means that any application running ADIOS2 will have to run inside an salloc allocation first, or be contained in a single srun invocation.
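
A hedged sketch of the resulting launch pattern, mirroring the hello_sst commands earlier in this thread (binary names, node counts, and the --network flags are taken from that example and may need adjusting):

$ salloc -N4 -n4 --network=single_node_vni,job_vni    # one allocation shared by both job steps
$ srun -n 2 -N 2 --network=single_node_vni,job_vni ./sst_writer > writer.out 2>&1 &
$ srun -n 2 -N 2 --network=single_node_vni,job_vni ./sst_reader > reader.out 2>&1 &
$ wait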
