MPI backend of SST: scaling issue on ORNL Crusher #3439
Comments
Hi Franz. I'll look into the MPI issue, but it may have to wait for @vicentebolea, who's taking some vacation this month. However, I can weigh in on the UCX issues. Yes, the cray-ucx module is broken. The PR I just merged, #3437, lets you build with UCX on Crusher despite that, by adding "-DPC_UCX_FOUND=IGNORE -DUCX_DIR=/opt/cray/pe/cray-ucx/2.7.0-1/ucx/" to your CMake line. Unfortunately, even after that the UCX data plane doesn't seem to work there, despite building and linking properly. It looks like SST UCX needs at least version 1.9.0 (earlier versions don't have the required ucp request interface and so don't compile), but maybe it has to be newer still. I've found that it works on a cluster with UCX 1.11. Regardless, I'll put in an OLCF ticket for the UCX pkgconfig problem in the hope that it might be better on Frontier, if not on Crusher. |
Ah, then I'll try the UCX backend once more with the workaround, but I suppose that it won't work for me either. I also did not get UCX to run on my local machine, so it seems to be one of those backends that only likes some systems. It would be interesting to hear from Vicente once he's back from vacation whether he has been able to scale SST to a greater portion of the system yet, and whether there are any tricks. Otherwise, I fear that this needs to go to ORNL support? |
@sameehj Are you aware of a specific minimum version that might be required for the UCX data plane? We've inferred that it needs to be 1.9.0 or better simply because prior versions don't have the request API and won't compile. However, we're trying the dataplane on Crusher which has 1.9.0 (poorly installed, but extant) and we're not seeing completions for RDMA read operations. SstVerbose=5 output for a simple test is here: |
Hi @franzpoeschel, I was able to scale to the 100s of nodes with the MPI Dataplane. I am not sure what the issue with your setup could be. One thing that I notice is that the |
Concerning UCX on Crusher, Greg was apparently able to get it to run, but he had to explicitly unload the cray-mpich module. What I noticed about the MPI data plane was that data transfer seemed surprisingly slow on Crusher. I only loaded very low amounts of data in the reader, but loading data still sometimes took more than 4 seconds, while the same setup with UCX on other systems stays below half a second. Maybe there is something wrong with the MPI environment that I use? |
How many writers and readers were there in your setup per node? I can try specifying queuelimit=0 and see if this changes things. |
@franzpoeschel I'm not sure we've got performance numbers comparing the MPI dataplane to a working RDMA dataplane on an HPC machine. I'll try to see if I can run some things on Crusher so we can evaluate... |
@eisenhauer, sorry, just saw this. No, I don't know of a specific minimum version being required, but I have mainly tested with 1.11. Can you share the logs with UCX_LOG_LEVEL=data, please? |
Hi Sameeh. So in the intervening time we have managed to get the UCX dataplane working on Crusher. We had to explicitly unload the cray-mpich module and then load cray-mpich-ucx. I have not yet done real performance runs, but at least things work (and this is with UCX 1.9.0). |
I think I might have found the issue. Looking into the SstVerbose log again, I had missed the fact that the MPI data plane was not even found... |
Scale tests ran with around 100 writers and 100 readers. |
No luck, unfortunately. It seems that I first ran into this issue while the MPI data plane was loading correctly; I have now removed the
This means that each node hosts one writer and one reader? I now tested setting |
Note that
As for the persistence of the issue, can you make sure that both the client and the servers are using the MPI DP? |
There are two such ifdefs. After removing these ifdefs, the log says that both ends are using the MPI dataplane:
Reader side:
Full writer log: pic.err.txt
The exact same setup runs fine with the MPI dataplane at a lower node count. My suspicion is that this is not really an ADIOS2 issue, but rather an issue with the scalability of the MPI_Open_port/MPI_Comm_accept functionality on Crusher (a minimal sketch of this connection pattern follows below). |
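For reference, the functionality referred to above is MPI's dynamic process connection API. Below is a minimal, self-contained sketch of that pattern (an illustration only, not ADIOS2's actual SST code), with one server process opening a port and one client process connecting to it:
/* Illustration of the MPI_Open_port / MPI_Comm_accept / MPI_Comm_connect
 * handshake that connection-based data planes rely on. Run one instance
 * with the argument "server", then another with "client <port-string>". */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;
    if (argc > 1 && strcmp(argv[1], "server") == 0)
    {
        /* Server side: publish a port and wait for one client to connect. */
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("server port: %s\n", port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Comm_disconnect(&inter);
        MPI_Close_port(port);
    }
    else if (argc > 2 && strcmp(argv[1], "client") == 0)
    {
        /* Client side: connect to the port string passed on the command line. */
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Comm_disconnect(&inter);
    }
    MPI_Finalize();
    return 0;
}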
This is correct, I accidentally introduced this regression in #3407 |
The scaling issue seemingly depends not only on the number of MPI tasks, but also on the number of loaded chunks. The PIConGPU simulation that I use writes 32 ADIOS2 variables, with each rank writing one chunk. In my tests so far, the reader requested all chunks written by ranks on the same node (i.e. 32 variables * 8 writers on the same node). As I am only interested in particle data, I restricted the loading procedure to only 20 of the 32 variables, and now there is no hangup at 16 nodes. I will test whether this setup hangs at a higher node count. |
The new setup hangs at 128 nodes on Crusher, still running fine at 64 nodes. This might help me come up with an ADIOS2-only minimal example to reproduce the issue. |
Hello @sameehj, I have generated this data now. Streaming with SST-UCX works without issue within a single-node job. In order to run a multi-node job, specifying |
Hmm, rather odd. I took a look at your files. Interesting; I suspect two things:
I don't think your UCX logs are visible in the two-node case; we should be able to see the detailed UCX logs. Are you sure you specified UCX_LOG_LEVEL=data? Best regards, |
Thank you for the help, @sameehj
I guess that without these transports available, using UCX on that system is of no use? Or is there a transport among these that has any merit trying? I think that UCX has selected tcp so far. I checked whether I specified UCX_LOG_LEVEL=data in both setups; it seems that I did. It writes some output to the |
I think that I'm starting to narrow down the issue with the MPI transport. The problem is twofold. Say that n is the number of writers and m the number of readers:
Both issues cause hangups, the first at 16 nodes (n=16*8, m=16), the second at 128 nodes (n=128*8, m=128). I have now implemented workarounds that funnel the data through rank 0 (MPI_Gather -> rank 0 -> rank 0 -> MPI_Bcast); a sketch of this pattern follows below. I'll try to adapt my reproducer to this. |
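For illustration, here is a minimal sketch of that funnel pattern (this is not the actual PIConGPU/ADIOS2 workaround code; the function names and the inter-communicator inter between the writer and reader applications are assumptions, with inter obtained elsewhere, e.g. via the MPI_Comm_connect/MPI_Comm_accept handshake shown earlier). Only one message crosses the writer/reader boundary per step; all n->1 and 1->m traffic stays inside the respective intra-communicators:
#include <mpi.h>
#include <stdlib.h>
/* Writer side: n ranks each contribute one value; only writer rank 0
 * talks to the reader application. */
void writer_side(MPI_Comm writers, MPI_Comm inter, int my_value)
{
    int rank, size;
    MPI_Comm_rank(writers, &rank);
    MPI_Comm_size(writers, &size);
    int *all = (rank == 0) ? malloc(size * sizeof(int)) : NULL;
    MPI_Gather(&my_value, 1, MPI_INT, all, 1, MPI_INT, 0, writers); /* n -> 1 */
    if (rank == 0)
    {
        /* single point-to-point message to reader rank 0 across the
         * inter-communicator */
        MPI_Send(all, size, MPI_INT, 0, 0, inter);
        free(all);
    }
}
/* Reader side: rank 0 receives once, then fans the data out locally. */
void reader_side(MPI_Comm readers, MPI_Comm inter, int n_writers, int *out)
{
    int rank;
    MPI_Comm_rank(readers, &rank);
    if (rank == 0)
        MPI_Recv(out, n_writers, MPI_INT, 0, 0, inter, MPI_STATUS_IGNORE);
    MPI_Bcast(out, n_writers, MPI_INT, 0, readers); /* 1 -> m */
}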
I haven't tested the UCX dataplane with TCP/IP, only with an RDMA fabric. I'm not aware of these hangs; this needs some careful investigation and attention. It's nice to see that you have a workaround at the moment. Best of luck. |
This reproducer triggers both issues mentioned above on the Crusher system. With the MPI transport, n->1 and 1->m communication patterns are not scalable on the system. I don't know whether they should be, and neither do I know whether this points to other scaling issues that might come up at full Frontier scale. If I remember correctly, neither communication pattern was an issue with libfabric on Summit. The reproducer very closely resembles the IO patterns of PIConGPU. It uses (the metadata of) a BP4 dataset written by PIConGPU on 128 nodes as the basis for creating an SST stream. A second C++ code reads the stream, triggering both issues. The ZIP file contains:
|
We have received info from OLCF support that the UCX module on Crusher is not really supposed to be used. Since SST also has a direct TCP backend, I don't know whether it would really be worth trying to debug this further now. |
Franz, many thanks for providing example source code to replicate this
issue. I will look into it and get back to you.
Vicente
|
@franzpoeschel I think I might know what the issue could be; however, I am having trouble running your reproducer on Crusher, and not only your reproducer, but any sample application using MPI client/server routines. I wonder if you could confirm that your reproducer still works. |
I tried running SST+MPI on Frontier just now. I get an assertion error when trying to run this from inside
|
Yep, even the minimal example fails:
#include <mpi.h>
#include <stdlib.h>
#if !defined(MPICH)
#error "MPICH is the only supported library"
#endif
int main()
{
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
MPI_Finalize();
}
--->
> ./mpi_minimal
Assertion failed in file ../src/mpid/ch4/netmod/ofi/ofi_spawn.c at line 753: 0
/opt/cray/pe/lib64/libmpi_cray.so.12(MPL_backtrace_show+0x26) [0x7fca4c6499ab]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x1fedbf4) [0x7fca4c083bf4]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x22a40d8) [0x7fca4c33a0d8]
/opt/cray/pe/lib64/libmpi_cray.so.12(+0x2027ef9) [0x7fca4c0bdef9]
/opt/cray/pe/lib64/libmpi_cray.so.12(MPI_Open_port+0x269) [0x7fca4bc2d839]
./mpi_minimal() [0x2018c8]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7fca4965129d]
./mpi_minimal() [0x2017ea]
MPICH ERROR [Rank 0] [job id ] [Thu Apr 20 09:03:20 2023] [login03] - Abort(1): Internal error
|
I see, it's good to see that it is not only me. I have also tried a
previous version of the cray-mpich module, but it's still the same issue. I
will try loading older cray-mpich versions, but we definitely need to
report this. I wonder if we have the same issue on Frontier.
|
It was Frontier where I saw this. I want to try a few things tomorrow, and if those don't help, we definitely need to report it. |
I just sent a report; you are in CC. |
Got it. Thanks!
|
OLCF Support just responded; it seems that single-node jobs don't initialize networking by default, making
|
The workaround specified by HPE (see my post above) does not help; with this configuration, the job does not even start:
However, the hint from the email that "by default, the launcher and MPI will not do anything to set up networking when a job-step will only run on a single host" made me try a two-node job for SST-MPI streaming where both subjobs ran on both nodes. This makes
Unfortunately, streaming still does not work. The reader crashes as soon as it tries reading any data:
This being a closed-source custom implementation, I can't research the error further, but we'll need to get back to support... |
I have sent a report. |
Hi Franz,
many thanks for testing and reporting this new error. I will resume the
bugfixing after this is resolved or a workaround is found.
Vicente
|
The MPI transport seems to be working again, tested by using the SST hello world examples:
I needed to use at least 2 MPI ranks per job since the network does not get properly initialized otherwise, and I used 4 different nodes since otherwise running asynchronous jobs turns into a hell of Slurm workarounds. Writer output:
Reader output:
The point that at least 2 ranks per task are needed can be shown with a minimal example:
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#if !defined(MPICH)
#error "MPICH is the only supported library"
#endif
int main()
{
MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, NULL);
printf("MPICH %d.%d\n", MPI_VERSION, MPI_SUBVERSION);
MPI_Open_port(MPI_INFO_NULL, malloc(sizeof(char) * MPI_MAX_PORT_NAME));
MPI_Finalize();
}
|
@franzpoeschel many thanks for notifying me and letting me know the limitations of the workaround. Yesterday, after my return from holidays, I was able to run the MPI DP on Crusher. I will be working on fixing this scalability issue. |
@franzpoeschel I have noticed that if you try to run two srun jobs without salloc, MPI_Comm_connect fails. This means that any application running ADIOS2 will either have to run inside an salloc allocation or be contained in a single srun invocation. |
Describe the bug
I am trying to use the MPI backend of SST in order to couple a PIConGPU simulation with an asynchronous data sink (currently a synthetic application that just loads data and then throws it away). Both the simulation and the data sink are parallel MPI applications.
There are 8 instances of PIConGPU running on each node (one per GPU) and additionally one CPU-only instance per node of the data sink. The data sink loads data from the simulation instances running on the same node, in order to ensure a scalable communication pattern.
I use weak scaling to scale the setup; it runs without trouble on 1, 8, and 12 nodes, with no obvious performance difference. On 16 nodes, the setup hangs and is eventually killed at the job's time limit. The last message printed by SstVerbose on the reading end is MpiReadReplyHandler: Connecting to MPI Server. The writing end seems to be blocking on the QueueFullPolicy=Block condition. The reader receives all metadata without issue, so the trouble seems to be in the data plane.
I attach the stderr logs of both sides (quite big files, unfortunately, since they contain output from all parallel instances); they include the SstVerbose log:
reader.err.txt
writer.err.txt
To me, this sounds more like an issue with the Cray MPI implementation only supporting a certain number of open MPI ports, and not like an ADIOS2 problem? Maybe it's also related to the MPI_Finalize problem described here: https://github.com/ornladios/ADIOS2/blob/master/docs/user_guide/source/advanced/ecp_hardware.rst
Are similar issues known? Have there been scaling tests on Crusher and have they been successful? Is there some parameter that I need to set?
To Reproduce
Complex setup, hard to reproduce. I don't know if a synthetic setup would show the same issue.
The engine parameters are
on the writing end:
on the reading end:
Expected behavior
Continued scaling as from 1 to 12 nodes.
Desktop (please complete the following information):
ADIOS2 git tag 1428da5 (current master)
Additional context
I was going to try the new UCX backend as an alternative, but the cray-ucx module on Crusher has apparently not been installed correctly (build paths leak into the pkgconfig), so I need to wait for system support to fix that. I assume that UCX-based SST has not been tried on the system yet?
Following up