Fix MPI Data plane cohort handling #3588
Conversation
I assume not quite ready yet since you still have printf's in it, but generally looks good.
Hey Vicente, this is great to hear!
Yes, please, if you can test this out that will be great!
How scalable is this expected to be? Or SST in general? I have a test where I start from an XGC restart file, to be sent to a reader. Each writing process reads one block of data from the BP file. By block, I mean an ADIOS block, i.e. what one of the original XGC writer processes dispatched. To mimic the way XGC runs, there are 8 senders per node, and for now I was just doing one reader, but I can turn that up too. The run worked at 64 sender nodes and took about 7 seconds for a single reader to receive the data. (I get around needing multiple readers by using MPMD.) But when I double to 128 sender nodes, the reader hangs. That's about 176 GB of data over 1024 blocks. Increasing the number of readers doesn't prevent the hang. I've tried adjusting some of the engine parameters too, but they don't seem to have any impact on the hang.
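For context, a minimal sketch of that kind of test, with all file, variable, and stream names being assumptions on my part: each writing rank pulls one ADIOS block out of the restart BP file and re-publishes it over SST with the MPI data plane.

```cpp
// Sketch only: every writing rank reads one ADIOS block from an existing BP
// file and ships it to a staging reader over SST.  Names are illustrative.
#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    adios2::ADIOS adios(MPI_COMM_WORLD);

    // Read one block per rank from the restart file (hypothetical names).
    adios2::IO inIO = adios.DeclareIO("RestartIn");
    adios2::Engine bpReader = inIO.Open("xgc.restart.bp", adios2::Mode::ReadRandomAccess);
    auto var = inIO.InquireVariable<double>("i_phase");    // assumed variable name
    var.SetBlockSelection(static_cast<std::size_t>(rank)); // block == one original writer's dispatch
    std::vector<double> block;
    bpReader.Get(var, block, adios2::Mode::Sync);
    bpReader.Close();

    // Re-send that block to the staging reader over SST with the MPI data plane.
    adios2::IO outIO = adios.DeclareIO("StageOut");
    outIO.SetEngine("SST");
    outIO.SetParameters({{"DataTransport", "MPI"}});
    adios2::Engine sstWriter = outIO.Open("xgc_stage", adios2::Mode::Write);
    auto outVar = outIO.DefineVariable<double>("i_phase", {}, {}, {block.size()}); // local array
    sstWriter.BeginStep();
    sstWriter.Put(outVar, block.data());
    sstWriter.EndStep();
    sstWriter.Close();

    MPI_Finalize();
}
```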
In general, the expectation is "highly scalable". As always, the answer in practice depends on a lot of things: what is scalable on one machine may not be on another, resource availability can be an issue at all levels, etc. A hang may mean a bug like a race condition, it may mean that an operation failed silently, leaving SST waiting for a result that'll never come, or pretty much anything else. If possible, the first step is to turn on verbosity to figure out what is hanging. I can't speak to what level of verbosity is in the MPI data plane, but setting the environment variable "SstVerbose" to a numerical value between 1 and 5 is a place to start; 5 is the highest verbosity.
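A small sketch of one way to turn that on from the reading application, assuming the adios2 C++ API; exporting SstVerbose in the job script works just as well, and the IO and stream names here are made up.

```cpp
// Sketch: SST picks SstVerbose up from the environment, so setting it before
// the engine is opened is enough to get the diagnostic output discussed above.
#include <adios2.h>
#include <mpi.h>
#include <cstdlib>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // 1 is terse, 5 is the most verbose.
    setenv("SstVerbose", "5", /*overwrite=*/1);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("ReaderIO");
    io.SetEngine("SST");
    adios2::Engine reader = io.Open("xgc_stage", adios2::Mode::Read); // hypothetical stream name
    // ... BeginStep / Get / EndStep loop as usual ...
    reader.Close();

    MPI_Finalize();
}
```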
I looked through the code a little, and did a rerun with some additional verbosity settings enabled.
Thanks. Those are important for the WAN transport and for the control plane. They probably don't play a role in whatever problem is happening in the MPI data plane, but giving Vicente more output doesn't hurt.
I reverted the workarounds that I used so far and ran an example on Frontier at 128 nodes. EDIT: I checked whether I can actually reproduce the original issue when using the ADIOS2 master branch instead of this branch. It seems that I can't; I will need to have a closer look. Maybe my setup was not weird enough to trigger the issue.
@suchyta1 thanks for reporting issues with the MPI DP. I will take a look at it. Quick question: are you using the tip commit of this PR? @franzpoeschel I am happy that this fixes your reproducer app.
Yes, that's what I was using. |
I added some extra debugs, and as far as I can tell, there's a lock operation that is never returned from. |
Yes, that tracks. The interesting part here is the bit from the reader-side data plane on rank 0 (the "DP Reader 0" line).
From the first line above, we know we're in the MPI data plane, in MpiWaitForCompletion(), and it has fallen into the CMCondition_wait() at line 571. That we never fall out of this means that the corresponding CMCondition_signal() (at line 708, at the end of the MPIReadReplyHandler() subroutine) never happens. So, what's going on? I don't know, but it's certainly something at the data plane level. Just what isn't clear, and it's probably going to require the attention of the author, @vicentebolea.
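For readers following along, here is a simplified sketch of that wait/signal handshake. Apart from CMCondition_wait()/CMCondition_signal(), which are named above, every structure and function name below is an assumption rather than the actual mpi_dp source.

```cpp
// Sketch of the handshake: the reader blocks on an EVPath CM condition until
// the writer's read-reply handler signals it.  If that reply never arrives
// (writer rank dead, message lost), the wait hangs exactly as seen in the logs.
extern "C" {
#include <evpath.h> // CManager, CMCondition_wait, CMCondition_signal
}

struct PendingRead
{
    CManager cm;
    int CMcondition; // one condition per outstanding read request
};

// Reader side: the MpiWaitForCompletion() role in the description above.
static int WaitForReadReply(PendingRead *req)
{
    // Blocks until some handler signals req->CMcondition; never returns if
    // the corresponding signal never happens.
    return CMCondition_wait(req->cm, req->CMcondition);
}

// Reader side: invoked when the writer's reply message arrives
// (the MPIReadReplyHandler() role in the description above).
static void OnReadReply(PendingRead *req /*, incoming data ... */)
{
    // ... copy the payload into the destination buffer ...
    CMCondition_signal(req->cm, req->CMcondition); // wakes WaitForReadReply()
}
```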
@eisenhauer it appears that it is waiting for rank=652 (writer) to reply to its message; however, rank=652 has long been either dead or unresponsive. A few questions: I have noticed that this rank has the same ID as rank=748; is this odd? See below:
Similarly, it seems that this setup of a single reader and 800 writers might result in 800 simultaneous MPI connections on the reader side, maybe hitting some limit of the MPI client/server implementation. I am working on a reproducer to further debug this. I will also investigate how to add such a test to the CI.
The ID printed after the rank is the base address of the writer-side stream structure. If an application has more than one output stream open, this differentiates them in the output. But in your case it just means that two ranks happened to allocate the same base address for the stream structure. They're different processes, so this shouldn't matter. (They'd probably be the same more often if most OSes didn't use address-space randomization as a security measure.) Anyhow, no help in this situation, I'm afraid...
Using this, I now had a full-scale run on Frontier from |
@eisenhauer do you think that we should merge this now, rather than wait for the workaround of @franzpoeschel to be included (in a more generic bugfix form) in master?
I would go ahead and merge. @franzpoeschel's workaround addresses an orthogonal issue that just happens to also come up when we scale up.
Fix MPI Data plane cohort handling (cherry picked from commit c77df61)
Merge pull request #3588 from vicentebolea/fix-mpi-dp
* release_29: (29 commits)
* master: (126 commits)
* master: (later sync)
We have been observing scalability issues with the MPI DP. I have been able to narrow them down to the potential root causes below and propose possible solutions in this PR.
Task ID collision between different ranks at scale. This was expected to happen, since we used the process ID (PID) as the rank ID (not a good idea). It has been replaced by an aggregate ID built from the process PID and the machine host ID.
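A rough sketch of that aggregate-ID idea, with purely illustrative names (this is not the actual mpi_dp code):

```cpp
// Sketch: derive a per-process task ID from both the PID and the machine's
// host ID, so two ranks on different nodes that happen to share a PID no
// longer collide.
#include <unistd.h> // getpid, gethostid
#include <cstdint>

static uint64_t MakeTaskID()
{
    const uint64_t pid = static_cast<uint64_t>(getpid());
    const uint64_t host = static_cast<uint64_t>(gethostid());
    // Pack host identity into the high bits and the PID into the low bits.
    return (host << 32) ^ pid;
}
```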
Currently, in the event of a reader rank failure, the MPI data plane writer simply frees and disconnects every resource pertaining to its cohort. Since at large scale the failure of reader nodes (or timeout events) is not that rare, in this PR I make the writer skip releasing the resources pertaining to the failed reader unless all the peers of the cohort have exited. To keep MPI happy, I still perform a MPI_Port_disconnect, since otherwise the "failing" node might get stuck (in the case of a non-critical error). A rough sketch of this cohort handling follows the issue list below.
Possibly fixes: #3456
Possibly fixes: #3439
Possibly fixes: #3378
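The sketch below illustrates the cohort-handling idea under stated assumptions: all type and function names are hypothetical, and MPI_Comm_disconnect stands in for whatever disconnect call the data plane actually performs.

```cpp
// Sketch: resources shared by a reader cohort are freed only once every peer
// in that cohort has gone away; the per-peer communicator is still
// disconnected immediately so MPI stays consistent.
#include <mpi.h>
#include <cstddef>
#include <vector>

struct CohortPeer
{
    MPI_Comm comm = MPI_COMM_NULL;
    bool exited = false;
};

struct ReaderCohort
{
    std::vector<CohortPeer> peers;
    // ... bookkeeping owned by the whole cohort ...
};

static void HandleReaderExit(ReaderCohort &cohort, std::size_t peerIdx)
{
    CohortPeer &peer = cohort.peers[peerIdx];
    if (peer.comm != MPI_COMM_NULL)
    {
        MPI_Comm_disconnect(&peer.comm); // assumed equivalent of the disconnect mentioned above
        peer.exited = true;
    }

    // Only when the *last* peer of the cohort has exited do we release the
    // cohort-wide resources; earlier failures or timeouts leave them intact.
    for (const CohortPeer &p : cohort.peers)
    {
        if (!p.exited)
        {
            return;
        }
    }
    // ReleaseCohortResources(cohort); // hypothetical cleanup hook
}
```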
@franzpoeschel I was able to run the given reproducer on Crusher without getting any errors with these changes.