Fix MPI Data plane cohort handling #3588

Merged
vicentebolea merged 2 commits into ornladios:master from vicentebolea:fix-mpi-dp on Oct 31, 2023

Conversation

vicentebolea
Collaborator

@vicentebolea vicentebolea commented Apr 14, 2023

We have been observing scalability issues with the MPI DP. I was able to narrow them down to potential root causes and to propose possible solutions in this PR:

  • Task ID collisions between different ranks at scale. This was bound to happen since we used the process ID (PID) as the rank ID (not a good idea). It has been replaced by an aggregate ID built from the process PID and the machine hostid.

  • Currently, in the event of a reader rank failure, the MPI data plane writer simply frees and disconnects every resource pertaining to that reader's cohort. Since at large scale reader node failures (or timeout events) are not that rare, in this PR I make the writer skip releasing the resources pertaining to that reader unless all the peers of the cohort have exited. To keep MPI happy, I still perform an MPI_Comm_disconnect, since otherwise the "failing" node might get stuck (in the case of a non-critical error). A schematic sketch of both changes follows below.
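
For illustration only, here is a minimal sketch of the two ideas above; the names MakeTaskId, ReaderCohort, OnReaderPeerFailure, and ReleaseCohortResources are hypothetical, and this is not the PR's actual code:

#include <mpi.h>
#include <unistd.h> // gethostid(), getpid()
#include <cstdint>
#include <vector>

// Task ID that is unique across nodes: a collision now requires the same
// machine hostid *and* the same PID, instead of the same PID alone.
static uint64_t MakeTaskId()
{
    const uint64_t hostid = static_cast<uint64_t>(gethostid());
    const uint64_t pid = static_cast<uint64_t>(getpid());
    return (hostid << 32) ^ pid;
}

// Writer-side policy on a reader peer failure: disconnect that peer so MPI
// can make progress, but defer freeing the cohort-wide resources until every
// peer of the cohort has exited.
struct ReaderCohort
{
    std::vector<MPI_Comm> peers;
    size_t exited = 0;
};

static void ReleaseCohortResources(ReaderCohort &) { /* free buffers, comms, ... */ }

static void OnReaderPeerFailure(ReaderCohort &cohort, size_t peer)
{
    MPI_Comm_disconnect(&cohort.peers[peer]); // keep MPI happy on non-critical errors
    if (++cohort.exited == cohort.peers.size())
    {
        ReleaseCohortResources(cohort); // only now release everything for this cohort
    }
}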

Possibly fixes: #3456
Possibly fixes: #3439
Possibly fixes: #3378

@franzpoeschel I was able to run the given reproducer on Crusher without getting any error with these changes.

@vicentebolea vicentebolea changed the title from wipwipwip to Fix MPI Data plane cohort handling on Apr 14, 2023
@vicentebolea vicentebolea force-pushed the fix-mpi-dp branch 2 times, most recently from 65fcd6b to 6942bca on April 17, 2023 20:40
@vicentebolea vicentebolea added this to the v2.9.1 milestone May 12, 2023
@vicentebolea vicentebolea marked this pull request as ready for review June 19, 2023 16:08
@vicentebolea vicentebolea modified the milestones: v2.9.1, v2.9.2 Jul 7, 2023
@vicentebolea vicentebolea force-pushed the fix-mpi-dp branch 4 times, most recently from b396764 to f575887 on August 29, 2023 01:44
Member

@eisenhauer eisenhauer left a comment


I assume not quite ready yet since you still have printf's in it, but generally looks good.

@franzpoeschel
Contributor

franzpoeschel commented Sep 5, 2023

Hey Vicente, this is great to hear!
Do I understand correctly that this is already in a usable state, so that I can test it out in the coming days using our actual simulation code?

@vicentebolea
Collaborator Author

Hey Vicente, this is great to hear! Do I understand correctly that this is already in a usable state, so that I can test it out in the coming days using our actual simulation code?

Yes, please, if you can test this out that would be great!

@suchyta1
Member

suchyta1 commented Sep 6, 2023

How scalable is this expected to be? Or SST in general?

I have a test where I start from an XGC restart file that is sent to a reader. Each writing process reads one block of data from the BP file; by block I mean an ADIOS block, where one block corresponds to one of the original XGC writer processes' writes. To mimic the way XGC runs, there are 8 senders per node, and for now I was using just one reader, but I can turn that up too. The run worked at 64 sender nodes and took about 7 seconds for a single reader to receive the data. (I get around needing multiple readers by using MPMD.) But when I double to 128 sender nodes, the reader hangs. That's about 176 GB of data over 1024 blocks. Increasing the number of readers doesn't prevent the hang. I've tried adjusting some of the engine parameters too, but they don't seem to have any impact on the hang.
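
For reference, a minimal sketch of that kind of per-rank block read (not the actual test code; the file name "xgc.restart.bp", the variable name "field", and the double type are assumptions):

#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO in = adios.DeclareIO("restart");

    // Each sender rank pulls exactly one ADIOS block (one original XGC
    // writer's piece) out of the restart file.
    adios2::Engine bp = in.Open("xgc.restart.bp", adios2::Mode::Read);
    bp.BeginStep();
    adios2::Variable<double> var = in.InquireVariable<double>("field");
    var.SetBlockSelection(static_cast<size_t>(rank));
    std::vector<double> block;
    bp.Get(var, block, adios2::Mode::Sync);
    bp.EndStep();
    bp.Close();

    // ... define an output variable of the block's shape and ship 'block'
    //     to the reader through an SST engine ...
    MPI_Finalize();
    return 0;
}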

@eisenhauer
Member

How scalable is this expected to be? Or SST in general?

In general, the expectation is "highly scalable". As always, the answer in practice depends upon a lot of things: what is scalable on one machine may not be on another, resource availability can be an issue at all levels, etc. A hang may mean a bug like a race condition, it may mean that an operation failed silently and left SST waiting for a result that will never come, or pretty much anything else. If possible, the first step is to turn on verbosity to figure out what is hanging. I can't speak to what level of verbosity is in the MPI data plane, but setting the environment variable "SstVerbose" to a numerical value between 1 and 5 is a place to start; 5 is the highest verbosity.
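
A minimal sketch, assuming an existing ADIOS2 + MPI writer: raise SST verbosity before the engine is opened (SstVerbose can equally be exported in the job script). The RendezvousReaderCount setting is only there so this toy example does not wait for a reader:

#include <adios2.h>
#include <mpi.h>
#include <cstdlib>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    setenv("SstVerbose", "5", 1); // 1 (terse) .. 5 (most verbose)

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("output");
    io.SetEngine("SST");
    io.SetParameters({{"DataTransport", "MPI"},         // select the MPI data plane
                      {"RendezvousReaderCount", "0"}}); // don't block waiting for readers

    adios2::Engine writer = io.Open("stream", adios2::Mode::Write);
    // ... BeginStep/Put/EndStep as usual; verbose logs appear on stderr ...
    writer.Close();
    MPI_Finalize();
    return 0;
}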

@suchyta1
Member

suchyta1 commented Sep 7, 2023

I looked through the code a little, and did a rerun with CMVerbose=1 and CMLowLevelVerbose=1, so the log attached now has more information.

slurm-1432345.out.log.zip

@eisenhauer
Member

I looked through the code a little, and did a rerun with CMVerbose=1 and CMLowLevelVerbose=1, so the log attached now has more information.

Thanks. Those are important for the WAN transport and for the control plane. They probably don't play a role in whatever problem is happening in the MPI data plane, but giving Vicente more output doesn't hurt.

@franzpoeschel
Contributor

franzpoeschel commented Sep 11, 2023

I reverted the workarounds that I used so far and ran an example on Frontier at 128 nodes.
With the workarounds rolled back, there is now one instance of N:1 and one instance of 1:M communication. It seems to work without trouble with this branch :)

EDIT: I tried to see whether I can actually reproduce the original issue when using the ADIOS2 master branch instead of this one. It seems that I can't; I will need to have a closer look. Maybe my setup was not weird enough to trigger the issue.

@vicentebolea
Collaborator Author

@suchyta1 thanks for reporting issues with the MPI DP. I will take a look at it. Quick question: are you using the tip commit of this PR?

@franzpoeschel I am happy that this fixes your reproducer app.

@suchyta1
Member

Yes, that's what I was using.

@suchyta1
Member

I added some extra debug output, and as far as I can tell, there's a lock operation that never returns.
slurm-1432539.out.zip

@eisenhauer
Member

Yes, that tracks. The interesting part here is the output from the reader-side data plane on rank 0 (the "DP Reader 0" lines).

DP Reader 0 (0x1004380): Waiting for completion of memory read to rank 652, condition 3855,timestep=0, is_local=0
Writev success
DP Reader 0 (0x1004380): Remote read; CMCondition_wait
Before unlockCManager Unlock at "/ccs/home/esuchyta/software/build/frontier/adios2-gcc12.2.0-test/thirdparty/EVPath/EVPath/cm_interface.c" line 632
After unlockBefore lockCManager Lock at "/ccs/home/esuchyta/software/build/frontier/adios2-gcc12.2.0-test/thirdparty/EVPath/EVPath/cm_interface.c" line 84

From the first line above, we know we're in the MPI data plane, in MpiWaitForCompletion(), and it has fallen into the CMCondition_wait() at line 571. The fact that we never fall out of it means that the corresponding CMCondition_signal() (at line 708, at the end of the MPIReadReplyHandler() subroutine) never happens. So, what's going on? I don't know, but it's certainly something at the data plane level. Just what isn't clear, and it's probably going to require the attention of the author, @vicentebolea.
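
Schematically, the wait/signal pairing described here looks like the sketch below (an illustration of the EVPath CMCondition pattern, not the actual MpiWaitForCompletion / MPIReadReplyHandler code; the function names are hypothetical):

#include "evpath.h"

// Requesting side: allocate a condition tied to the connection and block on it.
static int WaitForRemoteRead(CManager cm, CMConnection conn)
{
    int condition = CMCondition_get(cm, conn);
    // ... send the read request, carrying 'condition' so the peer echoes it back ...
    return CMCondition_wait(cm, condition); // never returns if the reply is lost
}

// Reply handler: runs when the writer's answer arrives and wakes the waiter.
static void ReadReplyHandler(CManager cm, int condition)
{
    // ... copy the returned data into the requester's buffer ...
    CMCondition_signal(cm, condition);
}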

@vicentebolea
Collaborator Author

@eisenhauer it appears that it is waiting for rank 652 (a writer) to reply to its message; however, rank 652 has long been either dead or unresponsive.

One question: I have noticed that this rank has the same ID as rank 748; is this odd? See below:

DP Writer 652 (0x102c3a0): MpiReadRequestHandler: Replying reader=0 with MPI port name=tag#0$connentry#0140AE23$
DP Writer 748 (0x102c3a0): MpiReadRequestHandler: Replying reader=0 with MPI port name=tag#0$connentry#01408A24$

Separately, it seems that this setup of a single reader and 800 writers might result in 800 simultaneous MPI connections on the reader side, possibly hitting some limit of the MPI client/server implementation. I am working on a reproducer to further debug this. I will also investigate how to add such a test to the CI.
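
For context, a minimal sketch of the MPI dynamic-process (client/server) pattern involved, where a writer opens a port and the reader connects to it; this only illustrates the mechanism and is not the MPI DP code itself:

#include <mpi.h>

// Writer side: open a port (MPICH port names look like "tag#0$connentry#...")
// and accept one connection from the reader.
static void WriterAcceptReader(MPI_Comm *reader)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);
    // ... hand 'port' to the reader through the SST control plane ...
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, reader);
}

// Reader side: with W writers the reader ends up holding W such
// inter-communicators at once, which is where an implementation limit on
// simultaneous client/server connections could plausibly bite.
static void ReaderConnectWriter(const char *port, MPI_Comm *writer)
{
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, writer);
    // ... transfer data over *writer, then MPI_Comm_disconnect(writer) ...
}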

@eisenhauer
Member

One question: I have noticed that this rank has the same ID as rank 748; is this odd? See below:

The ID printed after the rank is the base address of the writer-side stream structure. If an application has more than one output stream open, this differentiates them in the output. But in your case it just means that two ranks happened to allocate the same base address for the stream structure. They're different processes, so this shouldn't matter. (They'd probably match more often if most OSes didn't use memory randomization as a security measure.)

Anyhow, no help in this situation I'm afraid...

@franzpoeschel
Contributor

Using this, I have now done a full-scale run on Frontier from 9126 (nodes) x 8 (GPUs) writers to 9126 (nodes) readers. Data exchange happened only node-internally; 6 Gb were exchanged per step and node (not much, but I was focusing on scaling the application to a large number of writers/readers for now).
I have not (yet) done a full evaluation of performance, but by squinting at the numbers, I/O time looks stable.
The workaround described in #3846 was necessary to scale above 7000 nodes, but that has nothing to do with this PR.

@vicentebolea
Collaborator Author

@eisenhauer do you think that we should merge this now rather than wait for @franzpoeschel's workaround to be included (in a more generic bugfix form) in master?

@eisenhauer
Member

@eisenhauer do you think that we should merge this now rather than wait for @franzpoeschel's workaround to be included (in a more generic bugfix form) in master?

I would go ahead and merge. @franzpoeschel's workaround addresses an orthogonal issue that just happens to also come up when we scale up.

@vicentebolea vicentebolea modified the milestones: v2.9.2, v2.10.0 Oct 24, 2023
@vicentebolea vicentebolea merged commit c77df61 into ornladios:master Oct 31, 2023
30 checks passed
@vicentebolea vicentebolea modified the milestones: v2.10.0, v2.9.2 Oct 31, 2023
vicentebolea added a commit to vicentebolea/ADIOS2 that referenced this pull request Oct 31, 2023
Fix MPI Data plane cohort handling

(cherry picked from commit c77df61)
vicentebolea added a commit that referenced this pull request Nov 1, 2023
Merge pull request #3588 from vicentebolea/fix-mpi-dp
vicentebolea added a commit that referenced this pull request Nov 1, 2023
* release_29: (29 commits)
  Merge pull request #3588 from vicentebolea/fix-mpi-dp
  ...
pnorbert added a commit to pnorbert/ADIOS2 that referenced this pull request Nov 20, 2023
* master: (126 commits)
  Merge pull request ornladios#3588 from vicentebolea/fix-mpi-dp
  SST,MPI,DP: soft handle peer error
  SST,MPI,DP: improve uniq identifier
  ...
dmitry-ganyushin added a commit to dmitry-ganyushin/ADIOS2 that referenced this pull request Dec 7, 2023
* master:
  Merge pull request ornladios#3588 from vicentebolea/fix-mpi-dp
  ...