Indirect slurm launch fails (5.0.0rc6) #1345

Closed
david-edwards-arm opened this issue Apr 21, 2022 · 4 comments

@david-edwards-arm

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Open MPI 5.0.0rc6

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

Open MPI 5.0.0rc6

Please describe the system on which you are running

  • Operating system/version: RHEL 7.9
  • Computer hardware: x86_64 cluster
  • WLM: slurm 21.08.5

Details of the problem

Similar to #1251. Using 3rd-party/prrte/examples:

shell$ salloc -N 1
shell$ debugger/indirect mpirun -n 2 hello

gives output ending with:

WAITING FOR APPLICATION LAUNCH
terminate_fn called with status LOST_CONNECTION
Error: Launcher failed

Changing the command to use

shell$ salloc -N 1
shell$ debugger/indirect mpirun --mca plm ssh -n 2 hello

allows the application to be launched.

@rhc54
Contributor

rhc54 commented Apr 21, 2022

You probably need to check with the OMPI folks to see whether they have updated PRRTE on the release branch yet.

@david-edwards-arm
Author

It looks like the submodule pointers have been updated to use the prrte v2.1 branch. However, the fix for #1251 is present only on master and not on v2.1; I haven't checked the status of the other tool fixes on the v2.1 branch. #1249 indicates that v2.1 is the intended branch for the OMPI 5.0.0 release. Can the tool fixes be cherry-picked over?
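
For anyone wanting to repeat this kind of check, here is a minimal sketch of testing whether a given commit is reachable from a branch in a local checkout. `fix_on_branch` is a hypothetical helper, not part of PRRTE or Open MPI, and the commit passed to it is whatever hash you are looking for (the actual fix commit is not named in this thread).

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not part of PRRTE or Open MPI):
# report whether a commit is reachable from a branch in a given repository.
# Usage: fix_on_branch <repo-dir> <commit> <branch>
fix_on_branch() {
    repo=$1
    commit=$2
    branch=$3
    # `git merge-base --is-ancestor A B` exits 0 when A is an ancestor of B.
    # Any non-zero exit (not an ancestor, or an error such as an unknown ref)
    # is reported here as "missing".
    if git -C "$repo" merge-base --is-ancestor "$commit" "$branch"; then
        echo "present"
    else
        echo "missing"
    fi
}
```

Assuming the submodule lives at `3rd-party/prrte` and `$FIX_SHA` holds the hash of interest, `fix_on_branch 3rd-party/prrte "$FIX_SHA" origin/v2.1` would print `present` or `missing`, and `git submodule status 3rd-party/prrte` shows the commit the submodule pointer currently records. A cherry-picked commit gets a new hash, so this check only detects the original commit, not a cherry-picked copy of it.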

@rhc54
Contributor

rhc54 commented Apr 21, 2022

Ah, indeed - I can update the release branch later today.

@rhc54
Contributor

rhc54 commented Apr 21, 2022

Okay, release branch has been updated - you'll need to poke the OMPI folks to update their submodule pointer.

2 participants