Indirect launch fails using slurm #1251

Closed
david-edwards-arm opened this issue Mar 8, 2022 · 2 comments · Fixed by #1307

Comments

@david-edwards-arm

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Open MPI 5.0.x nightly snapshot openmpi-v5.0.x-202203030340-563c565.tar.gz

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

Open MPI 5.0.x nightly snapshot openmpi-v5.0.x-202203030340-563c565.tar.gz

Please describe the system on which you are running

  • Operating system/version: RHEL 7
  • Computer hardware: x86_64 cluster
  • WLM: Slurm

Details of the problem

This issue arises during indirect launch of a job under the control of a debugger.
Broadly following the indirect.c example, running

shell$ salloc -N 1
shell$ indirect mpirun -n 2 <app>

gives the following output:

An PRTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).

Changing the command to use

shell$ salloc -N 1
shell$ indirect mpirun --mca plm ssh -n 2 <app>

allows the job to complete, so the issue appears to be specific to the Slurm integration.

@rhc54
Contributor

rhc54 commented Mar 25, 2022

@ggouaillardet I could use your help here, if you have a little time. I honestly am having no luck digging into why this launch is failing. In my case, the srun cmd to launch the prteds just hangs. Yet I can copy/paste that same cmd string to launch any other app without problem. Any assistance debugging the problem would be much appreciated.

@rhc54
Contributor

rhc54 commented Mar 26, 2022

@ggouaillardet I finally found the problem. Slurm struck again.
