slurm validation playbook fails - Pyxis/Enroot failing to run jobs on CentOS #784

miketice22 · 2020-12-07T19:15:32Z

After deploying Slurm with 'ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml', the validation playbook fails.

$ ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml -e '{num_gpus: 1}'
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [slurm-master[0]] **********************************************************************************************************************************************************************************************

TASK [Get node count from sinfo] ************************************************************************************************************************************************************************************
changed: [aplcdhen01.datalake.jhuapl.edu]

TASK [Set num_nodes variable] ***************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]

TASK [Set cmd variable] *********************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]

TASK [Print node/gpu counts] ****************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu] => 
  msg:
  - Detected 1 nodes with 1 gpus each.
  - 'Proceeding to run validation test, this may take several minutes: srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1.'

TASK [Execute NCCL test across all nodes and GPUs] ******************************************************************************************************************************************************************
fatal: [aplcdhen01.datalake.jhuapl.edu]: FAILED! => changed=true 
  cmd: |-
    srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1
  delta: '0:01:47.939670'
  end: '2020-12-07 19:07:46.163125'
  msg: non-zero return code
  rc: 1
  start: '2020-12-07 19:05:58.223455'
  stderr: |-
    pyxis: importing docker image ...
    pyxis: creating container filesystem ...
    pyxis: starting container ...
    slurmstepd: error: pyxis: container start failed with error code: 1
    slurmstepd: error: pyxis: printing contents of log file ...
    slurmstepd: error: pyxis:     enroot-nsenter: failed to create user namespace: Invalid argument
    slurmstepd: error: pyxis: couldn't start container
    slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
    slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
    slurmstepd: error: Failed to invoke spank plugin stack
    srun: error: apl-redd-ai02.datalake.jhuapl.edu: task 0: Exited with exit code 1
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>


miketice22 · 2020-12-08T13:39:08Z

Implementing the requirements from https://github.com/NVIDIA/enroot/blob/master/doc/requirements.md fixed this.

supertetelman · 2021-03-05T20:22:20Z

Do you have the requirements you needed to implement to get this working? I just started running into this as well and am going to push a fix into the pyxis role.

miketice22 · 2021-03-05T20:35:44Z

In addition to the link I posted earlier, I just changed/edited some options:

I removed '-G 2' and '-g 2' from the srun command and used '--mpi=pmi2' and '--gres=gpu:2' in their place.
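For anyone who wants to reproduce that change, here is a rough sketch of the adjusted command, built from the srun line in the playbook output above. It is only one reading of the comment: it assumes '--mpi=pmi2' replaces '--mpi=pmix', '--gres=gpu:N' replaces '-G N', and the '-g N' argument to all_reduce_perf is dropped. The helper name build_srun_cmd is made up for illustration and is not part of DeepOps.

```python
import shlex

# Image name copied from the validation playbook output in this issue.
IMAGE = "deepops/nccl-tests-tf20.06-ubuntu18.04:latest"


def build_srun_cmd(num_nodes, num_gpus, image=IMAGE):
    """Return the adjusted srun command as an argument list.

    Hypothetical helper: swaps '-G N' for '--gres=gpu:N', uses
    '--mpi=pmi2' instead of '--mpi=pmix', and omits '-g N' from
    all_reduce_perf, following one reading of the comment above.
    """
    return [
        "srun",
        "-N", str(num_nodes),
        f"--gres=gpu:{num_gpus}",
        "--ntasks-per-node=1",
        "--mpi=pmi2",
        "--exclusive",
        f"--container-image={image}",
        "all_reduce_perf", "-b", "1M", "-e", "4G", "-f", "2",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for 1 node with 2 GPUs.
    print(shlex.join(build_srun_cmd(1, 2)))
```

Keeping the command as a list makes it easy to pass straight to subprocess.run without shell quoting problems.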

supertetelman · 2021-03-09T23:08:16Z

When we resolve this issue we should undo #902.

@miketice22 , did you need to add all the kernel parameter changes or just a few of them? I'm having some trouble identifying the gaps in a vanilla CentOS 7/8 install.

miketice22 · 2021-03-11T19:27:08Z

Had to add all of them.
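To help spot which settings a given node is missing, a small check along these lines may be useful. It is a sketch, not the fix that went into the role. It assumes the relevant settings are the user.max_user_namespaces and user.max_mnt_namespaces sysctls plus, on CentOS/RHEL 7 kernels, the namespace.unpriv_enable=1 and user_namespace.enable=1 boot arguments. Those names are recalled from the Enroot requirements page linked earlier and should be verified there. The function names are made up for illustration.

```python
from pathlib import Path

# Assumed settings; verify against the Enroot requirements document.
REQUIRED_SYSCTLS = ("user.max_user_namespaces", "user.max_mnt_namespaces")
# Assumed to matter only on CentOS/RHEL 7 kernels.
RHEL7_BOOT_ARGS = ("namespace.unpriv_enable=1", "user_namespace.enable=1")


def find_userns_gaps(sysctls, cmdline, boot_args=RHEL7_BOOT_ARGS):
    """Return a list of human-readable problems, empty if none are found.

    sysctls: dict mapping sysctl name to its integer value.
    cmdline: contents of /proc/cmdline as a string.
    """
    gaps = []
    for key in REQUIRED_SYSCTLS:
        value = sysctls.get(key)
        if value is None:
            gaps.append(f"{key} is not available on this kernel")
        elif value < 1:
            # A limit of 0 prevents any namespace of that type from being created.
            gaps.append(f"{key} is {value}; it must be greater than 0")
    present = cmdline.split()
    for arg in boot_args:
        if arg not in present:
            gaps.append(f"kernel command line lacks {arg}")
    return gaps


def read_host_state():
    """Read the current sysctl values and kernel command line from /proc."""
    sysctls = {}
    for key in REQUIRED_SYSCTLS:
        path = Path("/proc/sys") / key.replace(".", "/")
        try:
            sysctls[key] = int(path.read_text().strip())
        except (OSError, ValueError):
            pass  # Leave the key out so it is reported as unavailable.
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    return sysctls, cmdline


if __name__ == "__main__":
    for gap in find_userns_gaps(*read_host_state()):
        print(gap)
```

Run it on a compute node rather than the login node, since the failure above comes from slurmstepd on the node that starts the container. On CentOS 8, pass boot_args=() if the boot arguments turn out not to apply there.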

supertetelman · 2021-03-17T00:01:25Z

This has been addressed here: NVIDIA/ansible-role-enroot#12

It will be making its way back into DeepOps shortly.

github-actions · 2021-12-04T01:02:06Z

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.

miketice22 closed this as completed Dec 8, 2020

supertetelman reopened this Mar 5, 2021

supertetelman changed the title ~~slurm validation playbook fails~~ slurm validation playbook fails - Pyxis/Enroot failing to run jobs on CentOS Mar 5, 2021

supertetelman mentioned this issue Mar 9, 2021

disable enroot tests for CentOS Slurm installs #902

Merged

ajdecon mentioned this issue Mar 19, 2021

Update to nvidia.enroot v0.4.0 role #918

Merged

github-actions bot added the no-issue-activity label Dec 4, 2021

github-actions bot closed this as completed Dec 12, 2021
