slurm validation playbook fails - Pyxis/Enroot failing to run jobs on CentOS #784

Closed
miketice22 opened this issue Dec 7, 2020 · 7 comments

Comments

@miketice22

miketice22 commented Dec 7, 2020

After deploying Slurm with 'ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml', the validation playbook fails.

$ ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml -e '{num_gpus: 1}'
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [slurm-master[0]] **********************************************************************************************************************************************************************************************

TASK [Get node count from sinfo] ************************************************************************************************************************************************************************************
changed: [aplcdhen01.datalake.jhuapl.edu]

TASK [Set num_nodes variable] ***************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]

TASK [Set cmd variable] *********************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu]

TASK [Print node/gpu counts] ****************************************************************************************************************************************************************************************
ok: [aplcdhen01.datalake.jhuapl.edu] => 
  msg:
  - Detected 1 nodes with 1 gpus each.
  - 'Proceeding to run validation test, this may take several minutes: srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1.'

TASK [Execute NCCL test across all nodes and GPUs] ******************************************************************************************************************************************************************
fatal: [aplcdhen01.datalake.jhuapl.edu]: FAILED! => changed=true 
  cmd: |-
    srun -N 1 -G 1 --ntasks-per-node=1 --mpi=pmix --exclusive  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest all_reduce_perf -b 1M -e 4G -f 2 -g 1
  delta: '0:01:47.939670'
  end: '2020-12-07 19:07:46.163125'
  msg: non-zero return code
  rc: 1
  start: '2020-12-07 19:05:58.223455'
  stderr: |-
    pyxis: importing docker image ...
    pyxis: creating container filesystem ...
    pyxis: starting container ...
    slurmstepd: error: pyxis: container start failed with error code: 1
    slurmstepd: error: pyxis: printing contents of log file ...
    slurmstepd: error: pyxis:     enroot-nsenter: failed to create user namespace: Invalid argument
    slurmstepd: error: pyxis: couldn't start container
    slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
    slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
    slurmstepd: error: Failed to invoke spank plugin stack
    srun: error: apl-redd-ai02.datalake.jhuapl.edu: task 0: Exited with exit code 1
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
@miketice22
Author

Implementing the requirements from https://github.com/NVIDIA/enroot/blob/master/doc/requirements.md fixed this.
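For anyone hitting the same "failed to create user namespace: Invalid argument" error, here is a minimal sketch of a check to run on a compute node. It assumes (this is not confirmed in this thread) that the `user.max_user_namespaces` sysctl is one of the relevant settings; the helper name `userns_limit_ok` is hypothetical, and the check does not cover every item on the enroot requirements page.

```shell
#!/bin/sh
# Sketch only: report whether the user namespace limit is non-zero.
# Assumption: a limit of 0 means user namespaces are effectively disabled.

# Hypothetical helper: succeeds if $1 is a positive integer,
# returns 1 if it is 0, and returns 2 if it is not a number at all.
userns_limit_ok() {
    case "$1" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$1" -gt 0 ]
}

limit_file=/proc/sys/user/max_user_namespaces
if [ -r "$limit_file" ]; then
    limit=$(cat "$limit_file")
    if userns_limit_ok "$limit"; then
        echo "user.max_user_namespaces=$limit (non-zero)"
    else
        echo "user.max_user_namespaces=$limit (user namespaces look disabled)"
    fi
else
    echo "$limit_file not readable; the kernel may lack user namespace support"
fi
```

A non-zero value here does not by itself guarantee that enroot will start; it only rules out one possible cause.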

@supertetelman
Collaborator

Do you have the requirements you needed to implement to get this working? I just started running into this as well and am going to push a fix into the pyxis role.

@supertetelman supertetelman reopened this Mar 5, 2021
@supertetelman supertetelman changed the title slurm validation playbook fails slurm validation playbook fails - Pyxis/Enroot failing to run jobs on CentOS Mar 5, 2021
@miketice22
Author

In addition to following the link I posted earlier, I changed some options on the srun command line:

I removed '-G 2' and '-g 2', and added (or replaced existing options with) '--mpi=pmi2' and '--gres=gpu:2'.
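The option changes described above can be sketched as a small helper that prints the adjusted command. This assumes the remaining options (node count, container image, and the `all_reduce_perf` arguments) stay as in the validation playbook's original command, which this thread does not confirm; `build_srun_cmd` is a hypothetical name, not part of DeepOps.

```shell
#!/bin/sh
# Sketch only: print the srun command with the reported option changes.
# Assumption: everything except -G, -g, and --mpi is carried over
# unchanged from the command shown in the playbook output.

# Hypothetical helper. $1 = node count, $2 = GPUs per node.
build_srun_cmd() {
    nodes=$1
    gpus=$2
    image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest
    # --gres=gpu:N replaces -G N, --mpi=pmi2 replaces --mpi=pmix,
    # and the trailing '-g N' argument to all_reduce_perf is dropped.
    printf '%s\n' "srun -N $nodes --gres=gpu:$gpus --ntasks-per-node=1 --mpi=pmi2 --exclusive --container-image=$image all_reduce_perf -b 1M -e 4G -f 2"
}

# Print the command for 1 node with 2 GPUs; review it before running.
build_srun_cmd 1 2
```

The helper only prints the command, so it can be inspected or pasted into a shell without launching a job by accident.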

@supertetelman
Collaborator

When we resolve this issue we should undo #902.

@miketice22, did you need to add all of the kernel parameter changes or just a few of them? I'm having some trouble identifying the gaps in a vanilla CentOS 7/8 install.

@miketice22
Author

Had to add all of them.

@supertetelman
Collaborator

This has been addressed here: NVIDIA/ansible-role-enroot#12

It will be making its way back into DeepOps shortly.

@github-actions

github-actions bot commented Dec 4, 2021

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.
