
Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service #1106

Merged: 5 commits into NVIDIA:master on Mar 8, 2022

Conversation

supertetelman (Collaborator):

This update does the following:

  • Removes all old content from the nvidia-mig.yml playbook
  • Updates the playbook to use nvidia-mig-parted
  • Creates nvidia-mig-config.yml, a default cluster-wide MIG config file

The new playbook is meant to be run on bare-metal and Slurm systems. It does the following across all nodes (see the task sketch after this list):

  • Detects whether any MIG-capable GPUs are present
  • Detects whether nvidia-mig-parted is installed
  • Installs the nvidia-mig-parted systemd service if it is not already installed (along with the required Docker)
  • Copies over config.yml
  • Applies MIG profiles, reboots nodes if necessary, and validates the configuration
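
A minimal sketch of that flow, for orientation only. The detection command and task layout here are assumptions rather than the merged playbook; the config path matches the one used later in this PR, and mig_manager_profile also appears below.

# Hypothetical sketch; not the merged playbook verbatim.
- hosts: all
  become: yes
  tasks:
    - name: Detect MIG-capable GPUs
      command: nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
      register: mig_mode
      changed_when: false
      failed_when: false

    - name: Copy over the cluster-wide MIG config
      copy:
        src: nvidia-mig-config.yml
        dest: /etc/nvidia-mig-manager/config.yml
      when: "'[N/A]' not in mig_mode.stdout"

    - name: Apply MIG configuration
      command: nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yml -c {{ mig_manager_profile }}
      when: "'[N/A]' not in mig_mode.stdout"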

Test plan:
Because this playbook requires MIG-capable hardware, no automated testing can be added. Manual testing should cover a fresh MIG-capable system without mig-parted installed, the same system with mig-parted already installed, and MIG enabled and then disabled, in addition to a test on a non-MIG system.

@supertetelman supertetelman marked this pull request as draft February 12, 2022 07:39
@supertetelman supertetelman changed the title Initial draft at wrapping nvidia-mig.yml around mig-parted [WIP] Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service Feb 12, 2022
supertetelman (Collaborator, Author):

Just had a chat with the lead developer for MIG Manager: we can remove the dependency on Docker and GitHub, and instead install MIG Manager via packages by pulling down the release .deb and .rpm files.
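
A sketch of what such a package-based install task might look like. The package name nvidia-mig-manager and its availability in an already-configured NVIDIA repository are assumptions here, not details taken from this PR.

# Assumption: nvidia-mig-manager is available from a package repository
# already configured on the node; adjust the package name to your setup.
- name: Install nvidia-mig-manager from packages (Debian/Ubuntu)
  apt:
    name: nvidia-mig-manager
    state: present
  when: ansible_os_family == "Debian"

- name: Install nvidia-mig-manager from packages (RHEL)
  yum:
    name: nvidia-mig-manager
    state: present
  when: ansible_os_family == "RedHat"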

supertetelman (Collaborator, Author):

Enable/Disable/Reconfigure tested on a DGX Station V100 (no-op) and DGX A100.

@supertetelman supertetelman marked this pull request as ready for review February 15, 2022 00:13
@supertetelman supertetelman changed the title [WIP] Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service Feb 15, 2022
@ajdecon ajdecon self-assigned this Feb 16, 2022
ajdecon (Collaborator) left a comment:

@supertetelman : My understanding was that nvidia-mig-parted should know how to stop and restart any necessary system daemons, but that doesn't seem to have worked in my test... 🤔

Tested on a freshly-installed DGX A100 with DGX OS 5.1.1.

$ git clone https://github.com/nvidia/deepops
$ cd deepops
$ git checkout -b supertetelman-mig-manager-bare-metal master
$ git pull https://github.com/supertetelman/deepops.git mig-manager-bare-metal
$ ./scripts/setup.sh

(edited config/inventory to add localhost)

$ ansible-playbook -b -l localhost -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml

... <snip> ...

TASK [Apply MIG configuration] *************************************************************************************
fatal: [localhost]: FAILED! => changed=true
  cmd:
  - nvidia-mig-parted
  - apply
  - -f
  - /etc/nvidia-mig-manager/config.yml
  - -c
  - all-1g.10gb
  delta: '0:00:12.795265'
  end: '2022-02-22 15:44:32.609368'
  msg: non-zero return code
  rc: 1
  start: '2022-02-22 15:44:19.814103'
  stderr: |-
    time="2022-02-22T15:44:32-08:00" level=error msg="The following GPUs could not be reset:\n  GPU 00000000:07:00.0: In use by another client\n  GPU 00000000:0F:00.0: In use by another client\n  GPU 00000000:47:00.0: In use by another client\n  GPU 00000000:4E:00.0: In use by another client\n  GPU 00000000:87:00.0: In use by another client\n  GPU 00000000:90:00.0: In use by another client\n  GPU 00000000:B7:00.0: In use by another client\n  GPU 00000000:BD:00.0: In use by another client\n\n8 devices are currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using these devices and all compute applications running in the system.\n"
    time="2022-02-22T15:44:32-08:00" level=fatal msg="Error resetting all GPUs: exit status 255"
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

No CUDA applications running, so this should just be system services.

Anything special I need to do here?
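
For reference, one generic way to see which system services are holding the GPU devices (standard Linux/NVIDIA tooling, not a step from this PR; the service names below are examples and vary by system):

$ sudo fuser -v /dev/nvidia*                              # list processes with the devices open
$ sudo systemctl stop nvidia-dcgm nvidia-fabricmanager    # example services only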

supertetelman (Collaborator, Author):

Your understanding is correct; the only default services that mig-manager will not be able to stop are K8s services. It is assumed that with K8s you are using MIG Manager in the GPU Operator; otherwise, this would be a bug in the default MIG Manager deployment.

ajdecon (Collaborator) left a comment:

@supertetelman : The issue I'm running into looks like an issue with the underlying tool, but AFAICT the Ansible itself is doing what it should! 🤷🏻‍♂️

I'm approving this so you can merge when you want, but your call on whether to merge now or hold off.

    tags: disable, never
    when: deepops_mig_devices | default("") != "" and deepops_mig_devices | default("") != "all"

- name: Apply MIG configuration
  command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }}
Collaborator review comment on the line above:

After discussion in an internal bug, this line should be:

command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }} -k /etc/nvidia-mig-manager/hooks.yaml

Where you might choose to put the hooks file in a variable as well. :)
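
A sketch of that suggestion with the hooks path factored into a variable; the variable name mig_manager_hooks is an assumption, not necessarily what was merged.

# In defaults or group_vars (hypothetical variable name):
mig_manager_hooks: /etc/nvidia-mig-manager/hooks.yaml

# In the playbook:
- name: Apply MIG configuration
  command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }} -k {{ mig_manager_hooks }}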

@supertetelman supertetelman force-pushed the mig-manager-bare-metal branch from ffbc364 to ed18f05 Compare March 2, 2022 21:22
ajdecon (Collaborator) left a comment:

Tested successfully on a DGX A100. 🎉

$ ansible-playbook -b -l $(hostname) -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml

...

$ nvidia-smi -L | head
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-fcb8230f-657b-e58a-4694-dfb23bf1c4d4)
  MIG 1g.10gb     Device  0: (UUID: MIG-42633f28-45cf-50d0-a6cb-2300e3990588)
  MIG 1g.10gb     Device  1: (UUID: MIG-1efacfc1-b1cc-5014-8304-3d55fb6830bf)
  MIG 1g.10gb     Device  2: (UUID: MIG-7cd2dabf-73d6-5549-8349-0176edb3910a)
  MIG 1g.10gb     Device  3: (UUID: MIG-95122cfc-47a4-56a1-8bba-d83a7fdff025)
  MIG 1g.10gb     Device  4: (UUID: MIG-b735d920-7a4e-58c4-80a2-c8e59e97a577)
  MIG 1g.10gb     Device  5: (UUID: MIG-9d197e9c-ca65-5a8c-803c-4af337ec4b5a)
  MIG 1g.10gb     Device  6: (UUID: MIG-4dbfe475-3001-5cb9-ac01-c2acb56392ab)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-7ff07158-723e-9284-233a-abb461738d29)
  MIG 1g.10gb     Device  0: (UUID: MIG-baaf6225-2104-5faf-94d8-8db41ff37828)

LGTM, merging

@ajdecon ajdecon merged commit c0ac9b4 into NVIDIA:master Mar 8, 2022
@supertetelman supertetelman deleted the mig-manager-bare-metal branch March 8, 2022 18:38
@ajdecon ajdecon mentioned this pull request Apr 26, 2022