
Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service #1106

Merged: 5 commits into NVIDIA:master on Mar 8, 2022

Conversation

supertetelman (Collaborator):

This update does the following:

  • Removes all old content from the nvidia-mig.yml playbook
  • Updates the playbook to use nvidia-mig-parted
  • Creates nvidia-mig-config.yml, a default cluster-wide MIG config file

The new playbook is meant to be run on bare-metal and Slurm systems. It does the following across all nodes (see the task sketch after this list):

  • Detects whether any MIG-capable GPUs are present
  • Detects whether nvidia-mig-parted is installed
  • Installs the nvidia-mig-parted systemd service if it is not already installed (along with the required Docker)
  • Copies over config.yml
  • Applies MIG profiles, reboots nodes if necessary, and validates the configuration
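
A minimal sketch of that flow, for orientation only. The detection command and task layout here are assumptions rather than the merged playbook; the config path matches the one used later in this PR, and mig_manager_profile also appears below.

# Hypothetical sketch; not the merged playbook verbatim.
- hosts: all
  become: yes
  tasks:
    - name: Detect MIG-capable GPUs
      command: nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
      register: mig_mode
      changed_when: false
      failed_when: false

    - name: Copy over the cluster-wide MIG config
      copy:
        src: nvidia-mig-config.yml
        dest: /etc/nvidia-mig-manager/config.yml
      when: "'[N/A]' not in mig_mode.stdout"

    - name: Apply MIG configuration
      command: nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yml -c {{ mig_manager_profile }}
      when: "'[N/A]' not in mig_mode.stdout"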

Test plan:
Because this playbook requires MIG-capable hardware, no automated testing can be added. Manual testing should cover a fresh MIG-capable system without mig-parted installed, the same system with mig-parted already installed, and MIG enabled and then disabled, in addition to a test on a non-MIG system.

@supertetelman supertetelman marked this pull request as draft February 12, 2022 07:39
@supertetelman supertetelman changed the title Initial draft at wrapping nvidia-mig.yml around mig-parted [WIP] Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service Feb 12, 2022
supertetelman (Collaborator, Author):

Just had a chat with the lead developer for MIG Manager: we can remove the dependency on Docker and GitHub, and instead install MIG Manager via packages by pulling down the release .deb and .rpm files.
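
A sketch of what such a package-based install task might look like. The package name nvidia-mig-manager and its availability in an already-configured NVIDIA repository are assumptions here, not details taken from this PR.

# Assumption: nvidia-mig-manager is available from a package repository
# already configured on the node; adjust the package name to your setup.
- name: Install nvidia-mig-manager from packages (Debian/Ubuntu)
  apt:
    name: nvidia-mig-manager
    state: present
  when: ansible_os_family == "Debian"

- name: Install nvidia-mig-manager from packages (RHEL)
  yum:
    name: nvidia-mig-manager
    state: present
  when: ansible_os_family == "RedHat"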

supertetelman (Collaborator, Author):

Enable/Disable/Reconfigure tested on a DGX Station V100 (no-op) and DGX A100.

@supertetelman supertetelman marked this pull request as ready for review February 15, 2022 00:13
@supertetelman supertetelman changed the title [WIP] Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service Feb 15, 2022
@ajdecon ajdecon self-assigned this Feb 16, 2022
ajdecon (Collaborator) left a comment:

@supertetelman : My understanding was that nvidia-mig-parted should know how to stop and restart any necessary system daemons, but that doesn't seem to have worked in my test... 🤔

Tested on a freshly-installed DGX A100 with DGX OS 5.1.1.

$ git clone https://github.com/nvidia/deepops
$ cd deepops
$ git checkout -b supertetelman-mig-manager-bare-metal master
$ git pull https://github.com/supertetelman/deepops.git mig-manager-bare-metal
$ ./scripts/setup.sh

(edited config/inventory to add localhost)

$ ansible-playbook -b -l localhost -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml

... <snip> ...

TASK [Apply MIG configuration] *************************************************************************************
fatal: [localhost]: FAILED! => changed=true
  cmd:
  - nvidia-mig-parted
  - apply
  - -f
  - /etc/nvidia-mig-manager/config.yml
  - -c
  - all-1g.10gb
  delta: '0:00:12.795265'
  end: '2022-02-22 15:44:32.609368'
  msg: non-zero return code
  rc: 1
  start: '2022-02-22 15:44:19.814103'
  stderr: |-
    time="2022-02-22T15:44:32-08:00" level=error msg="The following GPUs could not be reset:\n  GPU 00000000:07:00.0: In use by another client\n  GPU 00000000:0F:00.0: In use by another client\n  GPU 00000000:47:00.0: In use by another client\n  GPU 00000000:4E:00.0: In use by another client\n  GPU 00000000:87:00.0: In use by another client\n  GPU 00000000:90:00.0: In use by another client\n  GPU 00000000:B7:00.0: In use by another client\n  GPU 00000000:BD:00.0: In use by another client\n\n8 devices are currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using these devices and all compute applications running in the system.\n"
    time="2022-02-22T15:44:32-08:00" level=fatal msg="Error resetting all GPUs: exit status 255"
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

No CUDA applications running, so this should just be system services.

Anything special I need to do here?
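
For reference, one generic way to see which system services are holding the GPU devices (standard Linux/NVIDIA tooling, not a step from this PR; the service names below are examples and vary by system):

$ sudo fuser -v /dev/nvidia*                              # list processes with the devices open
$ sudo systemctl stop nvidia-dcgm nvidia-fabricmanager    # example services only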

supertetelman (Collaborator, Author):

Your understanding is correct; the only default services that mig-manager will not be able to stop are K8s services. It is assumed that with K8s you are using MIG Manager in the GPU Operator; otherwise, this would be a bug in the default MIG Manager deployment.

ajdecon (Collaborator) left a comment:

@supertetelman : The issue I'm running into looks like an issue with the underlying tool, but AFAICT the Ansible itself is doing what it should! 🤷🏻‍♂️

I'm approving this so you can merge when you want, but your call on whether to merge now or hold off.

    tags: disable, never
    when: deepops_mig_devices | default("") != "" and deepops_mig_devices | default("") != "all"

- name: Apply MIG configuration
  command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }}
Collaborator review comment on the line above:

After discussion in an internal bug, this line should be:

command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }} -k /etc/nvidia-mig-manager/hooks.yaml

Where you might choose to put the hooks file in a variable as well. :)
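
A sketch of that suggestion with the hooks path factored into a variable; the variable name mig_manager_hooks is an assumption, not necessarily what was merged.

# In defaults or group_vars (hypothetical variable name):
mig_manager_hooks: /etc/nvidia-mig-manager/hooks.yaml

# In the playbook:
- name: Apply MIG configuration
  command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }} -k {{ mig_manager_hooks }}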

@supertetelman supertetelman force-pushed the mig-manager-bare-metal branch from ffbc364 to ed18f05 Compare March 2, 2022 21:22
ajdecon (Collaborator) left a comment:

Tested successfully on a DGX A100. 🎉

$ ansible-playbook -b -l $(hostname) -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml

...

$ nvidia-smi -L | head
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-fcb8230f-657b-e58a-4694-dfb23bf1c4d4)
  MIG 1g.10gb     Device  0: (UUID: MIG-42633f28-45cf-50d0-a6cb-2300e3990588)
  MIG 1g.10gb     Device  1: (UUID: MIG-1efacfc1-b1cc-5014-8304-3d55fb6830bf)
  MIG 1g.10gb     Device  2: (UUID: MIG-7cd2dabf-73d6-5549-8349-0176edb3910a)
  MIG 1g.10gb     Device  3: (UUID: MIG-95122cfc-47a4-56a1-8bba-d83a7fdff025)
  MIG 1g.10gb     Device  4: (UUID: MIG-b735d920-7a4e-58c4-80a2-c8e59e97a577)
  MIG 1g.10gb     Device  5: (UUID: MIG-9d197e9c-ca65-5a8c-803c-4af337ec4b5a)
  MIG 1g.10gb     Device  6: (UUID: MIG-4dbfe475-3001-5cb9-ac01-c2acb56392ab)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-7ff07158-723e-9284-233a-abb461738d29)
  MIG 1g.10gb     Device  0: (UUID: MIG-baaf6225-2104-5faf-94d8-8db41ff37828)

LGTM, merging

@ajdecon ajdecon merged commit c0ac9b4 into NVIDIA:master Mar 8, 2022
@supertetelman supertetelman deleted the mig-manager-bare-metal branch March 8, 2022 18:38
@ajdecon ajdecon mentioned this pull request Apr 26, 2022