Convert nvidia-mig.yml to leverage the nvidia-mig-manager systemd service #1106
Conversation
Just had a chat with the lead dev for the MIG Manager: we can remove the dependency on Docker and GitHub, and instead pull down the release .deb and .rpm files so MIG Manager is installed via packages.
Enable/Disable/Reconfigure tested on a DGX Station V100 (no-op) and DGX A100.
@supertetelman : My understanding was that nvidia-mig-parted
should know how to stop and restart any necessary system daemons, but that doesn't seem to have worked in my test... 🤔
Tested on a freshly-installed DGX A100 with DGX OS 5.1.1.
$ git clone https://github.com/nvidia/deepops
$ cd deepops
$ git checkout -b supertetelman-mig-manager-bare-metal master
$ git pull https://github.com/supertetelman/deepops.git mig-manager-bare-metal
$ ./scripts/setup.sh
(edited config/inventory to add localhost)
$ ansible-playbook -b -l localhost -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml
... <snip> ...
TASK [Apply MIG configuration] *************************************************
fatal: [localhost]: FAILED! => changed=true
cmd:
- nvidia-mig-parted
- apply
- -f
- /etc/nvidia-mig-manager/config.yml
- -c
- all-1g.10gb
delta: '0:00:12.795265'
end: '2022-02-22 15:44:32.609368'
msg: non-zero return code
rc: 1
start: '2022-02-22 15:44:19.814103'
stderr: |-
time="2022-02-22T15:44:32-08:00" level=error msg="The following GPUs could not be reset:\n GPU 00000000:07:00.0: In use by another client\n GPU 00000000:0F:00.0: In use by another client\n GPU 00000000:47:00.0: In use by another client\n GPU 00000000:4E:00.0: In use by another client\n GPU 00000000:87:00.0: In use by another client\n GPU 00000000:90:00.0: In use by another client\n GPU 00000000:B7:00.0: In use by another client\n GPU 00000000:BD:00.0: In use by another client\n\n8 devices are currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using these devices and all compute applications running in the system.\n"
time="2022-02-22T15:44:32-08:00" level=fatal msg="Error resetting all GPUs: exit status 255"
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
No CUDA applications running, so this should just be system services.
Anything special I need to do here?
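One way to see what is still holding the devices before retrying the apply (a hedged diagnostic sketch; the service names are my assumptions for a typical DGX OS install, not something this PR configures):

```shell
# Show which processes currently have the NVIDIA device nodes open.
echo "Processes with /dev/nvidia* open:"
lsof /dev/nvidia* 2>/dev/null || true

# System services that commonly hold the GPUs on a DGX (names assumed);
# stop any that are active (e.g. systemctl stop nvidia-dcgm) before
# running nvidia-mig-parted apply again.
for svc in nvidia-fabricmanager nvidia-dcgm nvsm-core; do
    if systemctl is-active --quiet "$svc" 2>/dev/null; then
        echo "$svc is active"
    fi
done
echo "Device check complete."
```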
Your understanding is correct; the only default services that mig-parted will not be able to stop are K8s services. It is assumed that with K8s you are using MIG Manager in the GPU Operator. Otherwise, this would be a bug with the default mig-manager deployment.
@supertetelman : The issue I'm running into looks like an issue with the underlying tool, but AFAICT the Ansible itself is doing what it should! 🤷🏻‍♂️
I'm approving this so you can merge when you want, but your call on whether to merge now or hold off.
tags: disable, never
when: deepops_mig_devices | default("") != "" and deepops_mig_devices | default("") != "all"
- name: Apply MIG configuration
  command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }}
After discussion in an internal bug, this line should be:
command: nvidia-mig-parted apply -f {{ mig_manager_config }} -c {{ mig_manager_profile }} -k /etc/nvidia-mig-manager/hooks.yaml
Where you might choose to put the hooks file in a variable as well. :)
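For instance (a sketch only; the `mig_manager_hooks` variable name and its default are my assumption, not something defined in this PR):

```yaml
- name: Apply MIG configuration
  command: >-
    nvidia-mig-parted apply
    -f {{ mig_manager_config }}
    -c {{ mig_manager_profile }}
    -k {{ mig_manager_hooks | default('/etc/nvidia-mig-manager/hooks.yaml') }}
```

That way a site can point at a custom hooks file without editing the playbook.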
Tested successfully on a DGX A100. 🎉
$ ansible-playbook -b -l $(hostname) -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml
...
$ nvidia-smi -L | head
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-fcb8230f-657b-e58a-4694-dfb23bf1c4d4)
MIG 1g.10gb Device 0: (UUID: MIG-42633f28-45cf-50d0-a6cb-2300e3990588)
MIG 1g.10gb Device 1: (UUID: MIG-1efacfc1-b1cc-5014-8304-3d55fb6830bf)
MIG 1g.10gb Device 2: (UUID: MIG-7cd2dabf-73d6-5549-8349-0176edb3910a)
MIG 1g.10gb Device 3: (UUID: MIG-95122cfc-47a4-56a1-8bba-d83a7fdff025)
MIG 1g.10gb Device 4: (UUID: MIG-b735d920-7a4e-58c4-80a2-c8e59e97a577)
MIG 1g.10gb Device 5: (UUID: MIG-9d197e9c-ca65-5a8c-803c-4af337ec4b5a)
MIG 1g.10gb Device 6: (UUID: MIG-4dbfe475-3001-5cb9-ac01-c2acb56392ab)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-7ff07158-723e-9284-233a-abb461738d29)
MIG 1g.10gb Device 0: (UUID: MIG-baaf6225-2104-5faf-94d8-8db41ff37828)
LGTM, merging
This update does the following:
The new playbook is meant to be run on bare-metal and Slurm systems. It does the following across all nodes:
Test plan:
Due to the hardware requirements of this playbook, no automated testing will be added. Manual testing should cover a fresh MIG-capable system without mig-parted installed, the same system again with mig-parted already installed, and a run with MIG enabled and then disabled, plus a test on a non-MIG system.
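A rough sketch of that manual enable/disable sequence, following the transcripts above (the profile name and playbook path come from this PR; the `--tags disable` invocation is my reading of the `tags: disable, never` tasks in the diff):

```
$ # Fresh MIG-capable node: apply a profile and verify
$ ansible-playbook -b -l $(hostname) -e mig_manager_profile="all-1g.10gb" playbooks/nvidia-software/nvidia-mig.yml
$ nvidia-smi -L          # expect MIG 1g.10gb devices listed under each GPU
$ # Then disable again; the disable tasks only run when their tag is requested
$ ansible-playbook -b -l $(hostname) --tags disable playbooks/nvidia-software/nvidia-mig.yml
$ nvidia-smi -L          # expect plain GPUs with no MIG devices
```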