Support fully-automatic GRES configuration for nvml #202
Code Review

This pull request introduces support for fully-automatic GRES configuration for NVIDIA GPUs using NVML. The changes are extensive, including a new Ansible module to query GPU info, updates to Ansible tasks and templates, and significant documentation revisions. Overall, the implementation is solid, but there are a few critical issues that need addressing. Specifically, the `gres.conf.j2` template can generate an invalid configuration file, the new `gpu_info.py` module has incorrect documentation, and there are some errors in the README examples. I've provided detailed comments and suggestions for these points.
Manual testing was carried out using a slurm appliance deployment. The cluster was configured with compute nodes intended to cover all combinations of gpu/no gpu/drivers/no drivers, and also to test how configuration for multiple nodes is combined.

## Test 1 - new general autodetection

```yaml
# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml
openhpc_nodegroups:
- name: gpu
- name: normal
- name: no_smi
```

Result @ v5:

```
$ cat /etc/slurm/gres.conf
AutoDetect=nvml
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
Gres=gpu:H200:8(S:0-1)
```

OK.

## Test 2 - previous autodetection

```yaml
# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
- name: gpu
  gres_autodetect: nvml
  gres:
  - conf: gpu:H200:8
- name: normal
- name: no_smi
```

Result @ v5:

```
$ cat /etc/slurm/gres.conf
AutoDetect=off
NodeName=steveb-test-gpu-[0-1] AutoDetect=nvml Name=gpu Type=H200
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
Gres=gpu:H200:8(S:0-1)
```

OK.

## Test 3 - general autodetection w/ manual override for fewer GPUs

```yaml
# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml
openhpc_nodegroups:
- name: gpu
  gres_autodetect: 'off'
  gres:
  - conf: gpu:H200:4
    file: /dev/nvidia[0-3]
- name: normal
- name: no_smi
```

Result @ v5:

```
$ cat /etc/slurm/gres.conf
AutoDetect=nvml
NodeName=steveb-test-gpu-[0-1] AutoDetect=off Name=gpu Type=H200 File=/dev/nvidia[0-3]
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 Gres=gpu:H200:4
$ scontrol show node steveb-test-gpu-0 | grep Gres
Gres=gpu:H200:4
```

Result @ v6:

```
$ scontrol show config | grep GresTypes
GresTypes = gpu
```

Others the same as for v5. OK.

## Test 4 - manual GRES with no autodetection

```yaml
# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
- name: gpu
  gres:
  - conf: gpu:H200:8
    file: /dev/nvidia[0-7]
- name: normal
- name: no_smi
```

Results @ v5:

```
$ cat /etc/slurm/gres.conf
AutoDetect=off
NodeName=steveb-test-gpu-[0-1] Name=gpu Type=H200 File=/dev/nvidia[0-7]
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
Gres=gpu:H200:8
```

OK.
## Test 5 - no gres, no autodetection:
```yaml
# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
- name: gpu
- name: normal
- name: no_smi
```

Results @ v5: failed, because

```
$ grep GresTypes /etc/slurm/slurm.conf
$ cat /etc/slurm/gres.conf
AutoDetect=off
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
$ scontrol show node steveb-test-gpu-0 | grep Gres
Gres=(null)
```
Also a load of local testing on `GresTypes` calculation between v5 and v6.
Tested in slurm appliance CI here: stackhpc/ansible-slurm-appliance#820
Makes it possible to configure GRES for NVIDIA GPUs using `nvml` configuration with a single global variable:

- Adds `openhpc_gres_autodetect` to configure GRES autodetection for all nodegroups.
- Automatically adds the required `GresTypes` to the configuration.
- `openhpc_nodegroups[].gres[].conf` is not required if `openhpc_gres_autodetect` or `openhpc_nodegroups[].gres_autodetect` are set to `nvml`.
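For illustration only, this is roughly the Slurm configuration that results on a GPU node when just the global variable is set and no per-nodegroup `gres` is defined; it mirrors the Test 1 and Test 3 "v6" results earlier in this thread (node names and GPU details are from that test cluster, and other `NodeName` parameters are omitted):

```
# gres.conf (generated): enable autodetection everywhere
AutoDetect=nvml

# slurm.conf (generated): GresTypes and the per-node Gres are filled in automatically
GresTypes=gpu
NodeName=steveb-test-gpu-[0-1] ... Gres=gpu:H200:8
```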
Some additional notes on why this PR is how it is:
Requirements:

a. It has to be possible to turn it on at the top level.
b. When enabled, it has to "not fail" on instances without NVIDIA GPUs, or where `nvidia-smi` is not installed.
c. It has to remain compatible with the per-nodegroup `nvml` autoconfiguration currently defined.

This means having a top-level parameter with overrides, which is actually what `gres.conf` supports for the `AutoDetect` parameter. To avoid this being even more confusing than it is, the top-level/override behaviour in this role therefore needed to match Slurm's.
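For reference, the Slurm behaviour being matched is that a bare `AutoDetect=` line in `gres.conf` applies cluster-wide, and individual `NodeName` lines can override it. A minimal sketch (node names and GRES details are illustrative):

```
# gres.conf: global autodetection, overridden for one set of nodes
AutoDetect=nvml
NodeName=gpu-[0-1] AutoDetect=off Name=gpu Type=H200 File=/dev/nvidia[0-3]
```

This is the shape of the Test 3 output above, where the nodegroup-level `gres_autodetect: 'off'` overrides the global `openhpc_gres_autodetect: nvml`.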
Slurm configuration and terminology

To try to unconfuse things:

- For e.g. a node with 4x H100 GPUs, the Node entry in `slurm.conf` would be `gpu:H100:4`. This is described as `name:type:number` in the `slurm.conf` and `gres.conf` docs. However note the `slurm.conf` parameter `GresTypes` actually contains the `name` part, not the `type` part.
- With autodetection there is no need for the `File` parameter in `gres.conf`. The definition of name/type/number in `slurm.conf` is still required!
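To make the name/type/number distinction concrete, a hand-written (non-autodetected) configuration for the 4x H100 example above might look like this; the node name is illustrative and other `NodeName` parameters are omitted:

```
# slurm.conf: GresTypes lists only the GRES *name*; the Gres value is name:type:number
GresTypes=gpu
NodeName=gpu-0 ... Gres=gpu:H100:4

# gres.conf: File is only needed in this manual, non-autodetect case
NodeName=gpu-0 Name=gpu Type=H100 File=/dev/nvidia[0-3]
```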