Conversation

sjpb commented Oct 9, 2025

Makes it possible to configure GRES for NVIDIA GPUs using nvml autodetection with a single global variable (a minimal example follows the list below).

  • Adds a new variable openhpc_gres_autodetect to configure GRES autodetection for all nodegroups.
  • Removes the need for GresTypes to be added to the configuration.
  • Removes the need to specify openhpc_nodegroups[].gres[].conf if openhpc_gres_autodetect or openhpc_nodegroups[].gres_autodetect is set to nvml.
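
For illustration, a minimal sketch of the new configuration (nodegroup names here are hypothetical):

# group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml   # applies to all nodegroups unless overridden
openhpc_nodegroups:
  - name: gpu                   # GRES detected automatically, no gres/file entries needed
  - name: normal                # harmless on nodes without NVIDIA GPUs or nvidia-smi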

Some additional notes on why this PR is how it is:

Requirements

  1. That it can be unconditionally enabled for CaaS clusters. This implies that:
    a. It has to be possible to turn it on at the top level
    b. When enabled, it has to "not fail" on instances without NVIDIA GPUs, or where nvidia-smi is not installed
  2. That configuration is backward compatible for clusters with nvml autoconfiguration currently defined

This means having a top-level parameter with per-nodegroup overrides, which is actually what gres.conf supports for its AutoDetect parameter. To avoid making this more confusing than it already is, the top-level/override behaviour in this role therefore needed to match Slurm's.
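
For reference, the gres.conf pattern this mirrors looks roughly like the following (node names here are illustrative): a file-level AutoDetect that applies everywhere, overridden on specific NodeName lines:

AutoDetect=nvml
NodeName=gpu-[0-1] AutoDetect=off Name=gpu Type=H200 File=/dev/nvidia[0-3]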

Slurm configuration and terminology

To try to unconfuse things:

  • Note that with e.g. 4x H200 GPUs in a node, the NodeName entry in slurm.conf would include Gres=gpu:H200:4. This is described as name:type:number in the slurm.conf and gres.conf docs. However, note that the slurm.conf parameter GresTypes actually contains the name part, not the type part.
  • Natively, autodetection (any method) only avoids the need to define the File parameter in gres.conf. The definition of name/type/number in slurm.conf is still required! (Illustrative fragments follow below.)
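
As an illustrative example (node names hypothetical), for the 4x H200 case above the relevant entries would look something like:

# slurm.conf - GresTypes takes the name part; the NodeName line takes name:type:number
GresTypes=gpu
NodeName=gpu-[0-1] ... Gres=gpu:H200:4
# gres.conf - with autodetection, File= can be omitted; without it, File= is required
NodeName=gpu-[0-1] Name=gpu Type=H200 File=/dev/nvidia[0-3]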

sjpb requested a review from a team as a code owner October 9, 2025 13:50

Summary of Changes

Hello @sjpb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to GRES (Generic Resource) configuration within the OpenHPC environment, specifically targeting NVIDIA GPUs. It enables fully-automatic GRES detection using the nvml library, streamlining the setup process for GPU-enabled nodes. This feature allows administrators to activate GRES with a single global variable, while also providing per-nodegroup overrides, ensuring backward compatibility and graceful operation on systems without NVIDIA GPUs or nvidia-smi. The changes simplify GRES management by automating GresTypes population and reducing manual gres.conf entries.

Highlights

  • Automatic NVIDIA GRES Configuration: Introduced a fully-automatic GRES configuration mechanism for NVIDIA GPUs using the nvml library, simplifying the setup process.
  • Global and Nodegroup Autodetection: Added openhpc_gres_autodetect as a global variable and gres_autodetect as a per-nodegroup parameter, allowing flexible control over GRES autodetection.
  • New GPU Info Ansible Module: Implemented gpu_info.py, a new Ansible module that queries NVIDIA GPU models and counts using nvidia-smi to generate GRES configuration strings (see the sketch after this list).
  • Dynamic GresTypes Setting: GresTypes in slurm.conf is now automatically populated based on detected or explicitly defined GRES, removing the need for manual configuration.
  • Simplified GRES Configuration: Updated documentation and logic to simplify GRES setup, especially for nvml autodetection, where gres.file is no longer required.
  • Improved Validation: Modified validation tasks to ensure nodegroups have names and to conditionally check for the file key in GRES definitions only when autodetection is explicitly off.
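
For illustration, here is a minimal sketch of the kind of query such a module performs. This is not the actual gpu_info.py implementation from this PR; the nvidia-smi flags are standard, but the mapping of the model string to a Slurm Type (last word of the name) is an assumption.

import subprocess
from collections import Counter

def nvidia_gres_strings():
    """Return Slurm-style gres strings like 'gpu:H200:8', or [] if no GPUs/driver."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # nvidia-smi missing, or no usable driver/GPUs: report nothing rather than fail
        return []
    counts = Counter(line.strip() for line in proc.stdout.splitlines() if line.strip())
    # assume the Slurm Type is the last word of the model string, e.g. "NVIDIA H200" -> "H200"
    return ["gpu:%s:%d" % (model.split()[-1], n) for model, n in counts.items()]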

gemini-code-assist bot left a comment

Code Review

This pull request introduces support for fully-automatic GRES configuration for NVIDIA GPUs using NVML. The changes are extensive, including a new Ansible module to query GPU info, updates to Ansible tasks and templates, and significant documentation revisions. Overall, the implementation is solid, but there are a few critical issues that need addressing. Specifically, the gres.conf.j2 template can generate an invalid configuration file, the new gpu_info.py module has incorrect documentation, and there are some errors in the README examples. I've provided detailed comments and suggestions for these points.


sjpb commented Oct 10, 2025

Manual testing was carried out using a slurm appliance deployment - the vX labels below match the commits above.

The cluster was configured with the following compute nodes:

  • 2x baremetal nodes with 8x H200 GPUs, using an image with cuda role (so nvidia drivers, nvidia-smi)
  • 1x VM node with the same image and no GPU
  • 1x VM node with the default stackhpc image (so no nvidia drivers or nvidia-smi) and no GPU

So this was intended to cover all combinations of gpu/no gpu/drivers/no drivers, and also test how configuration for multiple nodes is combined.

Test 1 - new general autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml
openhpc_nodegroups:
   - name: gpu
   - name: normal
   - name: no_smi

Result @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=nvml
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
  Gres=gpu:H200:8(S:0-1)

OK.

Test 2 - previous autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: gpu:H200:8
  - name: normal
  - name: no_smi

Result @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=off
NodeName=steveb-test-gpu-[0-1] AutoDetect=nvml Name=gpu Type=H200
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=gpu:H200:8(S:0-1)

OK.

Test 3 - general autodetection w/ manual override for fewer GPUs

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml
openhpc_nodegroups:
  - name: gpu
    gres_autodetect: 'off'
    gres:
      - conf: gpu:H200:4
        file: /dev/nvidia[0-3]
  - name: normal
  - name: no_smi

Result @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=nvml
NodeName=steveb-test-gpu-[0-1] AutoDetect=off Name=gpu Type=H200 File=/dev/nvidia[0-3]
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:4
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=gpu:H200:4

Result @ v6:

$ scontrol show config | grep GresTypes
 GresTypes               = gpu

Others the same as for v4.

OK.

Test 4 - manual GRES with no autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: gpu
    gres:
      - conf: gpu:H200:8
        file: /dev/nvidia[0-7]
  - name: normal
  - name: no_smi

Results @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=off
NodeName=steveb-test-gpu-[0-1] Name=gpu Type=H200 File=/dev/nvidia[0-7]
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=gpu:H200:8

OK.

Test 5 - no gres, no autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: gpu
  - name: normal
  - name: no_smi

Results @ v5: failed, because GresTypes ended up as an empty string.
Results @ v6:

$ grep GresTypes /etc/slurm/slurm.conf

$ cat /etc/slurm/gres.conf 
AutoDetect=off
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=(null)


sjpb commented Oct 10, 2025

Also a load of local testing on GresTypes calculation between v5 and v6.


sjpb commented Oct 10, 2025

Tested in slurm appliance CI here: stackhpc/ansible-slurm-appliance#820
