Conversation

sjpb commented Oct 9, 2025

Makes it possible to configure GRES for NVIDIA GPUs using nvml autodetection with a single global variable (a minimal example follows the list below).

  • Adds a new variable openhpc_gres_autodetect to configure GRES autodetection for all nodegroups.
  • Removes the need for GresTypes to be added to the configuration.
  • Removes the need to specify openhpc_nodegroups[].gres[].conf if openhpc_gres_autodetect or openhpc_nodegroups[].gres_autodetect is set to nvml.
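
For illustration, a minimal sketch of the new configuration (nodegroup names here are hypothetical):

# group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml   # applies to all nodegroups unless overridden
openhpc_nodegroups:
  - name: gpu                   # GRES detected automatically, no gres/file entries needed
  - name: normal                # harmless on nodes without NVIDIA GPUs or nvidia-smi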

Some additional notes on why this PR is how it is:

Requirements

  1. That it can be unconditionally enabled for CaaS clusters. This implies that:
    a. It has to be possible to turn it on at the top level
    b. When enabled, it has to "not fail" on instances without NVIDIA GPUs, or where nvidia-smi is not installed
  2. That configuration is backward compatible for clusters with nvml autoconfiguration currently defined

This means having a top-level parameter with per-nodegroup overrides, which is actually what gres.conf supports for its AutoDetect parameter. To avoid making this more confusing than it already is, the top-level/override behaviour in this role therefore needed to match Slurm's.
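
For reference, the gres.conf pattern this mirrors looks roughly like the following (node names here are illustrative): a file-level AutoDetect that applies everywhere, overridden on specific NodeName lines:

AutoDetect=nvml
NodeName=gpu-[0-1] AutoDetect=off Name=gpu Type=H200 File=/dev/nvidia[0-3]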

Slurm configuration and terminology

To try to unconfuse things:

  • Note that with e.g. 4x H200 GPUs in a node, the NodeName entry in slurm.conf would include Gres=gpu:H200:4. This is described as name:type:number in the slurm.conf and gres.conf docs. However, note that the slurm.conf parameter GresTypes actually contains the name part, not the type part.
  • Natively, autodetection (any method) only avoids the need to define the File parameter in gres.conf. The definition of name/type/number in slurm.conf is still required! (Illustrative fragments follow below.)
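
As an illustrative example (node names hypothetical), for the 4x H200 case above the relevant entries would look something like:

# slurm.conf - GresTypes takes the name part; the NodeName line takes name:type:number
GresTypes=gpu
NodeName=gpu-[0-1] ... Gres=gpu:H200:4
# gres.conf - with autodetection, File= can be omitted; without it, File= is required
NodeName=gpu-[0-1] Name=gpu Type=H200 File=/dev/nvidia[0-3]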

sjpb requested a review from a team as a code owner October 9, 2025 13:50

Summary of Changes

Hello @sjpb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to GRES (Generic Resource) configuration within the OpenHPC environment, specifically targeting NVIDIA GPUs. It enables fully-automatic GRES detection using the nvml library, streamlining the setup process for GPU-enabled nodes. This feature allows administrators to activate GRES with a single global variable, while also providing per-nodegroup overrides, ensuring backward compatibility and graceful operation on systems without NVIDIA GPUs or nvidia-smi. The changes simplify GRES management by automating GresTypes population and reducing manual gres.conf entries.

Highlights

  • Automatic NVIDIA GRES Configuration: Introduced a fully-automatic GRES configuration mechanism for NVIDIA GPUs using the nvml library, simplifying the setup process.
  • Global and Nodegroup Autodetection: Added openhpc_gres_autodetect as a global variable and gres_autodetect as a per-nodegroup parameter, allowing flexible control over GRES autodetection.
  • New GPU Info Ansible Module: Implemented gpu_info.py, a new Ansible module that queries NVIDIA GPU models and counts using nvidia-smi to generate GRES configuration strings (see the sketch after this list).
  • Dynamic GresTypes Setting: GresTypes in slurm.conf is now automatically populated based on detected or explicitly defined GRES, removing the need for manual configuration.
  • Simplified GRES Configuration: Updated documentation and logic to simplify GRES setup, especially for nvml autodetection, where gres.file is no longer required.
  • Improved Validation: Modified validation tasks to ensure nodegroups have names and to conditionally check for the file key in GRES definitions only when autodetection is explicitly off.
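
For illustration, here is a minimal sketch of the kind of query such a module performs. This is not the actual gpu_info.py implementation from this PR; the nvidia-smi flags are standard, but the mapping of the model string to a Slurm Type (last word of the name) is an assumption.

import subprocess
from collections import Counter

def nvidia_gres_strings():
    """Return Slurm-style gres strings like 'gpu:H200:8', or [] if no GPUs/driver."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        # nvidia-smi missing, or no usable driver/GPUs: report nothing rather than fail
        return []
    counts = Counter(line.strip() for line in proc.stdout.splitlines() if line.strip())
    # assume the Slurm Type is the last word of the model string, e.g. "NVIDIA H200" -> "H200"
    return ["gpu:%s:%d" % (model.split()[-1], n) for model, n in counts.items()]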

gemini-code-assist bot left a comment

Code Review

This pull request introduces support for fully-automatic GRES configuration for NVIDIA GPUs using NVML. The changes are extensive, including a new Ansible module to query GPU info, updates to Ansible tasks and templates, and significant documentation revisions. Overall, the implementation is solid, but there are a few critical issues that need addressing. Specifically, the gres.conf.j2 template can generate an invalid configuration file, the new gpu_info.py module has incorrect documentation, and there are some errors in the README examples. I've provided detailed comments and suggestions for these points.


sjpb commented Oct 10, 2025

Manual testing was carried out using a slurm appliance deployment - the vX labels below match the commits above.

The cluster was configured with the following compute nodes:

  • 2x baremetal nodes with 8x H200 GPUs, using an image with cuda role (so nvidia drivers, nvidia-smi)
  • 1x VM node with the same image and no GPU
  • 1x VM node with the default stackhpc image (so no nvidia drivers or nvidia-smi) and no GPU

So this was intended to cover all combinations of gpu/no gpu/drivers/no drivers, and also test how configuration for multiple nodes is combined.

Test 1 - new general autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml
openhpc_nodegroups:
   - name: gpu
   - name: normal
   - name: no_smi

Result @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=nvml
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
  Gres=gpu:H200:8(S:0-1)

OK.

Test 2 - previous autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: gpu:H200:8
  - name: normal
  - name: no_smi

Result @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=off
NodeName=steveb-test-gpu-[0-1] AutoDetect=nvml Name=gpu Type=H200
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=gpu:H200:8(S:0-1)

OK.

Test 3 - general autodetection w/ manual override for fewer GPUs

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_gres_autodetect: nvml
openhpc_nodegroups:
  - name: gpu
    gres_autodetect: 'off'
    gres:
      - conf: gpu:H200:4
        file: /dev/nvidia[0-3]
  - name: normal
  - name: no_smi

Result @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=nvml
NodeName=steveb-test-gpu-[0-1] AutoDetect=off Name=gpu Type=H200 File=/dev/nvidia[0-3]
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:4
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=gpu:H200:4

Result @ v6:

$ scontrol show config | grep GresTypes
 GresTypes               = gpu

Others the same as for v4.

OK.

Test 4 - manual GRES with no autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: gpu
    gres:
      - conf: gpu:H200:8
        file: /dev/nvidia[0-7]
  - name: normal
  - name: no_smi

Results @ v5:

$ cat /etc/slurm/gres.conf 
AutoDetect=off
NodeName=steveb-test-gpu-[0-1] Name=gpu Type=H200 File=/dev/nvidia[0-7]
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2  Gres=gpu:H200:8
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=gpu:H200:8

OK.

Test 5 - no gres, no autodetection

# environments/6gai/inventory/group_vars/all/openhpc.yml:
openhpc_nodegroups:
  - name: gpu
  - name: normal
  - name: no_smi

Results @ v5: failed, because GresTypes ended up as an empty string.
Results @ v6:

$ grep GresTypes /etc/slurm/slurm.conf

$ cat /etc/slurm/gres.conf 
AutoDetect=off
$ grep NodeName=steveb-test-gpu /etc/slurm/slurm.conf
NodeName=steveb-test-gpu-[0-1] Features=nodegroup_gpu State=UNKNOWN RealMemory=1469905 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
$ scontrol show node steveb-test-gpu-0 | grep Gres
   Gres=(null)


sjpb commented Oct 10, 2025

Also a load of local testing on GresTypes calculation between v5 and v6.


sjpb commented Oct 10, 2025

Tested in slurm appliance CI here: stackhpc/ansible-slurm-appliance#820
