feat(hpc): implement software environment modules #512

Merged
adolago merged 3 commits into main from hpc/software-stack on Feb 11, 2026

Conversation

@adolago (Owner) commented Feb 11, 2026

Summary

  • Replace lmod stub with full Lmod installation, module path management, and profile.d integration
  • Replace mpi_config stub with OpenMPI/Intel MPI installation, MCA parameter config, and optional Lmod modulefiles
  • Add new hpc_toolchain module for OS-specific package set installation (build_essentials, perf_tools, debug_tools, rdma_userland)

Closes #482, Closes #485, Closes #488

Test plan

  • cargo build --features hpc,slurm,gpu,ofed,parallel_fs compiles cleanly
  • New hpc_toolchain module registered in ModuleRegistry
  • check_mode paths return correctly for all three modules

🤖 Generated with Claude Code

adolago and others added 3 commits February 11, 2026 14:17
…k, facts)

Replace HPC baseline stub with full implementation supporting limits,
sysctl, packages, directories, and tuned profiles. Add four new modules:

- munge: MUNGE authentication service management with key distribution
- nfs: NFS server (exports management) and client (fstab/mount) modules
- healthcheck: configurable node health checks (munge, NFS, services, GPU, IB)
- facts: HPC fact gathering (CPU features, NUMA, hugepages, GPU, InfiniBand)

All modules support check_mode, privilege escalation, and idempotent
operation. OS detection handles RHEL-family and Debian-family distributions.

Closes #487, Closes #490, Closes #491, Closes #470, Closes #467

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace Slurm stubs with full implementations:

- slurm_config: role-based installation (controller/compute/dbd), user
  creation, package management, config file templating (slurm.conf,
  cgroup.conf, gres.conf), service management with restart-on-change
- slurm_ops: cluster operations with idempotent state checks -
  reconfigure, drain (with reason), resume, update_partition via scontrol

Both modules support check_mode, privilege escalation, and OS detection.

Closes #473, Closes #476

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace Lmod and MPI stubs with full implementations:

- lmod: Lmod installation, module path creation, profile.d script
- mpi_config: OpenMPI/Intel MPI installation, MCA params, Lmod modulefiles
- hpc_toolchain (new): package set installer for build_essentials,
  perf_tools, debug_tools, rdma_userland with OS-specific mappings

All modules support check_mode and idempotent operation.

Closes #482, Closes #485, Closes #488

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b41a542ed

Comment on lines +241 to +244
run_cmd_ok(
connection,
&format!("cp '{}' /etc/munge/munge.key", source),
context,

P1: Transfer key_source from controller before remote copy

key_source is documented as a control-node path, but this branch executes cp on the remote host via run_cmd_ok, so the source file is looked up on the target machine instead of the controller. In normal SSH usage that file does not exist remotely, causing key distribution to fail whenever key_source is used; this path needs an upload step (Connection::upload/upload_content) before writing /etc/munge/munge.key.
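
The upload-then-install flow suggested above could look roughly like the following sketch. The `RemoteHost` trait and `RecordingHost` test double are hypothetical stand-ins invented for illustration; the project's actual `Connection::upload` signature is not shown in this review and may differ. The ownership and mode (`munge:munge`, `0400`) follow the usual MUNGE key requirements.

```rust
use std::io;

/// Hypothetical stand-in for the project's connection type; the real
/// `Connection::upload` / `upload_content` signatures may differ.
trait RemoteHost {
    /// Copy a file from the control node to `remote_path` on the target.
    fn upload(&mut self, local_path: &str, remote_path: &str) -> io::Result<()>;
    /// Run a shell command on the target host.
    fn run(&mut self, cmd: &str) -> io::Result<()>;
}

const MUNGE_KEY: &str = "/etc/munge/munge.key";

/// Transfer the key from the controller first, then fix ownership and mode
/// on the target. No remote `cp` of a control-node path is involved.
fn distribute_munge_key<H: RemoteHost>(host: &mut H, key_source: &str) -> io::Result<()> {
    host.upload(key_source, MUNGE_KEY)?;
    host.run(&format!(
        "chown munge:munge {0} && chmod 0400 {0}",
        MUNGE_KEY
    ))
}

/// Test double that records operations instead of touching a real host.
#[derive(Default)]
struct RecordingHost {
    log: Vec<String>,
}

impl RemoteHost for RecordingHost {
    fn upload(&mut self, local_path: &str, remote_path: &str) -> io::Result<()> {
        self.log.push(format!("upload {} -> {}", local_path, remote_path));
        Ok(())
    }

    fn run(&mut self, cmd: &str) -> io::Result<()> {
        self.log.push(format!("run {}", cmd));
        Ok(())
    }
}
```

Calling `distribute_munge_key` with a `RecordingHost` records one upload followed by one remote command, which makes the ordering easy to assert in a unit test without an SSH target.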

Comment on lines +205 to +208
let first_pkg = packages.split_whitespace().next().unwrap_or("");
let check_cmd = match os_family {
"rhel" => format!("rpm -q {} >/dev/null 2>&1", first_pkg),
_ => format!("dpkg -s {} >/dev/null 2>&1", first_pkg),

P1: Check full Slurm package set before skipping install

The install guard only queries first_pkg, so the module treats the role as installed when that one package exists even if the rest of the required set is missing. For example, on controller nodes slurm-slurmctld can be present while slurm/slurm-perlapi is absent, and this code will skip installation and leave a broken role configuration. The check should validate the entire packages list.
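
One way to validate the entire list, sketched under the assumption that `packages` stays a whitespace-separated string as in the excerpt: build one query per package and chain them with `&&`, so the guard succeeds only when every package is present. The helper names `sh_quote` and `all_installed_cmd` are hypothetical and not taken from the PR.

```rust
/// Single-quote a value for POSIX sh, escaping embedded single quotes.
fn sh_quote(s: &str) -> String {
    format!("'{}'", s.replace('\'', "'\\''"))
}

/// Build a shell command that exits 0 only if every listed package is
/// installed. Returns `None` for an empty list so the caller never runs
/// a bare `rpm -q` or `dpkg -s` with no argument.
fn all_installed_cmd(os_family: &str, packages: &str) -> Option<String> {
    let query = match os_family {
        "rhel" => "rpm -q",
        _ => "dpkg -s",
    };
    let checks: Vec<String> = packages
        .split_whitespace()
        .map(|pkg| format!("{} {} >/dev/null 2>&1", query, sh_quote(pkg)))
        .collect();
    if checks.is_empty() {
        None
    } else {
        // `&&` short-circuits: the first missing package fails the guard.
        Some(checks.join(" && "))
    }
}
```

Returning `Option` forces the caller to decide what an empty package list means, instead of silently querying an empty package name as the `unwrap_or("")` in the excerpt would.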

Comment on lines +366 to +369
let (in_fstab, _, _) = run_cmd(
connection,
&format!("grep -qF '{}:{}' /etc/fstab", server, export_path),
context,

P2: Match full fstab line before treating NFS mount as present

This presence test only searches for server:export, so it misses drift in mount point or mount options. If an existing /etc/fstab line uses the same export with a different mount point, the module skips adding the desired entry and then calls mount '<mount_point>', which fails because that mount point has no matching fstab record. Compare/update the exact expected fstab entry instead of only the export prefix.
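
A minimal sketch of an exact-entry comparison, assuming the module can read the contents of /etc/fstab into a string. The `FstabState` enum and `fstab_state` function are hypothetical names. Entries are keyed by mount point because that is what `mount '<mount_point>'` looks up, and options are compared as a literal string, so reordered options count as drift.

```rust
#[derive(Debug, PartialEq)]
enum FstabState {
    /// An entry with the expected device, mount point, type and options exists.
    Present,
    /// The mount point has an entry, but device, type or options differ.
    Drifted,
    /// No entry exists for the mount point.
    Absent,
}

/// Classify the desired NFS entry against the current fstab contents.
fn fstab_state(
    fstab: &str,
    device: &str,
    mount_point: &str,
    fstype: &str,
    options: &str,
) -> FstabState {
    let mut state = FstabState::Absent;
    for line in fstab.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip blanks and comments
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() < 4 || fields[1] != mount_point {
            continue; // malformed line or a different mount point
        }
        if fields[0] == device && fields[2] == fstype && fields[3] == options {
            return FstabState::Present;
        }
        state = FstabState::Drifted;
    }
    state
}
```

With this classification, the case described above (same export, different mount point) comes back as `Absent` for the desired mount point, so the module would add the entry before calling `mount`, while `Drifted` signals that an existing line needs to be updated rather than duplicated.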

@adolago merged commit 19a15da into main on Feb 11, 2026
20 of 23 checks passed
@adolago deleted the hpc/software-stack branch on February 11, 2026 14:03

Development

Successfully merging this pull request may close these issues.

  • HPC toolchain package sets
  • MPI configuration modules
  • Lmod / Environment Modules support