feat(hpc): implement software environment modules #512
…k, facts)

Replace HPC baseline stub with full implementation supporting limits, sysctl, packages, directories, and tuned profiles.

Add four new modules:
- munge: MUNGE authentication service management with key distribution
- nfs: NFS server (exports management) and client (fstab/mount) modules
- healthcheck: configurable node health checks (munge, NFS, services, GPU, IB)
- facts: HPC fact gathering (CPU features, NUMA, hugepages, GPU, InfiniBand)

All modules support check_mode, privilege escalation, and idempotent operation. OS detection handles RHEL-family and Debian-family distributions.

Closes #487, Closes #490, Closes #491, Closes #470, Closes #467

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace Slurm stubs with full implementations:
- slurm_config: role-based installation (controller/compute/dbd), user creation, package management, config file templating (slurm.conf, cgroup.conf, gres.conf), service management with restart-on-change
- slurm_ops: cluster operations with idempotent state checks: reconfigure, drain (with reason), resume, update_partition via scontrol

Both modules support check_mode, privilege escalation, and OS detection.

Closes #473, Closes #476

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace Lmod and MPI stubs with full implementations:
- lmod: Lmod installation, module path creation, profile.d script
- mpi_config: OpenMPI/Intel MPI installation, MCA params, Lmod modulefiles
- hpc_toolchain (new): package set installer for build_essentials, perf_tools, debug_tools, rdma_userland with OS-specific mappings

All modules support check_mode and idempotent operation.

Closes #482, Closes #485, Closes #488

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b41a542ed
    run_cmd_ok(
        connection,
        &format!("cp '{}' /etc/munge/munge.key", source),
        context,
Transfer key_source from controller before remote copy
key_source is documented as a control-node path, but this branch executes cp on the remote host via run_cmd_ok, so the source file is looked up on the target machine instead of the controller. In normal SSH usage that file does not exist remotely, causing key distribution to fail whenever key_source is used; this path needs an upload step (Connection::upload/upload_content) before writing /etc/munge/munge.key.
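A minimal sketch of the upload-first flow, under stated assumptions: the `Transport` trait and `RecordingTransport` type are hypothetical stand-ins for the project's `Connection`, whose real `upload`/`upload_content` signatures are not shown in this PR and may differ. The `munge:munge` owner and `0400` mode are also assumptions, not values taken from the module.

```rust
/// Hypothetical stand-in for the project's connection type. The real
/// `Connection::upload` signature is not shown in this PR and may differ.
pub trait Transport {
    /// Copy a file from the control node to `remote_path` on the target.
    fn upload(&mut self, local_path: &str, remote_path: &str) -> Result<(), String>;
    /// Run a shell command on the target host.
    fn run(&mut self, cmd: &str) -> Result<(), String>;
}

pub const MUNGE_KEY_PATH: &str = "/etc/munge/munge.key";

/// Distribute the key by uploading it from the controller, then restricting
/// ownership and permissions on the target (assumed: munge:munge, 0400).
pub fn distribute_munge_key<T: Transport>(conn: &mut T, key_source: &str) -> Result<(), String> {
    // `key_source` is resolved on the control node, not on the target.
    conn.upload(key_source, MUNGE_KEY_PATH)?;
    conn.run(&format!(
        "chown munge:munge '{p}' && chmod 0400 '{p}'",
        p = MUNGE_KEY_PATH
    ))?;
    Ok(())
}

/// In-memory transport that records calls, used only to exercise the sketch.
#[derive(Default)]
pub struct RecordingTransport {
    pub uploads: Vec<(String, String)>,
    pub commands: Vec<String>,
}

impl Transport for RecordingTransport {
    fn upload(&mut self, local_path: &str, remote_path: &str) -> Result<(), String> {
        self.uploads.push((local_path.to_string(), remote_path.to_string()));
        Ok(())
    }

    fn run(&mut self, cmd: &str) -> Result<(), String> {
        self.commands.push(cmd.to_string());
        Ok(())
    }
}
```

The point of the shape is that no remote `cp` ever sees the controller-side path; the only remote command operates on the destination file.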
    let first_pkg = packages.split_whitespace().next().unwrap_or("");
    let check_cmd = match os_family {
        "rhel" => format!("rpm -q {} >/dev/null 2>&1", first_pkg),
        _ => format!("dpkg -s {} >/dev/null 2>&1", first_pkg),
Check full Slurm package set before skipping install
The install guard only queries first_pkg, so the module treats the role as installed when that one package exists even if the rest of the required set is missing. For example, on controller nodes slurm-slurmctld can be present while slurm/slurm-perlapi is absent, and this code will skip installation and leave a broken role configuration. The check should validate the entire packages list.
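One way the guard could cover the whole set is sketched below. The function name `all_installed_cmd` is hypothetical, and the sketch assumes package names are trusted (they are interpolated into the shell command unquoted, as in the existing code). It relies on `rpm -q` exiting non-zero when any queried package is missing, and chains one `dpkg -s` per package with `&&` on Debian-family systems.

```rust
/// Build a shell command that succeeds only if every package in the
/// whitespace-separated `packages` list is installed.
/// Returns `None` when the list is empty, so callers cannot mistake
/// "nothing to check" for "everything installed".
pub fn all_installed_cmd(os_family: &str, packages: &str) -> Option<String> {
    let pkgs: Vec<&str> = packages.split_whitespace().collect();
    if pkgs.is_empty() {
        return None;
    }
    let cmd = match os_family {
        // A single `rpm -q` call fails if any listed package is absent.
        "rhel" => format!("rpm -q {} >/dev/null 2>&1", pkgs.join(" ")),
        // One `dpkg -s` per package, short-circuiting on the first miss.
        _ => pkgs
            .iter()
            .map(|p| format!("dpkg -s {} >/dev/null 2>&1", p))
            .collect::<Vec<_>>()
            .join(" && "),
    };
    Some(cmd)
}
```

With this shape, a controller that has `slurm-slurmctld` but lacks `slurm` would fail the check and fall through to installation.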
    let (in_fstab, _, _) = run_cmd(
        connection,
        &format!("grep -qF '{}:{}' /etc/fstab", server, export_path),
        context,
Match full fstab line before treating NFS mount as present
This presence test only searches for server:export, so it misses drift in mount point or mount options. If an existing /etc/fstab line uses the same export with a different mount point, the module skips adding the desired entry and then calls mount '<mount_point>', which fails because that mount point has no matching fstab record. Compare/update the exact expected fstab entry instead of only the export prefix.
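A possible shape for the stricter comparison, as a sketch only: it parses the fstab content (fetched however the module prefers) and keys on the mount point, reporting whether the expected entry is present, drifted, or absent. The `FstabState` enum and `nfs_fstab_state` function are hypothetical names, and the option comparison is a plain string match, so reordered but equivalent options would count as drift.

```rust
/// Result of comparing /etc/fstab against the desired NFS entry.
#[derive(Debug, PartialEq)]
pub enum FstabState {
    /// An entry for the mount point matches device, type, and options.
    Present,
    /// An entry for the mount point exists but differs from the desired one.
    Drifted,
    /// No entry uses the mount point.
    Absent,
}

/// Inspect fstab content for the desired NFS mount, keyed by mount point.
pub fn nfs_fstab_state(
    fstab: &str,
    server: &str,
    export_path: &str,
    mount_point: &str,
    options: &str,
) -> FstabState {
    let device = format!("{}:{}", server, export_path);
    let mut state = FstabState::Absent;
    for line in fstab.lines() {
        let line = line.trim();
        // Skip blanks and comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        // Need at least device, mount point, fs type, and options.
        if fields.len() < 4 || fields[1] != mount_point {
            continue;
        }
        if fields[0] == device && fields[2].starts_with("nfs") && fields[3] == options {
            return FstabState::Present;
        }
        state = FstabState::Drifted;
    }
    state
}
```

An export that is already mounted at some other mount point yields `Absent` for the desired one, so the module would add the missing entry rather than skip it; `Drifted` signals that an existing line should be rewritten instead of appended to.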
Summary
- Replace the lmod stub with full Lmod installation, module path management, and profile.d integration
- Replace the mpi_config stub with OpenMPI/Intel MPI installation, MCA parameter config, and optional Lmod modulefiles
- Add the hpc_toolchain module for OS-specific package set installation (build_essentials, perf_tools, debug_tools, rdma_userland)

Closes #482, Closes #485, Closes #488
Test plan
- cargo build --features hpc,slurm,gpu,ofed,parallel_fs compiles cleanly
- hpc_toolchain module registered in ModuleRegistry
- check_mode paths return correctly for all three modules

🤖 Generated with Claude Code