Conversation

kbendl
Contributor

@kbendl kbendl commented May 5, 2025

Allows Features, etc., to be added to partitions.
Also adds # comment lines for readability, highlighting the partitions/groups sections.
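For illustration, a minimal sketch of the intended usage, assuming the group.node_params structure named in the PR title (the a100 group name and the Features value are hypothetical; a partition without explicit groups is treated as a single group):

# hypothetical partition entry using the new node_params key
openhpc_slurm_partitions:
  - name: a100
    node_params:
      Features: "gpu,a100"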

@kbendl kbendl requested a review from a team as a code owner May 5, 2025 18:43
@sjpb
Collaborator

sjpb commented May 7, 2025

@jovial this was my suggestion for adding arbitrary node parameters. But having seen your #181, should we actually be adding node parameters at the group_vars level, rather than in partitions, to avoid the same clash issue with nodes in multiple partitions?
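For context, a sketch of the clash in question, assuming the role's partition/groups structure and partition-level node parameters (all names and values hypothetical):

# The "a100" group appears in two partitions with different node
# parameters, giving conflicting values for the same nodes.
openhpc_slurm_partitions:
  - name: gpu
    groups:
      - name: a100
        node_params:
          Features: "gpu,a100"
  - name: debug
    groups:
      - name: a100
        node_params:
          Features: "gpu"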

@jovial
Contributor

jovial commented May 7, 2025

> @jovial this was my suggestion for adding arbitrary node parameters. But having seen your #181, should we actually be adding node parameters at the group_vars level, rather than in partitions, to avoid the same clash issue with nodes in multiple partitions?

I personally think that using group and host vars would be a more natural way of setting per-host configuration options. I am now wondering whether that is consistent with what we currently do with gres, which is a property on the NodeName line but is defined as part of the partition definition. To switch to the host/group vars approach, I think we'd need to support something like:

# group_vars/inventory/a100/openhpc

# falls back to the value defined on the partition if this isn't defined, for backwards compatibility
openhpc_gres:
  - conf: "gpu:A100:2"
    file: "/dev/nvidia[0-1]"

Are there any other inconsistencies like this?

So you might argue that this PR is consistent with the current approach and that I should change mine to work similarly. As you commented on my PR, I could make mine error out if an inconsistency was detected. On the other hand, if we think that using the Ansible inventory is the better approach, this could be a good opportunity to change course before too many properties accumulate in the partition definition.

If using host_vars/group_vars, it could look something like this to the end user:

# group_vars/inventory/a100/openhpc
openhpc_node_params:
  Features: "gpu,a100"
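Since Gres is likewise a property on the NodeName line, the same mechanism could presumably absorb it; a hypothetical extension of the above (the gres.conf File= side would still need separate handling):

# group_vars/inventory/a100/openhpc (hypothetical)
openhpc_node_params:
  Features: "gpu,a100"
  Gres: "gpu:A100:2"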

@sjpb
Collaborator

sjpb commented May 7, 2025

Using inventory does seem more "correct", TBH. I think the complication is that the inventory groups referenced in openhpc_slurm_partitions actually have an openhpc_cluster_name + '_' prefix, which is annoying in the appliance, where openhpc_cluster_name is usually used to distinguish staging from production. Obviously per-node vars could be defined for inventory groups which don't have those prefixes, but currently Terraform doesn't auto-define those groups (I was wondering if it should, for another client where we need to set vars for OS-level things on the Terraform group anyway).
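To make the prefix issue concrete, a hypothetical example: with openhpc_cluster_name: prod, a group named a100 in openhpc_slurm_partitions corresponds to the inventory group prod_a100, so per-node vars would have to live under that prefixed name:

# group_vars/inventory/prod_a100/openhpc (hypothetical; note the cluster-name prefix)
openhpc_node_params:
  Features: "gpu,a100"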

@sjpb
Collaborator

sjpb commented May 7, 2025

I don't think I mind removing gres from the partition config being backwards-incompatible, at least (although I do generally try hard to avoid that). Between the people in this thread we have awareness of all the appliances using it.

@sjpb sjpb changed the base branch from feat/no-ohpc to feat/no-ohpc-node_params May 9, 2025 09:29
@sjpb sjpb merged commit 64ed86c into stackhpc:feat/no-ohpc-node_params May 9, 2025
4 of 34 checks passed
@sjpb
Collaborator

sjpb commented May 9, 2025

Merging so I can fix CI

sjpb added a commit that referenced this pull request May 9, 2025
* Add group.node_params to partitions/groups. (#182)

Allows Features, etc., to be added to partitions.

* update SelectType from legacy to current default (#167)

---------

Co-authored-by: Kurt Bendl <kbendl@nrel.gov>
sjpb added a commit that referenced this pull request Sep 9, 2025
* remove drain and resume functionality

* allow install and runtime taskbooks to be used directly

* fix linter complaints

* fix slurmctld state

* move common tasks to pre.yml

* remove unused openhpc_slurm_service

* fix ini_file use for some community.general versions

* fix var precedence in molecule test13

* fix var precedence in all molecule tests

* fix slurmd always starting on control node

* move install to install-ohpc.yml

* remove unused ohpc_slurm_services var

* add install-generic for binary-only install

* distinguish between system and user slurm binaries for generic install

* remove support for CentOS7 / OpenHPC

* remove post-configure, not needed as of slurm v20.02

* add openmpi/IMB-MPI1 by default for generic install

* allow removal of slurm.conf options

* update README

* enable openhpc_extra_repos for both generic and ohpc installs

* README tweak

* add openhpc_config_files parameter

* change library_dir to lib_dir

* fix perms

* fix/silence linter warnings

* remove packages only required for hpctests

* document openhpc_config_files restart behaviour

* bugfix missing newline in slurm.conf

* make path for slurm.conf configurable

* make slurm.conf template src configurable

* symlink slurm user tools so monitoring works

* fix slurm directories

* fix slurmdbd path for non-default slurm.conf paths

* default gres.conf to correct directory

* document <absent> for openhpc_config

* minor merge diff fixes

* Fix EPEL not getting installed

* build RL9.3 container images with systemd

* allow use on image containing slurm binaries

* prepend slurm binaries to PATH instead of symlinking

* ensure cgroup.conf is always next to slurm.conf and allow overriding template

* Add group.node_params to partitions/groups. (#182) (#185)

* Add group.node_params to partitions/groups. (#182)

Allows Features, etc., to be added to partitions.

* update SelectType from legacy to current default (#167)

---------

Co-authored-by: Kurt Bendl <kbendl@nrel.gov>

* update readme

* fixup mode parameters

* tidy slurmd restart line

---------

Co-authored-by: Kurt Bendl <kbendl@nrel.gov>