
Have same slurm.conf among nodes and controller #1182

Merged · 3 commits · Jun 17, 2022

Conversation

@seyong-um (Contributor) commented Jun 15, 2022

Share slurm.conf so that all nodes and the controller use the same file when configuring nodes and/or partitions.
Place slurm.conf at /sw/.slurm/slurm.conf on the controller, where /sw is an NFS mount point for all nodes.
Make /etc/slurm/slurm.conf a soft link to it on the nodes and the controller.

By Hyundai Motor Group / AIRS Company
Signed-off-by: Seyong Um seyong.um@hyundai.com
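The layout described above can be sketched as follows. This is a hedged illustration using a scratch directory instead of the real /etc and /sw paths so it can be run safely anywhere; the `ClusterName=demo` content and the `$root` prefix are assumptions for the demo, not part of the PR:

```shell
# Simulate the shared-slurm.conf scheme in a throwaway directory.
root=$(mktemp -d)
mkdir -p "$root/sw/.slurm" "$root/etc/slurm"

# The shared copy lives on the controller's NFS export (/sw in the PR).
echo "ClusterName=demo" > "$root/sw/.slurm/slurm.conf"

# On the controller and every node, the local path becomes a soft link
# to the shared copy (-n avoids descending into an existing link target).
ln -sfn "$root/sw/.slurm/slurm.conf" "$root/etc/slurm/slurm.conf"

readlink "$root/etc/slurm/slurm.conf"   # points at the shared file
cat "$root/etc/slurm/slurm.conf"        # same content everywhere
```

Because every host resolves /etc/slurm/slurm.conf through the link, editing the single shared file updates the whole cluster at once.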

@ajdecon (Collaborator) left a comment

@seyong-um : This functionality looks useful to support, and I know this is a common pattern in a lot of Slurm clusters.

However, I don't think we want to enable this by default, as we have several known use cases where the Slurm controller and the compute nodes don't share an NFS filesystem.

Would you be willing to put this behind a flag variable? I.e., add a boolean variable like slurm_conf_symlink which defaults to false, and only set up the symlink if it's set to true. Otherwise, use the existing logic.
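The suggested flag could look something like the following in an Ansible role. This is only a sketch: the variable name `slurm_conf_symlink` comes from the review comment and the paths from the PR description, but the task names and file layout are illustrative assumptions, not the merged implementation.

```yaml
# defaults/main.yml (illustrative)
slurm_conf_symlink: false

# tasks/main.yml (illustrative)
- name: Symlink slurm.conf to the NFS-shared copy
  ansible.builtin.file:
    src: /sw/.slurm/slurm.conf
    dest: /etc/slurm/slurm.conf
    state: link
    force: yes
  when: slurm_conf_symlink | bool
```

With the default of `false`, the `when:` condition skips the task entirely and the existing copy-based logic applies unchanged.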

The flag slurm_conf_symlink will allow nodes to share slurm.conf via NFS.

Signed-off-by: Seyong Um <seyong.um@hyundai.com>
@seyong-um (Contributor, Author) commented

I added the share-slurm-conf flag as you suggested. The default value is false, which keeps the existing logic; once it is set to true, the new logic runs.

@seyong-um seyong-um requested a review from ajdecon June 16, 2022 15:05
@ajdecon (Collaborator) left a comment

Our CI tests are currently having issues, but manual testing of the cluster setup process looks good from my end.

Added a couple of inline comments with additional changes. Once those are done I can merge.

config.example/group_vars/slurm-cluster.yml — resolved
roles/pyxis/defaults/main.yml — outdated, resolved
Signed-off-by: Seyong Um <seyong.um@hyundai.com>
@seyong-um seyong-um requested a review from ajdecon June 16, 2022 23:16
@ajdecon (Collaborator) left a comment

Thank you! All code review comments are addressed, all CI tests are passing, and manual tests work as well. Looks good to me!

@ajdecon ajdecon merged commit 36cd63d into NVIDIA:master Jun 17, 2022
@dholt dholt mentioned this pull request Aug 23, 2022