(3.3.0 - 3.5.1) Cluster updates can break Slurm accounting functionality
In ParallelCluster versions 3.3.0 through 3.5.1, if the Slurm accounting feature is enabled, a cluster update that changes any setting under the Scheduling section breaks the Slurm accounting configuration on the system.
Slurm accounting keeps running until slurmdbd is restarted or, more generally, the head node is rebooted.
- ParallelCluster 3.3.0 - 3.5.1, if Slurm Accounting functionality is enabled.
- All OSes are affected.
Any Slurm command that queries the Slurm database will fail with a Connection refused error:
$ sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:ip-10-0-0-213:6819: Connection refused
The slurmdbd service will also be in a failed state.
To confirm that a cluster update caused the problem, check whether the StoragePass parameter in the slurm_parallelcluster_slurmdbd.conf file holds the default dummy value:
$ sudo grep StoragePass /opt/slurm/etc/slurm_parallelcluster_slurmdbd.conf
StoragePass=dummy
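The same check can be wrapped in a small helper so it can be reused in a script. This is only a sketch: the function name has_dummy_storage_pass and the CONF_DEFAULT variable are made up for illustration, and it assumes the placeholder line looks exactly like the output shown above. On the head node it has to run with enough privileges to read the file, for example from a root shell.

```shell
#!/usr/bin/env bash
# Sketch: detect the placeholder password left behind by a cluster update.
# has_dummy_storage_pass and CONF_DEFAULT are illustrative names, not part
# of ParallelCluster.

CONF_DEFAULT="/opt/slurm/etc/slurm_parallelcluster_slurmdbd.conf"

# Returns 0 if the file contains the line "StoragePass=dummy" (trailing
# whitespace tolerated); returns non-zero otherwise, including when the
# file cannot be read.
has_dummy_storage_pass() {
    local conf="${1:-$CONF_DEFAULT}"
    grep -Eq '^StoragePass=dummy[[:space:]]*$' "$conf"
}
```

Calling has_dummy_storage_pass with no argument checks the default path; passing a path checks a copy of the file instead.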
You can restore the original password by running the following command from the head node:
sudo /opt/parallelcluster/scripts/slurm/update_slurm_database_password.sh
Check the slurmdbd service status on the head node:
sudo systemctl status slurmdbd
Restart the service only if it is in a failed state:
sudo systemctl restart slurmdbd
Verify that the Slurm accounting functionality is back:
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
...
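The mitigation steps above could be combined into one script along these lines. Treat it as a minimal sketch rather than an official tool: the paths and commands are the ones given on this page, while the function names and the RUN variable (which allows a dry run) are assumptions added for illustration. It uses systemctl is-failed to decide whether a restart is needed, instead of reading the status output by eye.

```shell
#!/usr/bin/env bash
# Sketch of the mitigation as reusable functions. The function names and
# the RUN override are illustrative assumptions, not part of ParallelCluster.

# Prefix used for privileged commands. Set RUN=echo to print the commands
# instead of executing them.
RUN="${RUN:-sudo}"

# Restore the original database password in the slurmdbd configuration.
restore_slurmdbd_password() {
    $RUN /opt/parallelcluster/scripts/slurm/update_slurm_database_password.sh
}

# Restart slurmdbd only when systemd reports the unit as failed.
restart_slurmdbd_if_failed() {
    if systemctl is-failed --quiet slurmdbd; then
        $RUN systemctl restart slurmdbd
    fi
}

# Run both steps in order; stop if the password restore fails.
mitigate_slurm_accounting() {
    restore_slurmdbd_password && restart_slurmdbd_if_failed
}
```

After sourcing the file on the head node, running mitigate_slurm_accounting applies both steps, and sacct can then be used to confirm that accounting queries work again.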
You can find more details about Slurm accounting in ParallelCluster in the official documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/slurm-accounting-v3.html#slurm-accounting-considerations-v3