(3.3.0-3.5.0) Updating a cluster to remove shared EBS volumes can cause node launch failures
After a cluster update that removes shared EBS volumes from the SharedStorage section of the configuration, new nodes joining the cluster during scale-up operations may fail to launch.
In particular, the error occurs if the removed EBS volume had a MountDir that is a sub-path of another path shared in the cluster. For example, removing a custom shared storage with a mount directory of /shared, /slurm, /parallelcluster/shared or /intel would corrupt the ParallelCluster NFS export paths on the head node, causing the mount operations on the compute nodes to fail.
More generally, the issue happens whenever you remove a shared EBS volume with mount path /path_1 while another system or custom EBS volume has an export path of /path_2/path_1.
This is caused by a bug in the code that unmounts shared EBS volumes: it deletes lines from /etc/exports that match the regular expression /<MountDir> without anchoring it to the start of the line as ^/<MountDir>. For example, suppose you have multiple shared EBS volumes whose MountDir values contain one another, such as path_1 and path_2/path_1. Both paths (/path_1 and /path_2/path_1) are written to /etc/exports. If you update the cluster to remove the shared EBS volume with path_1, the regular expression matches the line for /path_1 and also, unintentionally, the line for /path_2/path_1, so both are removed. Further cluster operations such as launching nodes or submitting jobs may then fail because required software or paths are no longer exported over NFS.
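The effect of the missing anchor can be reproduced on a throwaway file. The following is a minimal sketch, not the actual ParallelCluster unmount code: it assumes the deletion behaves like a line-oriented regex match, uses sed as a stand-in, and the variable names are hypothetical.

```shell
# Throwaway copy: this sketch never touches the real /etc/exports.
exports_file="$(mktemp)"
printf '%s\n' \
  '/path_1 192.168.0.0/17(rw,sync,no_root_squash)' \
  '/path_2/path_1 192.168.0.0/17(rw,sync,no_root_squash)' > "$exports_file"

# Unanchored: "/path_1" is also found inside "/path_2/path_1", so both lines are deleted.
unanchored_result="$(sed '\|/path_1|d' "$exports_file")"

# Anchored: only a line that starts with "/path_1" followed by whitespace is deleted.
anchored_result="$(sed '\|^/path_1[[:space:]]|d' "$exports_file")"

printf 'unanchored keeps: %s\n' "${unanchored_result:-<nothing>}"
printf 'anchored keeps:   %s\n' "$anchored_result"
rm -f "$exports_file"
```

The trailing whitespace class in the anchored pattern is an extra precaution added for this sketch so that /path_1 does not also match a directory such as /path_10.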
- ParallelCluster 3.3.0 - 3.5.0
After updating the cluster to remove shared EBS volumes, check /etc/exports on the head node for missing lines. Make sure the file contains the following lines, plus one export for each shared EBS volume still defined in the cluster configuration:
/home 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)
/opt/parallelcluster/shared 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)
/opt/intel 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)
/opt/slurm 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)
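The check above can be scripted. This is a minimal sketch under stated assumptions: check_exports is a hypothetical helper, not part of ParallelCluster, and it assumes mount directories contain no regex metacharacters.

```shell
# check_exports FILE MOUNT_DIR...: print every mount dir that has no entry in FILE.
check_exports() {
  exports_path="$1"
  shift
  for mount_dir in "$@"; do
    # Anchored match, so /shared is not satisfied by /opt/parallelcluster/shared.
    grep -Eq "^${mount_dir}[[:space:]]" "$exports_path" || echo "missing: $mount_dir"
  done
}

# Example on the head node (append your own shared EBS MountDir values):
# check_exports /etc/exports /home /opt/parallelcluster/shared /opt/intel /opt/slurm
```

No output means every listed directory still has an export entry.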
- If any are missing, stop the cluster compute fleet with:
pcluster update-compute-fleet --cluster-name <cluster-name> --status STOP_REQUESTED
- If any of these lines or the export for any shared EBS MountDir is missing, add it back to /etc/exports, then run the following command to re-export the NFS shares:
sudo exportfs -ra
- Start the cluster compute fleet with:
pcluster update-compute-fleet --cluster-name <cluster-name> --status START_REQUESTED
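The step of adding missing lines back to /etc/exports can be made idempotent, so that running it twice does not duplicate an entry. This is a rough sketch: restore_export is a hypothetical helper, and it assumes each export line starts with the directory followed by a space.

```shell
# restore_export FILE EXPORT_LINE: append EXPORT_LINE to FILE unless an entry
# for the same directory (the first field of the line) is already present.
restore_export() {
  exports_path="$1"
  export_line="$2"
  export_dir="${export_line%% *}"   # text before the first space, e.g. /opt/slurm
  if ! grep -Eq "^${export_dir}[[:space:]]" "$exports_path"; then
    printf '%s\n' "$export_line" >> "$exports_path"
  fi
}

# Example, run as root on the head node with the compute fleet stopped:
# restore_export /etc/exports '/opt/slurm 192.168.0.0/17(rw,sync,no_root_squash) 192.168.128.0/17(rw,sync,no_root_squash)'
# exportfs -ra
```

The CIDR ranges in the example are the ones shown in this page; use the ranges that appear on the surviving lines of your own /etc/exports.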