Bug description
If a cluster is created with a custom Slurm configuration that places static compute nodes in more than one partition, ParallelCluster attempts to launch one EC2 instance for each partition a node belongs to, instead of one instance per node. This causes overscaling, and nodes are then terminated because multiple instances back a single node.
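To check whether a given Slurm configuration falls into this case, a small script can list the static nodes that appear in more than one partition. The following is a minimal sketch, not part of ParallelCluster: the function names `expand_hostlist` and `nodes_in_multiple_partitions` are hypothetical, the hostlist expansion handles only simple `name-[1-3,5]` expressions (the authoritative expansion is `scontrol show hostnames`), `Nodes=ALL` is not resolved, and it assumes static nodes can be recognized by `-st-` in their names.

```python
import re
from collections import defaultdict


def expand_hostlist(expr):
    """Expand a simplified Slurm hostlist such as 'q1-st-c5-[1-3,5],q2-st-c5-1'.

    Only one bracket group per name is handled; this is not a full parser.
    """
    names = []
    # Split on commas that are not inside a bracket group.
    for part in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not match:
            names.append(part)
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo
            for i in range(int(lo), int(hi) + 1):
                # Preserve zero padding such as [01-03].
                names.append(f"{prefix}{str(i).zfill(len(lo))}{suffix}")
    return names


def nodes_in_multiple_partitions(conf_text, static_marker="-st-"):
    """Return {node: [partitions]} for static nodes listed in 2+ partitions."""
    membership = defaultdict(set)
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line.lower().startswith("partitionname="):
            continue
        # Slurm keys are case-insensitive, so normalize them.
        fields = {}
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                fields[key.lower()] = value
        partition = fields["partitionname"]
        nodes = fields.get("nodes", "")
        # Skip default stanzas and values this sketch cannot expand.
        if partition.upper() == "DEFAULT" or not nodes or nodes.upper() == "ALL":
            continue
        for node in expand_hostlist(nodes):
            if static_marker in node:  # assumption: static nodes contain "-st-"
                membership[node].add(partition)
    return {n: sorted(p) for n, p in membership.items() if len(p) > 1}


# Illustrative configuration (made-up queue and partition names).
SAMPLE_CONF = """
PartitionName=queue1 Nodes=queue1-st-c5xlarge-[1-2],queue1-dy-c5xlarge-[1-4] MaxTime=INFINITE
PartitionName=custom Nodes=queue1-st-c5xlarge-[2-3]
"""

for node, partitions in nodes_in_multiple_partitions(SAMPLE_CONF).items():
    print(f"{node} is in partitions: {', '.join(partitions)}")
```

Any node reported by such a check belongs to several partitions and would therefore be a candidate for the overscaling described above on an affected version.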
Affected versions (OSes, schedulers)
- ParallelCluster versions >= 3.0.0 and <= 3.6.0 on all OSes.
- Only the Slurm scheduler is affected.
Mitigation
For a detailed explanation of the problem and the steps to mitigate it, see "(3.0.0-3.6.0) Compute Nodes Belonging To More Than One Partition Causes Compute To Overscale".