Skip to content

[Subnet Prioritization] Support capacity-optimized-prioritized and prioritized Allocation Strategy #6865

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 23 commits into
base: develop
Choose a base branch
from

Conversation

Allenz5
Copy link

@Allenz5 Allenz5 commented Jun 9, 2025

Description of changes

  • Extend AllocationStrategy to include capacity-optimized-prioritized and prioritized options
  • Add EnableSingleAvailabilityZone as a new configuration parameter
  • Extend instances_validators.py::InstancesAllocationStrategyValidator to verify that the new AllocationStrategy works correctly with the specified CapacityType
  • Add and register a new validator to verify that EnableSingleAvailabilityZone works with capacity-optimized-prioritized and prioritized Allocation Strategy
  • Move AllocationStrategy Enum from config_cluster.py to common.py
  • Modify slurm.full_config.snapshot.yaml to include the new EnableSingleAvailabilityZone configuration parameter

Tests

  • Extend test_instances_validators.py::test_instances_allocation_strategy_validator to include new test cases
  • Extend test_all_validators.py::test_slurm_validators_are_called_with_correct_argument to verify that EnableSingleAvailabilityZoneValidator is properly registered
  • Add new test_networking_validators.py::test_enable_single_availability_zone_validator test module

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Allenz5 and others added 13 commits June 3, 2025 14:56
…d to AllocationStrategy configuration; Add new configuration of enable_single_availability_zone;

Signed-off-by: Hanxuan Zhang <hanxuanz@amazon.com>
…nfig.cluster_config to pcluster.config.common
…ster_config.py to common.py"

This reverts commit 13d8cfa.
…dator in cluster_config.py and add registration tests
@Allenz5 Allenz5 requested review from a team as code owners June 9, 2025 20:37
@@ -3179,6 +3191,11 @@ def _register_validators(self, context: ValidatorContext = None): # noqa: C901
]
for validator in flexible_instance_types_validators:
self._register_validator(validator, **validator_args)
self._register_validator(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add the EnableSingleAvailabilityZoneValidator in a loop for each compute resource when we define the Parameters to be queue Specific and do not take any values from those said Compute resources?

@@ -752,6 +751,7 @@ class SlurmQueueNetworkingSchema(QueueNetworkingSchema):
PlacementGroupSchema, metadata={"update_policy": UpdatePolicy.MANAGED_PLACEMENT_GROUP}
)
proxy = fields.Nested(QueueProxySchema, metadata={"update_policy": UpdatePolicy.QUEUE_UPDATE_STRATEGY})
enable_single_availability_zone = fields.Bool(metadata={"update_policy": UpdatePolicy.QUEUE_UPDATE_STRATEGY})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we test the use of keeping this UpdatePolicy.QUEUE_UPDATE_STRATEGY for this paramter?

self._add_failure(
f"Compute Resource {compute_resource_name} is using a SPOT CapacityType but the "
f"Allocation Strategy specified is {allocation_strategy.value}. SPOT CapacityType cannot use "
f"'{common.AllocationStrategy.PRIORITIZED.value}' allocation strategy.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! However, lets follow the messaging to specify which AllocationStrategy SPOT Capacity type supports as the last line. Similar to ONDemand

@@ -219,6 +219,7 @@ Scheduling:
Networking:
AdditionalSecurityGroups: null
AssignPublicIp: null
EnableSingleAvailabilityZone: null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep one of them as True?

@@ -224,6 +224,9 @@ def test_slurm_validators_are_called_with_correct_argument(test_datadir, mocker)
single_instance_type_subnet_validator = mocker.patch(
networking_validators + ".SingleInstanceTypeSubnetValidator._validate", return_value=[]
)
enable_single_availability_zone_validator = mocker.patch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lets test both for true and false (values) for enable_single_availability_zone and different AllocationStrategy

@himani2411
Copy link
Contributor

I would suggest to add a title to the PR too

@Allenz5 Allenz5 changed the title [Subnet Prioritization] [Subnet Prioritization] Support capacity-optimized-prioritized and prioritized Allocation Strategy Jun 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants