Add attachment limit scripts to hack/cluster-debugging-scripts #1857

Merged

AndrewSirenko

@AndrewSirenko AndrewSirenko commented Dec 8, 2023

Is this a bug fix or adding new feature?
Cluster debugging tool

What is this PR about? / Why do we need it?
NOTE: This PR is an expansion of #1852

These scripts help answer the question: How can I validate that the aws-ebs-csi-driver correctly makes use of all available attachment slots for instance type X?

This question matters because the aws-ebs-csi-driver currently hard-codes the attachment limit for each EC2 instance family and type. There is no EC2 API that reports the attachment limit of a given instance type, or whether non-EBS-volume attachments count towards that limit.

See README.md for overview of each script. At a high level:

  1. get-attachment-breakdown collects instance data from the EC2 API, then calls find-attachment-limit.
  2. find-attachment-limit generates and deploys pods with varying numbers of PVs, and finds the maximum number of volumes the aws-ebs-csi-driver can attach to the node.
  3. Those manifests are generated by generate_example_manifest.go from the template device_slot_test.tmpl.
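
To illustrate step 3, here is a minimal sketch of rendering a pod manifest with N PVCs using Go's text/template. The template text, the `manifestData` type, the `renderManifest` and `countPVCs` helpers, the `claim-N` names, and the busybox image are all hypothetical stand-ins; the real device_slot_test.tmpl and generate_example_manifest.go in this PR may look quite different. Only the pod name and the nodegroup label are taken from the log output in this description.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// manifestTmpl is a hypothetical stand-in for device_slot_test.tmpl.
// It emits one PVC per claim, then a single pod that mounts all of them.
const manifestTmpl = `{{range .Claims}}apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{.}}
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
{{end}}apiVersion: v1
kind: Pod
metadata:
  name: attachment-limit-test-pod
spec:
  nodeSelector:
    eks.amazonaws.com/nodegroup: {{.Nodegroup}}
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
{{range .Claims}}    - name: {{.}}
      mountPath: /data/{{.}}
{{end}}  volumes:
{{range .Claims}}  - name: {{.}}
    persistentVolumeClaim:
      claimName: {{.}}
{{end}}`

// manifestData holds the values substituted into the template (assumed shape).
type manifestData struct {
	Nodegroup string
	Claims    []string
}

// renderManifest renders a manifest with volumeCount PVCs for the given nodegroup.
func renderManifest(nodegroup string, volumeCount int) (string, error) {
	claims := make([]string, volumeCount)
	for i := range claims {
		claims[i] = fmt.Sprintf("claim-%d", i)
	}
	tmpl, err := template.New("manifest").Parse(manifestTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, manifestData{Nodegroup: nodegroup, Claims: claims}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// countPVCs counts the PersistentVolumeClaim documents in a rendered manifest.
func countPVCs(manifest string) int {
	return strings.Count(manifest, "kind: PersistentVolumeClaim")
}

func main() {
	out, err := renderManifest("ng-attachment-limit-test", 2)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The rendered output could then be written to a temporary file and applied with kubectl, which is roughly what the `/tmp/tmp.*` manifest paths in the log suggest.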

What testing is done?
Tested on m5.large and m7g.large nodegroups. An m7g.large has a limit of 28 attachments, including ENIs. With 1 root volume and 2 ENIs attached to the instance, the script correctly found 25 to be the maximum number of EBS-backed PVs a pod can be created with.
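
The expected value of 25 follows from simple subtraction. A small sketch, assuming ENIs share the same attachment limit as volumes (as described above for m7g.large; this may not hold for every instance type). The function name `remainingSlots` is illustrative and does not come from the driver.

```go
package main

import "fmt"

// remainingSlots returns how many more EBS volumes fit on an instance,
// assuming block device mappings and ENIs both count against the same limit.
func remainingSlots(limit, blockDeviceMappings, enis int) int {
	remaining := limit - blockDeviceMappings - enis
	if remaining < 0 {
		return 0
	}
	return remaining
}

func main() {
	// m7g.large from the test above: limit 28, 1 root volume, 2 ENIs.
	fmt.Println(remainingSlots(28, 1, 2)) // prints 25
}
```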

❯ export CLUSTER_NAME="devcluster"
❯ export NODEGROUP_NAME="ng-attachment-limit-test"
❯ export INSTANCE_TYPE="m7g.large"

❯ eksctl create nodegroup -c "$CLUSTER_NAME" --nodes 1 -t "$INSTANCE_TYPE" -n "$NODEGROUP_NAME"

❯ MIN_VOLUME_GUESS=20 MAX_VOLUME_GUESS=40 POD_TIMEOUT_SECONDS=120 ./get-attachment-breakdown "$NODEGROUP_NAME"
2023-12-08 17:57:59 [INFO] - Currently the instance associated with nodegroup name ng-attachment-limit-test has the following attachments:
2023-12-08 17:58:01 [INFO] - 1 volumes from Block Device Mappings are attached to the instance. (Including the instance's root volume)
2023-12-08 17:58:02 [INFO] - 2 Elastic Network Interfaces (ENIs) are attached to the instance. (NOTE: These ENIs may not count towards volume limit for certain Nitro System instance types. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html)
2023-12-08 17:58:02 [INFO] - Checking how many additional EBS volumes are able to be attached via the aws-ebs-csi-driver. This may take a while...
2023-12-08 17:58:02 [INFO] - Attempting to deploy pod with 30 PVCs on node with label 'eks.amazonaws.com/nodegroup:ng-attachment-limit-test'
2023-12-08 17:58:02 [INFO] - Creating k8s objects associated with manifest /tmp/tmp.CcGYdDXEMq
2023-12-08 17:58:07 [INFO] - Waiting 120 seconds for 'pod/attachment-limit-test-pod to reach condition 'ready'
2023-12-08 18:00:07 [INFO] - Pod with 30 PVCs did not successfully deploy
2023-12-08 18:00:07 [INFO] - Deleting k8s objects associated with manifest /tmp/tmp.CcGYdDXEMq
2023-12-08 18:00:15 [INFO] - Attempting to deploy pod with 25 PVCs on node with label 'eks.amazonaws.com/nodegroup:ng-attachment-limit-test'
2023-12-08 18:00:15 [INFO] - Creating k8s objects associated with manifest /tmp/tmp.Sirt5LQA5B
2023-12-08 18:00:19 [INFO] - Waiting 120 seconds for 'pod/attachment-limit-test-pod to reach condition 'ready'
2023-12-08 18:01:22 [INFO] - Pod with 25 PVCs successfully deployed
2023-12-08 18:01:22 [INFO] - Deleting k8s objects associated with manifest /tmp/tmp.Sirt5LQA5B
2023-12-08 18:01:58 [INFO] - Attempting to deploy pod with 27 PVCs on node with label 'eks.amazonaws.com/nodegroup:ng-attachment-limit-test'
2023-12-08 18:01:58 [INFO] - Creating k8s objects associated with manifest /tmp/tmp.yEbNmyCIKQ
2023-12-08 18:02:03 [INFO] - Waiting 120 seconds for 'pod/attachment-limit-test-pod to reach condition 'ready'
2023-12-08 18:04:03 [INFO] - Pod with 27 PVCs did not successfully deploy
2023-12-08 18:04:03 [INFO] - Deleting k8s objects associated with manifest /tmp/tmp.yEbNmyCIKQ
2023-12-08 18:04:10 [INFO] - Attempting to deploy pod with 26 PVCs on node with label 'eks.amazonaws.com/nodegroup:ng-attachment-limit-test'
2023-12-08 18:04:10 [INFO] - Creating k8s objects associated with manifest /tmp/tmp.c2LShnXELL
2023-12-08 18:04:14 [INFO] - Waiting 120 seconds for 'pod/attachment-limit-test-pod to reach condition 'ready'
2023-12-08 18:06:14 [INFO] - Pod with 26 PVCs did not successfully deploy
2023-12-08 18:06:14 [INFO] - Deleting k8s objects associated with manifest /tmp/tmp.c2LShnXELL
2023-12-08 18:06:30 [INFO] - Success!
2023-12-08 18:06:30 [INFO] - Maximum amount of volumes deployed with pod on node with label 'eks.amazonaws.com/nodegroup:ng-attachment-limit-test': 25
2023-12-08 18:06:30 [INFO] - 25 volumes are able to be attached to the instance.
Attachments for ng-attachment-limit-test
BlockDeviceMappings  ENIs  Available-Attachment-Slots(Validated-by-aws-ebs-csi-driver)
1                    2     25
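
The probe sequence in the log (30, 25, 27, 26 with MIN_VOLUME_GUESS=20 and MAX_VOLUME_GUESS=40) is consistent with a binary search between the two guesses. The sketch below is an assumption about the search strategy, not a copy of find-attachment-limit: it treats the lower bound as attachable without testing it, treats the upper bound as not attachable, and replaces the real pod deployment with a hypothetical `deploy` callback.

```go
package main

import "fmt"

// findMaxVolumes returns the largest volume count in [lo, hi) for which
// deploy succeeds, along with the counts it probed along the way.
// It assumes deploy would succeed at lo, fail at hi, and that success is
// monotonic (if n volumes attach, so do n-1).
func findMaxVolumes(lo, hi int, deploy func(n int) bool) (int, []int) {
	var probes []int
	for hi-lo > 1 {
		mid := (lo + hi) / 2
		probes = append(probes, mid)
		if deploy(mid) {
			lo = mid // mid volumes attached; the limit is at least mid
		} else {
			hi = mid // mid volumes did not attach; the limit is below mid
		}
	}
	return lo, probes
}

func main() {
	// Simulated node that can attach at most 25 additional volumes.
	deploy := func(n int) bool { return n <= 25 }
	limit, probes := findMaxVolumes(20, 40, deploy)
	fmt.Println(limit, probes) // prints 25 [30 25 27 26]
}
```

Because every failed probe waits out the full POD_TIMEOUT_SECONDS, halving the range keeps the number of slow failures small: three timeouts in the run above instead of one per candidate count.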

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 8, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 8, 2023

github-actions bot commented Dec 8, 2023

Code Coverage Diff

This PR does not change the code coverage

@AndrewSirenko AndrewSirenko force-pushed the Device-slot-test-k8s branch 2 times, most recently from 9aa12a7 to da81e5a Compare January 11, 2024 20:48
@ConnorJC3

/lgtm

But you need to run make update (whoever approves feel free to reapply this lgtm if the only diff is shell formatting).

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2024
@ConnorJC3

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2024
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorJC3


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 12, 2024
@k8s-ci-robot k8s-ci-robot merged commit d684d4d into kubernetes-sigs:master Jan 12, 2024
19 checks passed