Selector Terms for AMI Minimum Age #5382
Comments
I'm running into the same bad AMI issue.
Somewhat relates to #4769 and #1495 which want to alter the AMI selection logic; though, the requirements here are slightly different.
Were you using the default AMI that was passed through with the
Agree that this might also be something that would be nice to have. Since these terms are meant to perform a matching mechanism against the instance types that we choose to launch with, one idea here is that we allow a `requirements` section in the `amiSelectorTerms`:

```yaml
- tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["c5.large"]
```

so you could exclude an AMI indirectly by doing

```yaml
- tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
- id: ${BAD_AMI_ID}
  requirements:
  - key: "node.kubernetes.io/instance-type"
    operator: DoesNotExist
```

That's obviously more indirect for the use-case, so one other alternative is that we create something like an explicit exclude/reject flag:

```yaml
- tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
- id: ${BAD_AMI_ID}
  exclude/reject: true
```

Realistically, I think the
+1. The `NotIn` operator only helps after an outage has already occurred and you know the faulty AMI ID. To avoid issues like the "file limit" bad AMI entirely, the only approach is to enforce a minimum age requirement (e.g., 1 or 2 weeks) as a preventive measure.
would you review PRs for the suggested
@jonathan-innis I took a stab at it and drafted a PR.
@grandich Rather than introducing a minimum age requirement, does it make sense for you to pin the image id on the EC2NodeClass in your
@jonathan-innis (Apologies if I haven't fully grasped or addressed your rationale regarding having a CI platform orchestrate this process.)

A CI platform could manage this orchestration, but we wanted a solution that doesn't depend on one. We also want to avoid pinning because of the maintenance overhead and the risk of missing AMI updates and ending up on outdated AMIs. What we need is an automated solution, whether through a minimum-age parameter or an automated "n-1"/"n-2" mechanism.

Regarding testing AMIs: in some scenarios the bug doesn't manifest in lower environments (the file-limits one, for example), so the age of an AMI already widely adopted by the community is a stronger assurance. The rationale behind "minimum age" is that it effectively treats an AMI as being in a "beta stage" during that window.

Frankly, we were happy with the always-latest AMI policy until a significant bug caused issues (and was swiftly fixed by the community, well within a "minimum age" window).
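The automated "n-1"/"n-2" idea mentioned above could be sketched roughly as follows. This is illustrative Python, not Karpenter code; the function name and the input shape (dicts with `ImageId` and an ISO-8601 `CreationDate`, mirroring EC2 `DescribeImages` results) are assumptions:

```python
from datetime import datetime


def pick_nth_newest_ami(images, n=1):
    """Pick the (n+1)-th newest AMI: n=1 gives 'n-1' (skip the latest release).

    `images` is a list of dicts shaped like EC2 DescribeImages results,
    each with an 'ImageId' and an ISO-8601 'CreationDate'.
    """
    ordered = sorted(
        images,
        # EC2 reports CreationDate with a trailing 'Z'; normalise for fromisoformat.
        key=lambda img: datetime.fromisoformat(img["CreationDate"].replace("Z", "+00:00")),
        reverse=True,
    )
    if n >= len(ordered):
        raise ValueError("not enough AMIs to skip back that far")
    return ordered[n]["ImageId"]
```

The idea is that by the time a release is no longer the newest, the community has already been running it for a while.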
The same happened to us: we woke up at midnight due to awslabs/amazon-eks-ami#1744. This feature would be useful for us as well.
To clarify what I would prefer here, a selector for "minimum age" wouldn't be very useful for us since "age" is a number that changes over time and cannot be tested and promoted. Instead, we would want a selector for "AMI must have been created before date X". This is a static number that we can test then promote. |
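The "created before date X" idea can be sketched as a simple filter. This is illustrative Python, not part of Karpenter; the dict shape mirrors EC2 `DescribeImages` output:

```python
from datetime import datetime, timezone


def amis_created_before(images, cutoff):
    """Return AMIs created strictly before `cutoff` (a timezone-aware datetime).

    Because `cutoff` is a fixed date rather than a rolling age, the exact
    same filter result can be tested in staging and then promoted unchanged.
    """
    def created(img):
        # EC2 reports CreationDate as ISO-8601 with a trailing 'Z'.
        return datetime.fromisoformat(img["CreationDate"].replace("Z", "+00:00"))

    return [img for img in images if created(img) < cutoff]
```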
We're currently explicitly setting the AMI IDs for Karpenter, but this adds a reasonably high overhead, both in engineering and operations. I think a good solution would be for Karpenter to support both a

With the above settings we could run engineering clusters with a

This would provide an optimised happy path where no additional work would be required outside of testing new AMIs in a timely manner when the AMI is good (which is the majority of the time). In the case of a major AMI issue we'd expect that the active AMI would be reverted before the end of the delay. In the case of an AMI issue specific to our implementation we'd be able to skip it via a

A more complete (but more complex) version of the above would be to expose a
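The combination described above — a rolling minimum age plus an explicit skip for AMIs that are bad only for a particular implementation — could look roughly like this. It is an illustrative Python sketch; `eligible_amis`, `minimum_age_days`, and `skip_ids` are hypothetical names, not Karpenter API:

```python
from datetime import datetime, timedelta, timezone


def eligible_amis(images, minimum_age_days, skip_ids=(), now=None):
    """Keep AMIs at least `minimum_age_days` old and not explicitly skipped."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=minimum_age_days)
    kept = []
    for img in images:
        if img["ImageId"] in skip_ids:
            continue  # operator decided this AMI is bad for this cluster specifically
        created = datetime.fromisoformat(img["CreationDate"].replace("Z", "+00:00"))
        if created <= cutoff:
            kept.append(img)
    return kept
```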
I agree. We tested the approach suggested in the documentation (pinning AMI IDs in higher environments) and found it adds (avoidable, in my opinion) overhead on the operational side. We tested the same setup as @stevehipwell (pinned AMI ID in production, latest from SSM in lower environments), and when a new AMI is released and cleared by our testing system, it requires an explicit re-deploy to production.
Following the Bottlerocket 1.26 upgrade issue, this feature would've saved us quite some maintenance. Thank you!
Description
What problem are you trying to solve?
It would be great if I could specify the minimum age of an AMI, so that an AMI can be run in staging for a while before it goes to production.
How important is this feature to you?
We just had a production outage due to a bad AMI.
While we can build a delay mechanism on top using automation/pipeline/etc, a "minimumAge" seems simpler to manage operationally.
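For illustration, such a field might look like this on an `EC2NodeClass` (hypothetical sketch — `minimumAge` is not an existing Karpenter field):

```yaml
# Hypothetical sketch: `minimumAge` does not exist in Karpenter today.
amiSelectorTerms:
  - tags:
      karpenter.sh/discovery: ${CLUSTER_NAME}
    minimumAge: 14d  # only consider AMIs created at least 14 days ago
```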