Selector Terms for AMI Minimum Age #5382

lnattrass · 2023-12-22T16:29:22Z

Description

** READ BEFORE CONTINUING: If your issue is not specific to AWS, please cut a ticket in kubernetes-sigs/karpenter.

What problem are you trying to solve?
It would be great if I could specify the minimum age of an AMI, so that an AMI can be run in staging for a while before it goes to production.

How important is this feature to you?
We just had a production outage due to a bad AMI.
While we can build a delay mechanism on top using automation/pipeline/etc, a "minimumAge" seems simpler to manage operationally.

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

hamishforbes · 2023-12-22T19:03:52Z

I'm running into the same bad AMI issue.
My immediate idea for a fix was to add a NotIn term to the AMI selector for the bad AMI id, but thats not possible

jonathan-innis · 2023-12-22T21:04:38Z

Somewhat relates to #4769 and #1495 which want to alter the AMI selection logic; though, the requirements here are slightly different.

It would be great if I could specify the minimum age of an AMI, so that an AMI can be run in staging for a while before it goes to production

Were you using the default AMI that was passed through with the AMIFamily or were you specifying specific AMIs in the amiSelectorTerms? In either case, I'd imagine that it would probably make sense to have an amiMinimumAge field in the EC2NodeClass spec to control filtering out AMIs based on age regardless of what gets selected by the amiFamily or the amiSelectorTerms. I doubt that you would want to define this per-term.

My immediate idea for a fix was to add a NotIn term to the AMI selector for the bad AMI id

Agree that this might also be something that would be nice to have. Since these terms are meant to perform a matching mechanism against the instance types that we choose to launch with, one idea here is that we allow a requirements section in the amiSelectorTerms that shows which amis match-up with which instance types if you wanted to get really specific like

amiSelectorTerms:
- tags: 
        karpenter.sh/discovery: ${CLUSTER_NAME}
  requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.large"]

so you could exclude an AMI indirectly by doing

- tags: 
        karpenter.sh/discovery: ${CLUSTER_NAME}
- id: ${BAD_AMI_ID}
   requirements:
   - key: "node.kubernetes.io/instance-type"
      operator: DoesNotExist

That's obviously more indirect for the use-case, so one other alternative is that we create something like an exclude/include field in the API block like

- tags: 
        karpenter.sh/discovery: ${CLUSTER_NAME}
- id: ${BAD_AMI_ID}
   exclude/reject: true

Realistically, I think the NotIn operator syntax may be too flexible for what we need here, since we aren't working with labels in the selection logic. Particularly, I'm not sure how we would represent tags as a key if we took-up a similar syntax to matchExpressions from nodeAffinity.

grandich · 2024-01-03T19:05:20Z

+1

The 'NotIn' operator is solely applicable post-outage occurrence and when you possess knowledge of the faulty AMI ID. To entirely circumvent issues such as the 'file limit bad AMI,' the sole approach is to enforce a minimum age requirement, like 1 or 2 weeks, as a preventive measure.

cebernardi · 2024-01-16T15:04:58Z

would you review PRs for the suggested amiMinimumAge field under EC2NodeClass?
We were hit as well from the Bad AMI issue, and this feature would have certainly prevented it from reaching production.

cebernardi · 2024-01-29T12:04:15Z

@jonathan-innis I took a stab and I drafted a PR.
I'm not sure if someone is already working to this feature and/or if it's already on the roadmap, but if not, I'm curious to hear your thoughts about the proposed solution.

jonathan-innis · 2024-02-20T21:54:28Z

the sole approach is to enforce a minimum age requirement, like 1 or 2 weeks, as a preventive measure

@grandich Rather than introducing a minimum age requirement, does it make sense for you to pin the image id on the EC2NodeClass in your amiSeleectorTerms and then orchestrate the rollout of newer AMIs through your CI platform? This way, you could handle the testing and validation in your lower environments and then promote the updated AMIs to your higher environments. Do you need an automated mechanism like minAge or selecting the n-1 AMI to do this or can this be fully orchestrated through an outside CI platform?

grandich · 2024-02-20T22:38:34Z

@jonathan-innis (Apologies if I haven't fully grasped or addressed your rationale regarding the orchestration of this process by a CI platform)

A CI platform could potentially manage this orchestration, but we sought a solution independent of such dependency.

Also, we aim to steer clear of pinning due to the maintenance overhead and the risk of overlooking AMI updates, potentially resulting in reliance on outdated AMIs. We require an automated solution, whether it be through a minimum age parameter or an automated 'n-1' or 'n-2' mechanism.

Regarding testing AMIs, in specific scenarios, the bug may not manifest in lower environments, such as the one of file limits, so in such instances, the age of the AMI universally adopted by the community serves as a more robust assurance.

The rationale behind the 'minimum age' concept mirrors that of a 'beta stage' for the AMIs during that period. Frankly we were pleased with the always-latest AMI policy until a significant bug caused issues (and was swiftly resolved by the community within a "minimum age").

cep21 · 2024-04-02T16:13:20Z

Frankly we were pleased with the always-latest AMI policy until a significant bug caused issues

The same happened to us: woke up at midnight due to awslabs/amazon-eks-ami#1744

This feature would be useful for us as well.

cep21 · 2024-04-02T16:19:52Z

To clarify what I would prefer here, a selector for "minimum age" wouldn't be very useful for us since "age" is a number that changes over time and cannot be tested and promoted.

Instead, we would want a selector for "AMI must have been created before date X". This is a static number that we can test then promote.

stevehipwell · 2024-04-12T10:34:31Z

We're currently explicitly setting the AMI IDs for Karpenter, but this adds a reasonably high overhead, both in engineering and operations. I think a good solution would be for Karpenter to support both a NotIn operator and a minimum age configuration to consume the AMI from SSM.

With the above settings we could run engineering clusters with a 0 minimum age to test new AMIs as they're released, and then have the other clusters running with an appropriate delay so new AMI have the opportunity to be tested.

This would provide an optimised happy path where no additional work would be required outside of testing new AMIs in a timely manner when the AMI is good (which is the majority of the time). In the case of a major AMI issue we'd expect that the active AMI would be reverted before the end of the delay. In the case of an AMI issue specific to our implementation we'd be able to skip it via a NotIn selector while taking subsequent versions (it's easier to get end users to make a change to stop a cluster breaking than to remove a workaround). In other more complex scenarios we'd still be able to fall back to explicitly setting the AMI ID.

A more complete (but more complex) version of the above would be to expose a defaultDelay configuration parameter to be paired with inputs for AMIs which shouldn't be used and for AMIs which can be used ahead of the default delay. The advantage of this pattern would be that it supports the two high value scenarios, a bad AMI or an AMI fixing a CVE where we want to roll out ASAP. This could be configured with the AMI config (as above), in the cluster but with separation via a new CRD, or outside of the cluster using something like SSM.

cebernardi · 2024-04-15T11:01:25Z

I agree, we tested the approach suggested in the documentation (pinning AMI IDs in higher environments) and we found it adds (avoidable, in my opinion) overhead on the operational side.

We tested the same setup as @stevehipwell (pinned AMI ID in production, latest from SSM lower environments), and when a new AMI is released and it's cleared by our testing system, it requires an explicit re-deploy to production.

amiMinimumAge would allow operators to intervene with a redeploy of Karpenter configuration only when the testing system picks up a deviation, hence saving operational time

JulesClaussen · 2024-10-24T10:53:25Z

Following bottlerocket 1.26 upgrade issue, this feature would've avoid us quite some maintenance.
I can see that the draft PR has been closed following this comment: #5551 (comment). Do you mind elaborating on that?

Thank you!
Jules

lnattrass added feature New feature or request needs-triage Issues that need to be triaged labels Dec 22, 2023

jonathan-innis removed the needs-triage Issues that need to be triaged label Dec 22, 2023

cartermckinnon mentioned this issue Dec 25, 2023

File descriptor limit change in AMI release v20231220 awslabs/amazon-eks-ami#1551

Closed

cebernardi mentioned this issue Jan 29, 2024

[DRAFT] Ami minimum age #5551

Closed

3 tasks

mikestef9 mentioned this issue Mar 29, 2024

Create a Delayed Release Process bottlerocket-os/bottlerocket#3813

Open

cbgbt mentioned this issue Nov 1, 2024

v1.26.0 ⚱️Tracking Issue bottlerocket-os/bottlerocket#4253

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Selector Terms for AMI Minimum Age #5382

Selector Terms for AMI Minimum Age #5382

lnattrass commented Dec 22, 2023

hamishforbes commented Dec 22, 2023

jonathan-innis commented Dec 22, 2023

grandich commented Jan 3, 2024

cebernardi commented Jan 16, 2024 •

edited

Loading

cebernardi commented Jan 29, 2024

jonathan-innis commented Feb 20, 2024

grandich commented Feb 20, 2024

cep21 commented Apr 2, 2024

cep21 commented Apr 2, 2024

stevehipwell commented Apr 12, 2024

cebernardi commented Apr 15, 2024 •

edited

Loading

JulesClaussen commented Oct 24, 2024

Selector Terms for AMI Minimum Age #5382

Selector Terms for AMI Minimum Age #5382

Comments

lnattrass commented Dec 22, 2023

Description

hamishforbes commented Dec 22, 2023

jonathan-innis commented Dec 22, 2023

grandich commented Jan 3, 2024

cebernardi commented Jan 16, 2024 • edited Loading

cebernardi commented Jan 29, 2024

jonathan-innis commented Feb 20, 2024

grandich commented Feb 20, 2024

cep21 commented Apr 2, 2024

cep21 commented Apr 2, 2024

stevehipwell commented Apr 12, 2024

cebernardi commented Apr 15, 2024 • edited Loading

JulesClaussen commented Oct 24, 2024

cebernardi commented Jan 16, 2024 •

edited

Loading

cebernardi commented Apr 15, 2024 •

edited

Loading