Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selector Terms for AMI Minimum Age #5382

Open
lnattrass opened this issue Dec 22, 2023 · 12 comments
Open

Selector Terms for AMI Minimum Age #5382

lnattrass opened this issue Dec 22, 2023 · 12 comments
Labels
feature New feature or request

Comments

@lnattrass
Copy link

Description

** READ BEFORE CONTINUING: If your issue is not specific to AWS, please cut a ticket in kubernetes-sigs/karpenter.

What problem are you trying to solve?
It would be great if I could specify the minimum age of an AMI, so that an AMI can be run in staging for a while before it goes to production.

How important is this feature to you?
We just had a production outage due to a bad AMI.
While we can build a delay mechanism on top using automation/pipeline/etc, a "minimumAge" seems simpler to manage operationally.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@lnattrass lnattrass added feature New feature or request needs-triage Issues that need to be triaged labels Dec 22, 2023
@hamishforbes
Copy link

I'm running into the same bad AMI issue.
My immediate idea for a fix was to add a NotIn term to the AMI selector for the bad AMI id, but thats not possible

@jonathan-innis
Copy link
Contributor

Somewhat relates to #4769 and #1495 which want to alter the AMI selection logic; though, the requirements here are slightly different.

It would be great if I could specify the minimum age of an AMI, so that an AMI can be run in staging for a while before it goes to production

Were you using the default AMI that was passed through with the AMIFamily or were you specifying specific AMIs in the amiSelectorTerms? In either case, I'd imagine that it would probably make sense to have an amiMinimumAge field in the EC2NodeClass spec to control filtering out AMIs based on age regardless of what gets selected by the amiFamily or the amiSelectorTerms. I doubt that you would want to define this per-term.

My immediate idea for a fix was to add a NotIn term to the AMI selector for the bad AMI id

Agree that this might also be something that would be nice to have. Since these terms are meant to perform a matching mechanism against the instance types that we choose to launch with, one idea here is that we allow a requirements section in the amiSelectorTerms that shows which amis match-up with which instance types if you wanted to get really specific like

amiSelectorTerms:
- tags: 
        karpenter.sh/discovery: ${CLUSTER_NAME}
  requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.large"]

so you could exclude an AMI indirectly by doing

- tags: 
        karpenter.sh/discovery: ${CLUSTER_NAME}
- id: ${BAD_AMI_ID}
   requirements:
   - key: "node.kubernetes.io/instance-type"
      operator: DoesNotExist

That's obviously more indirect for the use-case, so one other alternative is that we create something like an exclude/include field in the API block like

- tags: 
        karpenter.sh/discovery: ${CLUSTER_NAME}
- id: ${BAD_AMI_ID}
   exclude/reject: true

Realistically, I think the NotIn operator syntax may be too flexible for what we need here, since we aren't working with labels in the selection logic. Particularly, I'm not sure how we would represent tags as a key if we took-up a similar syntax to matchExpressions from nodeAffinity.

@grandich
Copy link

grandich commented Jan 3, 2024

+1

The 'NotIn' operator is solely applicable post-outage occurrence and when you possess knowledge of the faulty AMI ID. To entirely circumvent issues such as the 'file limit bad AMI,' the sole approach is to enforce a minimum age requirement, like 1 or 2 weeks, as a preventive measure.

@cebernardi
Copy link
Contributor

cebernardi commented Jan 16, 2024

would you review PRs for the suggested amiMinimumAge field under EC2NodeClass?
We were hit as well from the Bad AMI issue, and this feature would have certainly prevented it from reaching production.

@cebernardi
Copy link
Contributor

@jonathan-innis I took a stab and I drafted a PR.
I'm not sure if someone is already working to this feature and/or if it's already on the roadmap, but if not, I'm curious to hear your thoughts about the proposed solution.

@jonathan-innis
Copy link
Contributor

the sole approach is to enforce a minimum age requirement, like 1 or 2 weeks, as a preventive measure

@grandich Rather than introducing a minimum age requirement, does it make sense for you to pin the image id on the EC2NodeClass in your amiSeleectorTerms and then orchestrate the rollout of newer AMIs through your CI platform? This way, you could handle the testing and validation in your lower environments and then promote the updated AMIs to your higher environments. Do you need an automated mechanism like minAge or selecting the n-1 AMI to do this or can this be fully orchestrated through an outside CI platform?

@grandich
Copy link

@jonathan-innis (Apologies if I haven't fully grasped or addressed your rationale regarding the orchestration of this process by a CI platform)

A CI platform could potentially manage this orchestration, but we sought a solution independent of such dependency.

Also, we aim to steer clear of pinning due to the maintenance overhead and the risk of overlooking AMI updates, potentially resulting in reliance on outdated AMIs. We require an automated solution, whether it be through a minimum age parameter or an automated 'n-1' or 'n-2' mechanism.

Regarding testing AMIs, in specific scenarios, the bug may not manifest in lower environments, such as the one of file limits, so in such instances, the age of the AMI universally adopted by the community serves as a more robust assurance.

The rationale behind the 'minimum age' concept mirrors that of a 'beta stage' for the AMIs during that period. Frankly we were pleased with the always-latest AMI policy until a significant bug caused issues (and was swiftly resolved by the community within a "minimum age").

@cep21
Copy link

cep21 commented Apr 2, 2024

Frankly we were pleased with the always-latest AMI policy until a significant bug caused issues

The same happened to us: woke up at midnight due to awslabs/amazon-eks-ami#1744

This feature would be useful for us as well.

@cep21
Copy link

cep21 commented Apr 2, 2024

To clarify what I would prefer here, a selector for "minimum age" wouldn't be very useful for us since "age" is a number that changes over time and cannot be tested and promoted.

Instead, we would want a selector for "AMI must have been created before date X". This is a static number that we can test then promote.

@stevehipwell
Copy link
Contributor

We're currently explicitly setting the AMI IDs for Karpenter, but this adds a reasonably high overhead, both in engineering and operations. I think a good solution would be for Karpenter to support both a NotIn operator and a minimum age configuration to consume the AMI from SSM.

With the above settings we could run engineering clusters with a 0 minimum age to test new AMIs as they're released, and then have the other clusters running with an appropriate delay so new AMI have the opportunity to be tested.

This would provide an optimised happy path where no additional work would be required outside of testing new AMIs in a timely manner when the AMI is good (which is the majority of the time). In the case of a major AMI issue we'd expect that the active AMI would be reverted before the end of the delay. In the case of an AMI issue specific to our implementation we'd be able to skip it via a NotIn selector while taking subsequent versions (it's easier to get end users to make a change to stop a cluster breaking than to remove a workaround). In other more complex scenarios we'd still be able to fall back to explicitly setting the AMI ID.


A more complete (but more complex) version of the above would be to expose a defaultDelay configuration parameter to be paired with inputs for AMIs which shouldn't be used and for AMIs which can be used ahead of the default delay. The advantage of this pattern would be that it supports the two high value scenarios, a bad AMI or an AMI fixing a CVE where we want to roll out ASAP. This could be configured with the AMI config (as above), in the cluster but with separation via a new CRD, or outside of the cluster using something like SSM.

@cebernardi
Copy link
Contributor

cebernardi commented Apr 15, 2024

I agree, we tested the approach suggested in the documentation (pinning AMI IDs in higher environments) and we found it adds (avoidable, in my opinion) overhead on the operational side.

We tested the same setup as @stevehipwell (pinned AMI ID in production, latest from SSM lower environments), and when a new AMI is released and it's cleared by our testing system, it requires an explicit re-deploy to production.

amiMinimumAge would allow operators to intervene with a redeploy of Karpenter configuration only when the testing system picks up a deviation, hence saving operational time

@JulesClaussen
Copy link

Following bottlerocket 1.26 upgrade issue, this feature would've avoid us quite some maintenance.
I can see that the draft PR has been closed following this comment: #5551 (comment). Do you mind elaborating on that?

Thank you!
Jules

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants