
Managed node group upgrade using Bottlerocket fails #4423

Closed
Anorlondo448 opened this issue Nov 7, 2021 · 15 comments · Fixed by #4666

@Anorlondo448

What were you trying to accomplish?
I tried upgrading a Bottlerocket node group using eksctl upgrade nodegroup, specifying the Kubernetes version as an option.

eksctl upgrade nodegroup --name=bottlerocket-nodegroup --cluster=cluster-name --kubernetes-version=1.21

What happened?
The upgrade fails. The AMI release version derived from the Kubernetes version is reported as invalid (see below).

How to reproduce it?

  • Create a managed node group using Bottlerocket with eksctl create nodegroup.
  • Upgrade it with --kubernetes-version in eksctl upgrade nodegroup; the upgrade fails.
  • Note: specifying the release version of the Bottlerocket AMI directly with --release-version succeeds.

Logs

  • cli error output:
Error: error updating nodegroup stack: waiting for CloudFormation stack "eksctl-cluster-name-nodegroup-bottlerocket-nodegroup": ResourceNotReady: failed waiting for successful resource state
  • CloudFormation error output:
Requested release version 1.21.4-20211013 is not valid for kubernetes version 1.21

Anything else we need to know?
In the following code, the Amazon Linux 2 AMI release version is fetched even when the AMI family is Bottlerocket:

https://github.com/weaveworks/eksctl/blob/7565de70d46cd0930929b834527e5a1368ecb46e/pkg/managed/service.go#L345
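
For illustration, here is a minimal sketch of that kind of hardcoded lookup (assuming the aws-sdk-go v1 SSM client; the function name is illustrative, not the exact service.go code):

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ssm"
)

// getLatestReleaseVersion mirrors the problematic behaviour: the SSM
// parameter path always uses "amazon-linux-2", regardless of the
// nodegroup's AMI family, so Bottlerocket nodegroups receive an
// Amazon Linux 2 release version.
func getLatestReleaseVersion(kubernetesVersion string) (string, error) {
	client := ssm.New(session.Must(session.NewSession()))
	paramName := fmt.Sprintf(
		"/aws/service/eks/optimized-ami/%s/amazon-linux-2/recommended/release_version",
		kubernetesVersion,
	)
	out, err := client.GetParameter(&ssm.GetParameterInput{Name: aws.String(paramName)})
	if err != nil {
		return "", err
	}
	return *out.Parameter.Value, nil
}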

Versions

$ eksctl info
eksctl version: 0.72.0
kubectl version: v1.21.5
OS: darwin
@cPu1 cPu1 added the area/managed-nodegroup EKS Managed Nodegroups label Nov 8, 2021
@github-actions
Contributor

github-actions bot commented Dec 9, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Dec 9, 2021
@cPu1 cPu1 removed the stale label Dec 9, 2021
@cPu1
Contributor

cPu1 commented Dec 9, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Not stale, we need to look into this.

@Himangini
Collaborator

@Anorlondo448 Are you still having this issue? Does the latest eksctl version help?

@aclevername aclevername self-assigned this Jan 14, 2022
@aclevername
Contributor

I've managed to reproduce this issue:

# NOTE: Bottlerocket AMI might not be available in all regions.
# Please check AWS official doc or below link for more details
# https://github.com/bottlerocket-os/bottlerocket/blob/develop/QUICKSTART.md#finding-an-ami
# A simple example of ClusterConfig object with Bottlerocket settings:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: jk
  region: us-west-2
  version: "1.20"

managedNodeGroups:
  - name: mng1
    instanceType: m5.xlarge
    desiredCapacity: 1
    amiFamily: Bottlerocket
    labels:
      "network-locality.example.com/public": "true"
    bottlerocket:
      enableAdminContainer: true
      settings:
        motd: "Hello, eksctl!"

Upgrading this to 1.21 produced the same error, complaining that the release_version is invalid. As @Anorlondo448 points out, we have hardcoded AmazonLinux2 as the AMI type when fetching the release version, which is what causes the problem.

I tried changing https://github.com/weaveworks/eksctl/blob/7565de70d46cd0930929b834527e5a1368ecb46e/pkg/managed/service.go#L345 to query for Bottlerocket instead, but it appears there isn't a way to fetch the release version.

For AL2 we query:

$ aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/release_version --region us-west-2
{
"Parameter": {
        "Name": "/aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/release_version",
        "Type": "String",
        "Value": "1.21.5-20220112",
        "Version": 16,
        "LastModifiedDate": "2022-01-12T23:53:53.136000+00:00",
        "ARN": "arn:aws:ssm:us-west-2::parameter/aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/release_version",
        "DataType": "text"
    }
}

However, I cannot find the equivalent query for Bottlerocket:

aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.20/bottlerocket/recommended/release_version --region us-west-2

An error occurred (ParameterNotFound) when calling the GetParameter operation:

Looking at the docs here https://docs.aws.amazon.com/eks/latest/userguide/retrieve-ami-id-bottlerocket.html, they suggest

aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-1.21/x86_64/latest/image_id --region region-code --query "Parameter.Value" --output text

as a way to query for the image_id. You would expect that replacing image_id with release_version would return the release version, but that doesn't work either:

aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-1.20/x86_64/latest/release_version --region us-west-2

An error occurred (ParameterNotFound) when calling the GetParameter operation:

Why do we fetch the release version?

One thing I don't understand is why we update the release_version field as the method of upgrading the nodegroup version. There is a more appropriate Version field for the Kubernetes version. Docs: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html#cfn-eks-nodegroup-version

Version
The Kubernetes version to use for your managed nodes. By default, the Kubernetes version of the cluster is used, and this is the only accepted specified value. If you specify launchTemplate, and your launch template uses a custom AMI, then don't specify version, or the node group deployment will fail. For more information about using launch templates with Amazon EKS, see Launch template support in the Amazon EKS User Guide.

Required: No

Type: String

ReleaseVersion
The AMI version of the Amazon EKS optimized AMI to use with your node group (for example, 1.14.7-YYYYMMDD). By default, the latest available AMI version for the node group's current Kubernetes version is used. For more information, see Amazon EKS optimized Linux AMI Versions in the Amazon EKS User Guide.

Note: Changing this value triggers an update of the node group if one is available. However, only the latest available AMI release version is valid as an input. You cannot roll back to a previous AMI release version.

Required: No

Type: String

Reading the docs, it appears much simpler for us to just update Version to the Kubernetes version as the method of upgrading, as I don't see the value in fetching the release_version, which only appears to work for AL2.

I've confirmed manually that setting the Version field in a Bottlerocket template to the next Kubernetes version successfully upgrades it.
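
For reference, a sketch of what a Version-only upgrade looks like when expressed directly against the EKS API (using the aws-sdk-go v1 EKS client; eksctl itself drives this through CloudFormation, so this is only illustrative):

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/eks"
)

// upgradeNodegroupVersion bumps only the Kubernetes version; EKS then
// picks the latest available AMI release version for that version,
// which works for Bottlerocket as well as AL2.
func upgradeNodegroupVersion(clusterName, nodegroupName, version string) error {
	client := eks.New(session.Must(session.NewSession()))
	_, err := client.UpdateNodegroupVersion(&eks.UpdateNodegroupVersionInput{
		ClusterName:   aws.String(clusterName),
		NodegroupName: aws.String(nodegroupName),
		Version:       aws.String(version),
	})
	return err
}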

@aclevername
Contributor

I'm going to start working on a PR that migrates this to updating the version field instead of release_version.

@cPu1 do you have any context on why we took this approach?

@cPu1
Contributor

cPu1 commented Jan 17, 2022

I'm going to start working on a PR that migrates this to updating the version field instead of release_version.

@cPu1 do you have any context on why we took this approach?

We query the latest release version and set it explicitly in the CloudFormation template to force CFN to apply the changeset if nothing else has changed in the template. Otherwise, if you attempt to upgrade a nodegroup that's on the same Kubernetes version as the control plane, the changeset will be empty and it won't be upgraded to the latest release version. This is required because of CloudFormation's declarative nature.

@aclevername
Contributor

We query the latest release version and set it explicitly in the CloudFormation template to force CFN to apply the changeset if nothing else has changed in the template. Otherwise, if you attempt to upgrade a nodegroup that's on the same Kubernetes version as the control plane, the changeset will be empty and it won't be upgraded to the latest release version. This is required because of CloudFormation's declarative nature.

When would you want to upgrade a nodegroup that's already on the same k8s version? When running something like
eksctl upgrade nodegroup --name=bottlerocket-nodegroup --cluster=cluster-name --kubernetes-version=1.21, where you don't specify a launchTemplate/releaseVersion, surely it should just be a no-op?

@cPu1
Contributor

cPu1 commented Jan 17, 2022

When would you want to upgrade a nodegroup that's already on the same k8s version?

EKS also publishes new AMIs for existing Kubernetes versions.

When running something like
eksctl upgrade nodegroup --name=bottlerocket-nodegroup --cluster=cluster-name --kubernetes-version=1.21, where you don't specify a launchTemplate/releaseVersion, surely it should just be a no-op?

It won't be a no-op if there's a newer release version available for the current version of Kubernetes for the specified nodegroup.

@aclevername
Contributor

EKS also publishes new AMIs for existing Kubernetes versions.

👍 gotcha

So for Bottlerocket, where we can't seem to find the release_version (I'll dig further to see if we can find it), is just setting version instead the best workaround?

@aclevername
Contributor

After chatting with some folks at AWS, the Bottlerocket query to fetch the release version is /aws/service/bottlerocket/aws-k8s-1.21/x86_64/latest/image_version. Going to open a PR to fix and close this issue.
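
A sketch of what a family-aware parameter lookup might look like (the two paths are the ones confirmed in this thread; the helper itself is hypothetical, not eksctl's actual code):

package main

import "fmt"

// ssmReleaseVersionParam builds the SSM parameter name for the latest
// release version, per AMI family. Hypothetical helper for illustration.
func ssmReleaseVersionParam(amiFamily, kubernetesVersion string) (string, error) {
	switch amiFamily {
	case "AmazonLinux2":
		return fmt.Sprintf("/aws/service/eks/optimized-ami/%s/amazon-linux-2/recommended/release_version", kubernetesVersion), nil
	case "Bottlerocket":
		return fmt.Sprintf("/aws/service/bottlerocket/aws-k8s-%s/x86_64/latest/image_version", kubernetesVersion), nil
	default:
		return "", fmt.Errorf("unsupported AMI family %q", amiFamily)
	}
}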

@aclevername
Contributor

So I've been testing out the fix and seeing some strange behaviour:
When querying for AL2 images you get values like:

aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.21/amazon-linux-2/recommended/release_version --region us-west-2
        "Value": "1.21.5-20220112"

aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.20/amazon-linux-2/recommended/release_version --region us-west-2
        "Value": "1.20.11-20220112"

You can see how the value adjusts per k8s version. However, for Bottlerocket:

 aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-1.21/x86_64/latest/image_version --region us-west-2
        "Value": "1.5.2-1602f3a8"

 aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-1.20/x86_64/latest/image_version --region us-west-2
        "Value": "1.5.2-1602f3a8"

It returns the same value for different k8s versions, and doesn't seem to follow the k8s versioning pattern observed for AmazonLinux2.

@aclevername
Contributor

I'm going to switch back to the approach of updating version due to the above conflict. I've noticed a new problem with this:

eksctl upgrade nodegroup --cluster jk-br --name br-2 --kubernetes-version 1.21
2022-01-21 12:02:52 [ℹ]  upgrading nodegroup version
2022-01-21 12:02:52 [ℹ]  updating nodegroup stack
2022-01-21 12:02:52 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1642766572" for stack "eksctl-jk-br-nodegroup-br-2"
2022-01-21 12:03:09 [ℹ]  waiting for CloudFormation changeset "eksctl-update-nodegroup-1642766572" for stack "eksctl-jk-br-nodegroup-br-2"
2022-01-21 12:03:10 [ℹ]  waiting for CloudFormation stack "eksctl-jk-br-nodegroup-br-2"
2022-01-21 12:03:30 [ℹ]  waiting for CloudFormation stack "eksctl-jk-br-nodegroup-br-2"
2022-01-21 12:03:48 [ℹ]  waiting for CloudFormation stack "eksctl-jk-br-nodegroup-br-2"
2022-01-21 12:04:05 [ℹ]  waiting for CloudFormation stack "eksctl-jk-br-nodegroup-br-2"
2022-01-21 12:04:06 [✖]  unexpected status "UPDATE_ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-jk-br-nodegroup-br-2"
2022-01-21 12:04:06 [ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: UPDATE_ROLLBACK_IN_PROGRESS – "The following resource(s) failed to update: [ManagedNodeGroup]. "
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: UPDATE_FAILED – "Version and ReleaseVersion updates cannot be combined with other updates"
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: UPDATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: UPDATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: UPDATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: UPDATE_IN_PROGRESS – "User Initiated"
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: UPDATE_ROLLBACK_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: UPDATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: UPDATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: UPDATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: UPDATE_ROLLBACK_IN_PROGRESS – "The following resource(s) failed to update: [ManagedNodeGroup]. "
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: UPDATE_FAILED – "Version and ReleaseVersion updates cannot be combined with other updates"
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: UPDATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: UPDATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: UPDATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: UPDATE_IN_PROGRESS – "User Initiated"
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: CREATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_IN_PROGRESS – "Resource creation Initiated"
2022-01-21 12:04:06 [ℹ]  AWS::EKS::Nodegroup/ManagedNodeGroup: CREATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: CREATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::EC2::LaunchTemplate/LaunchTemplate: CREATE_COMPLETE
2022-01-21 12:04:06 [ℹ]  AWS::EC2::LaunchTemplate/LaunchTemplate: CREATE_IN_PROGRESS – "Resource creation Initiated"
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: CREATE_IN_PROGRESS – "Resource creation Initiated"
2022-01-21 12:04:06 [ℹ]  AWS::EC2::LaunchTemplate/LaunchTemplate: CREATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::IAM::Role/NodeInstanceRole: CREATE_IN_PROGRESS
2022-01-21 12:04:06 [ℹ]  AWS::CloudFormation::Stack/eksctl-jk-br-nodegroup-br-2: CREATE_IN_PROGRESS – "User Initiated"

"Version and ReleaseVersion updates cannot be combined with other updates" refers to the fact that we are also setting the ForceUpdateEnabled value at the same time (related PR). I think we need to do this in a stack update separate from the nodegroup version update. Sound reasonable, @cPu1?
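
A rough sketch of the two-phase idea (hypothetical helper names, using the aws-sdk-go v1 CloudFormation client):

package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

// updateStackAndWait applies a template to the stack and blocks until
// the update completes.
func updateStackAndWait(client *cloudformation.CloudFormation, stackName, templateBody string) error {
	_, err := client.UpdateStack(&cloudformation.UpdateStackInput{
		StackName:    aws.String(stackName),
		TemplateBody: aws.String(templateBody),
		Capabilities: aws.StringSlice([]string{"CAPABILITY_IAM"}),
	})
	if err != nil {
		return err
	}
	return client.WaitUntilStackUpdateComplete(&cloudformation.DescribeStacksInput{
		StackName: aws.String(stackName),
	})
}

// upgradeInTwoPhases first applies everything except the version bump
// (e.g. ForceUpdateEnabled), then runs a second update that changes only
// Version/ReleaseVersion, satisfying the EKS constraint above.
func upgradeInTwoPhases(client *cloudformation.CloudFormation, stackName, templateWithOtherChanges, templateWithVersionBump string) error {
	if err := updateStackAndWait(client, stackName, templateWithOtherChanges); err != nil {
		return err
	}
	return updateStackAndWait(client, stackName, templateWithVersionBump)
}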

@cPu1
Contributor

cPu1 commented Jan 24, 2022

"Version and ReleaseVersion updates cannot be combined with other updates" refers to the fact that we are also setting the ForceUpdateEnabled value at the same time (related PR)

I was aware that a nodegroup update cannot change fields other than Version and ReleaseVersion but I had assumed it'd allow ForceUpdateEnabled to be included as it relates to a nodegroup update. Can you verify that the generated changeset had ForceUpdateEnabled?

@aclevername
Contributor

aclevername commented Jan 24, 2022

"Version and ReleaseVersion updates cannot be combined with other updates" refers to the fact that we are also setting the ForceUpdateEnabled value at the same time (related PR)

I was aware that a nodegroup update cannot change fields other than Version and ReleaseVersion but I had assumed it'd allow ForceUpdateEnabled to be included as it relates to a nodegroup update. Can you verify that the generated changeset had ForceUpdateEnabled?

Yes, this field was definitely the cause of the problem.

@cloudkarthik99

Hey @cPu1,
So, as per the above conversation, is upgrading the Kubernetes version the only way to get the latest Bottlerocket AMI release versions in an EKS cluster?
We have two managed nodegroups in separate EKS clusters, deployed with Amazon Linux and Bottlerocket respectively, through a cluster config using a CloudFormation stack.
We don't have node auto-updates; every time we run the CI/CD pipeline, the CloudFormation stack is updated.
The Amazon Linux nodegroup successfully gets the latest AMI release version within the same k8s version, but that's not the case for Bottlerocket: its AMI release version updates only when the k8s version changes.

Is there any workaround you would suggest, as we need the latest AMI versions for the Bottlerocket nodegroups as well?
