feat(eks-v2-alpha): EKS Auto Mode support #33373

Open
wants to merge 34 commits into
base: main

pahud
Contributor

@pahud pahud commented Feb 11, 2025

Issue # (if applicable)

Address #32364 in aws-eks-v2-alpha.

For EKS Auto Mode, all required configuration, including computeConfig, kubernetesNetworkConfig, and blockStorage, is managed through the defaultCapacityType enum. When it is set to DefaultCapacityType.AUTOMODE (the default), these configurations are enabled automatically. The Cluster construct in aws-eks-v2-alpha therefore enables EKS Auto Mode by default and manages compute resources through node pools instead of creating default capacity or nodegroups. Users can still opt in to traditional nodegroup management by setting defaultCapacityType to NODEGROUP or EC2.

User Experience:

// Default usage - Auto Mode enabled by default
new eks.Cluster(this, 'hello-eks', {
  vpc,
  version: eks.KubernetesVersion.V1_32,
  kubectlProviderOptions: {
    kubectlLayer: new KubectlV32Layer(this, 'kubectl'),
  },
  // Auto Mode is enabled by default, no need to specify anything
});

// Explicit Auto Mode configuration
new eks.Cluster(this, 'hello-eks', {
  vpc,
  version: eks.KubernetesVersion.V1_32,
  kubectlProviderOptions: {
    kubectlLayer: new KubectlV32Layer(this, 'kubectl'),
  },
  defaultCapacityType: eks.DefaultCapacityType.AUTOMODE,  // Optional, this is default
  compute: {
    nodePools: ['system', 'general-purpose'],  // Optional, these are default values
    nodeRole: customRole,  // Optional, custom IAM role for nodes
  }
});

Update Summary

  • EKS Auto Mode is the default mode for Cluster construct in V2. When enabled:
    • Automatically manages compute resources through node pools
    • Enables elastic load balancing in Kubernetes networking
    • Enables block storage configuration
    • Does not create defaultCapacity as a NODEGROUP (major difference from the aws-eks module)
  • Node pools are case-sensitive and must be "system" and/or "general-purpose"
  • Auto Mode can coexist with manually added node groups for hybrid deployments
  • Required IAM policies are automatically attached
  • Restores the outputConfigCommand support previously available in the aws-eks module
  • Integration test
  • Unit tests

Description of how you validated changes

Deploy the Auto Mode-enabled cluster using the code above, then create a test deployment and inspect the events:

% kubectl create deployment nginx --image=nginx
% kubectl get events --sort-by='.lastTimestamp'
20m         Normal    Nominated                 pod/nginx-5869d7778c-52pzg        Pod should schedule on: nodeclaim/general-purpose-87brc
20m         Normal    Launched                  nodeclaim/general-purpose-87brc   Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched
20m         Normal    DisruptionBlocked         nodeclaim/general-purpose-87brc   Nodeclaim does not have an associated node
19m         Normal    NodeHasSufficientPID      node/i-0322e9d8dd1b95a51          Node i-0322e9d8dd1b95a51 status is now: NodeHasSufficientPID
19m         Normal    NodeAllocatableEnforced   node/i-0322e9d8dd1b95a51          Updated Node Allocatable limit across pods
19m         Normal    NodeReady                 node/i-0322e9d8dd1b95a51          Node i-0322e9d8dd1b95a51 status is now: NodeReady
19m         Normal    Ready                     node/i-0322e9d8dd1b95a51          Status condition transitioned, Type: Ready, Status: False -> True, Reason: KubeletReady, Message: kubelet is posting ready status
19m         Normal    Synced                    node/i-0322e9d8dd1b95a51          Node synced successfully
19m         Normal    NodeHasNoDiskPressure     node/i-0322e9d8dd1b95a51          Node i-0322e9d8dd1b95a51 status is now: NodeHasNoDiskPressure
19m         Normal    NodeHasSufficientMemory   node/i-0322e9d8dd1b95a51          Node i-0322e9d8dd1b95a51 status is now: NodeHasSufficientMemory
19m         Warning   InvalidDiskCapacity       node/i-0322e9d8dd1b95a51          invalid capacity 0 on image filesystem
19m         Normal    Starting                  node/i-0322e9d8dd1b95a51          Starting kubelet.
19m         Normal    Registered                nodeclaim/general-purpose-87brc   Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered
19m         Normal    Ready                     nodeclaim/general-purpose-87brc   Status condition transitioned, Type: Ready, Status: Unknown -> True, Reason: Ready
19m         Normal    Initialized               nodeclaim/general-purpose-87brc   Status condition transitioned, Type: Initialized, Status: Unknown -> True, Reason: Initialized
19m         Normal    RegisteredNode            node/i-0322e9d8dd1b95a51          Node i-0322e9d8dd1b95a51 event: Registered Node i-0322e9d8dd1b95a51 in Controller
19m         Normal    DisruptionBlocked         node/i-0322e9d8dd1b95a51          Node is nominated for a pending pod
19m         Normal    Scheduled                 pod/nginx-5869d7778c-52pzg        Successfully assigned default/nginx-5869d7778c-52pzg to i-0322e9d8dd1b95a51
19m         Warning   FailedCreatePodSandBox    pod/nginx-5869d7778c-52pzg        Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9bd199c61bd9e93437b10a85af3ddc6965888e01bda96706e153b9e9852f67af": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused"
19m         Normal    Pulling                   pod/nginx-5869d7778c-52pzg        Pulling image "nginx"
19m         Normal    Pulled                    pod/nginx-5869d7778c-52pzg        Successfully pulled image "nginx" in 2.307s (2.307s including waiting). Image size: 72188133 bytes.
19m         Normal    Created                   pod/nginx-5869d7778c-52pzg        Created container: nginx
19m         Normal    Started                   pod/nginx-5869d7778c-52pzg        Started container nginx

Verify the nodes and pods:

% kubectl get no
NAME                  STATUS   ROLES    AGE   VERSION
i-0322e9d8dd1b95a51   Ready    <none>   21m   v1.32.0-eks-2e66e76
% kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5869d7778c-52pzg   1/1     Running   0          90m

Checklist

References

eksctl YAML experience

# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-auto-cluster
  region: us-west-2

autoModeConfig:
  # defaults to false
  enabled: true
  # optional, defaults to [general-purpose, system]
  # suggested to leave unspecified
  nodePools: []string
  # optional, eksctl creates a new role if this is not supplied
  # and nodePools are present
  nodeRoleARN: string

Terraform experience:

provider "aws" {
  region = "us-east-1"
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "eks-auto-mode-cluster"
  cluster_version = "1.31" # Auto Mode requires Kubernetes 1.29 or later

  vpc_id     = "<your-vpc-id>"
  subnet_ids = ["<subnet-id-1>", "<subnet-id-2>"]

  cluster_compute_config = {
    enabled    = true
    node_pools = ["general-purpose"] # Default pool for Auto Mode
  }

  bootstrap_self_managed_addons = false # must be false when Auto Mode is enabled
}

Pulumi experience

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Create EKS cluster with Auto Mode enabled
const cluster = new aws.eks.Cluster("example", {
    name: "example",
    version: "1.31",
    bootstrapSelfManagedAddons: false,  // Required: Must be false for Auto Mode
    computeConfig: {
        enabled: true,  // Enable Auto Mode compute
        nodePools: ["general-purpose"],
    },
    kubernetesNetworkConfig: {
        elasticLoadBalancing: {
            enabled: true,  // Required for Auto Mode
        },
    },
    storageConfig: {
        blockStorage: {
            enabled: true,  // Required for Auto Mode
        },
    },
});

Links


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@aws-cdk-automation aws-cdk-automation requested a review from a team February 11, 2025 05:20
@github-actions github-actions bot added the p2 label Feb 11, 2025
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Feb 11, 2025
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

The pull request linter fails with the following errors:

❌ The first word of the pull request title should not be capitalized. If the title starts with a CDK construct, it should be in backticks "``".

If you believe this pull request should receive an exemption, please comment and provide a justification. A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed, add Clarification Request to a comment.

codecov bot commented Feb 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.20%. Comparing base (5eeee75) to head (7e0215d).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #33373   +/-   ##
=======================================
  Coverage   82.20%   82.20%           
=======================================
  Files         119      119           
  Lines        6862     6862           
  Branches     1158     1158           
=======================================
  Hits         5641     5641           
  Misses       1118     1118           
  Partials      103      103           
Flag Coverage Δ
suite.unit 82.20% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
packages/aws-cdk ∅ <ø> (∅)
packages/aws-cdk-lib/core 82.20% <ø> (ø)

@Issacwww Issacwww left a comment

I like the idea of using a simple boolean flag to keep things simple while keeping the fields required by Auto Mode configurable. I will align the v1 implementation with this.

@pahud pahud marked this pull request as ready for review February 14, 2025 17:24
*
* @default - ['system', 'general-purpose']
*/
readonly nodePools?: string[];
Contributor Author

@pahud pahud Feb 14, 2025

At the moment, EKS only allows opting in to system and/or general-purpose. For future-proofing, I am making it a string[] with validation rather than an enum. This gives us the freedom to support more types in the future without breaking changes.

@pahud pahud marked this pull request as draft February 14, 2025 18:03
/**
* Configuration for compute settings in Auto Mode.
* When enabled, EKS will automatically manage compute resources.
* @default - Auto Mode compute disabled
Contributor

nit: the default value should be Auto Mode compute enabled, right?

this should respect autoMode flag

Contributor Author

I've removed autoMode prop in favor of eks.DefaultCapacityType.AUTOMODE

*
* @default - true
*/
readonly autoMode?: boolean;
Contributor

Do we need this flag? Customers can disable auto mode by setting defaultCapacity or other related properties.

Contributor Author

Good callout.

As we discussed offline, Auto Mode and nodegroups can coexist. I am thinking maybe we should make them not mutually exclusive. For example:

autoMode: true || false
defaultCapacity*: can be defined as well

Then we have all the use cases as below:

  1. autoMode on, no NG
  2. autoMode off with default NG defined
  3. autoMode on with default NG defined
  4. autoMode off with default NG undefined and then cluster.addNodegroupCapacity() to explicitly create one using the construct add* method.

Make sense?
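
The four combinations listed above can be sketched as a small pure function. This is only an illustration of the proposal in this comment: `CapacityOptions`, `CapacityPlan`, and `planCapacity` are hypothetical names that do not exist in the module, and the sketch assumes `defaultCapacity` is simply an optional node count.

```ts
// Hypothetical sketch of the four combinations; not the construct's code.
interface CapacityOptions {
  autoMode: boolean;
  defaultCapacity?: number; // size of the default nodegroup, if one is requested
}

type CapacityPlan =
  | 'auto-mode-only'
  | 'default-nodegroup-only'
  | 'auto-mode-plus-default-nodegroup'
  | 'no-compute-until-nodegroup-added';

function planCapacity(opts: CapacityOptions): CapacityPlan {
  const hasDefaultNg = opts.defaultCapacity !== undefined;
  if (opts.autoMode) {
    return hasDefaultNg ? 'auto-mode-plus-default-nodegroup' : 'auto-mode-only';
  }
  // With Auto Mode off and no default nodegroup, capacity has to be added
  // explicitly afterwards, e.g. through cluster.addNodegroupCapacity().
  return hasDefaultNg ? 'default-nodegroup-only' : 'no-compute-until-nodegroup-added';
}

console.log(planCapacity({ autoMode: true }));                      // case 1
console.log(planCapacity({ autoMode: false, defaultCapacity: 2 })); // case 2
console.log(planCapacity({ autoMode: true, defaultCapacity: 2 }));  // case 3
console.log(planCapacity({ autoMode: false }));                     // case 4
```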

Yes, I think this makes sense!

Contributor Author

I've removed autoMode prop in favor of eks.DefaultCapacityType.AUTOMODE

Comment on lines 1041 to 1058
```ts
// Create EKS cluster with Auto Mode implicitly disabled
const cluster = new eks.Cluster(this, 'EksAutoCluster', {
  version: eks.KubernetesVersion.V1_32,
  defaultCapacity: 2, // implicitly disables Auto Mode and opts in to a nodegroup
});
```

You can't opt in to both Auto Mode and a default nodegroup:

```ts
// Create EKS cluster requesting both Auto Mode and a default nodegroup
const cluster = new eks.Cluster(this, 'EksAutoCluster', {
  version: eks.KubernetesVersion.V1_32,
  autoMode: true,
  defaultCapacity: 2, // will throw an error
});
```

These two cases need to be updated, as nodegroups can be created on an Auto Mode cluster as well.

Contributor Author

DONE


### Using Auto Mode

By default, the Cluster construct enables EKS Auto mode.

Shall we specifically mention that this is the v2 behavior? v1 might differ.

Contributor Author

DONE

Comment on lines 1269 to 1273
// attach required managed policy for the cluster role in EKS Auto Mode
// see - https://docs.aws.amazon.com/eks/latest/userguide/auto-cluster-iam-role.html
['AmazonEKSComputePolicy', 'AmazonEKSBlockStoragePolicy', 'AmazonEKSLoadBalancingPolicy',
'AmazonEKSNetworkingPolicy', 'AmazonEKSClusterPolicy'].forEach((policyName) => {
this.role.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName(policyName));

Move this to the same place where sts:TagSession is added?

@Issacwww Issacwww Feb 19, 2025

AmazonEKSClusterPolicy already exists in the base role. Will adding it again throw an error, or is it just a no-op?

Contributor Author

no-op but yeah let's remove the dup

Contributor Author

DONE

managedPolicies: [
iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSWorkerNodePolicy'),
iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'),
iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKS_CNI_Policy'),

Contributor Author

DONE

Comment on lines 1214 to 1218
enabled: !autoModeEnabled ? false : true,
// If the computeConfig enabled flag is set to false when creating a cluster with Auto Mode,
// the request must not include values for the nodeRoleArn or nodePools fields.
nodePools: !autoModeEnabled ? undefined : props.compute?.nodePools ?? ['system', 'general-purpose'],
nodeRoleArn: !autoModeEnabled ? undefined : props.compute?.nodeRole?.roleArn ?? this.addNodePoolRole(`${id}nodePoolRole`).roleArn,
Contributor

If computeConfig is specified while autoMode is disabled, does it raise an error, emit a warning, or just ignore the property?

Contributor Author

Well, I think it's weird to have defaultCapacityType.NODEGROUP with computeConfig.nodePools or nodeRole defined. Let's throw and block them.

Contributor Author

DONE with more tests added.

});
```

### Hybrid Mode with Auto Mode and Node Groups

The term "Hybrid Mode" might confuse customers, as there is another feature called Hybrid Nodes.

nit: there are duplicate cases and some contradictory cases, pending clean up

Contributor Author

DONE. Let me know if there's still anything missing.

@pahud pahud marked this pull request as ready for review February 20, 2025 18:56
* When enabled, EKS will automatically manage networking resources.
* @default - Auto Mode networking disabled
*/
readonly kubernetesNetwork?: KubernetesNetworkConfig;
Contributor

IpFamily and ServiceIpv4Cidr are currently flattened properties. In the L1 resource, they are in KubernetesNetworkConfig. I think we should be consistent here: either move them into KubernetesNetworkConfig or make ElasticLoadBalancingConfig a direct property.

@@ -1055,6 +1125,49 @@ export class Cluster extends ClusterBase {
],
});

const autoModeEnabled = !(props.defaultCapacityType !== undefined && props.defaultCapacityType !== DefaultCapacityType.AUTOMODE);
Contributor

Suggested change
- const autoModeEnabled = !(props.defaultCapacityType !== undefined && props.defaultCapacityType !== DefaultCapacityType.AUTOMODE);
+ const autoModeEnabled = !props.defaultCapacityType || props.defaultCapacityType == DefaultCapacityType.AUTOMODE;

Contributor Author

Thanks, but I think this will be even better:

props.defaultCapacityType === undefined || props.defaultCapacityType == DefaultCapacityType.AUTOMODE
  1. Explicit Intent: It clearly communicates that we're specifically checking for undefined. The ! operator can be ambiguous as it will convert various falsy values (undefined, null, 0, "", false) to true.

  2. Type Safety: In TypeScript especially, explicit undefined checks are preferred as they make type inference more accurate. The compiler can better understand your intent.

  3. Prevents Bugs: Using ! could accidentally match other falsy values like null, 0, or empty string, which might not be what you want. The explicit check ensures you're only matching undefined.
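
To make the difference concrete, here is a minimal sketch comparing the three forms discussed in this thread. It assumes `DefaultCapacityType` is a numeric enum whose first member is 0; that is an assumption made for illustration, and the enum in the module may be declared differently. The function names are hypothetical.

```ts
// Assumption: a numeric enum with un-initialized members, so NODEGROUP is 0.
enum DefaultCapacityType {
  NODEGROUP, // 0
  EC2,       // 1
  AUTOMODE,  // 2
}

// Form from the original diff: "not (defined and different from AUTOMODE)".
function autoModeOriginal(t?: DefaultCapacityType): boolean {
  return !(t !== undefined && t !== DefaultCapacityType.AUTOMODE);
}

// Form using `!`: treats every falsy value, including 0, as "not set".
function autoModeLoose(t?: DefaultCapacityType): boolean {
  return !t || t === DefaultCapacityType.AUTOMODE;
}

// Form using an explicit undefined check.
function autoModeStrict(t?: DefaultCapacityType): boolean {
  return t === undefined || t === DefaultCapacityType.AUTOMODE;
}

console.log(autoModeOriginal(DefaultCapacityType.NODEGROUP)); // false
console.log(autoModeStrict(DefaultCapacityType.NODEGROUP));   // false
console.log(autoModeLoose(DefaultCapacityType.NODEGROUP));    // true, because 0 is falsy
```

Under that assumption, the explicit check is equivalent to the original expression, while the `!` form would report Auto Mode as enabled for NODEGROUP.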

const autoModeEnabled = !(props.defaultCapacityType !== undefined && props.defaultCapacityType !== DefaultCapacityType.AUTOMODE);

// Throw when using nodePools or nodeRole without using AUTOMODE
if (!autoModeEnabled && (props.compute?.nodePools || props.compute?.nodeRole)) {
Contributor

can this be simplified by checking if (!autoModeEnabled && props.compute)? The error message could be updated as well.

Contributor Author

@pahud pahud Feb 23, 2025

Makes sense! But there is a risk if the EKS team introduces a new compute config that can be used when Auto Mode is disabled. Would the suggested change be a problem in that case?
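
A rough sketch of that concern, using a made-up `futureSetting` field to stand in for a compute option that might someday be valid without Auto Mode. The interface and both function names are hypothetical and only approximate the checks being discussed.

```ts
// Hypothetical shape: `futureSetting` does not exist today.
interface ComputeConfig {
  nodePools?: string[];
  nodeRoleArn?: string;
  futureSetting?: string;
}

// Broad check: rejects any compute config when Auto Mode is off.
function rejectsBroad(autoModeEnabled: boolean, compute?: ComputeConfig): boolean {
  return !autoModeEnabled && compute !== undefined;
}

// Narrow check: rejects only the fields known to require Auto Mode.
function rejectsNarrow(autoModeEnabled: boolean, compute?: ComputeConfig): boolean {
  return !autoModeEnabled && Boolean(compute?.nodePools || compute?.nodeRoleArn);
}

const futureOnly: ComputeConfig = { futureSetting: 'value' };
console.log(rejectsBroad(false, futureOnly));  // true: the new field would be blocked
console.log(rejectsNarrow(false, futureOnly)); // false: the new field passes through
```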

},
storageConfig: {
blockStorage: {
enabled: !autoModeEnabled ? false : true,
Contributor

can blockStorage be disabled when auto mode is enabled?

Contributor Author

No, the three booleans must be either all enabled or all disabled. We learned this from CloudFormation errors.
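
One way to picture this all-or-nothing constraint is to derive the three flags from a single boolean, so they cannot drift apart. The sketch below is illustrative only; `AutoModeSettings`, `settingsFor`, and `isConsistent` are hypothetical names, not part of the construct.

```ts
// Hypothetical container for the three Auto Mode switches.
interface AutoModeSettings {
  computeEnabled: boolean;
  elasticLoadBalancingEnabled: boolean;
  blockStorageEnabled: boolean;
}

// Derive all three flags from one source of truth.
function settingsFor(autoModeEnabled: boolean): AutoModeSettings {
  return {
    computeEnabled: autoModeEnabled,
    elasticLoadBalancingEnabled: autoModeEnabled,
    blockStorageEnabled: autoModeEnabled,
  };
}

// True only when the three flags are all on or all off.
function isConsistent(s: AutoModeSettings): boolean {
  const flags = [s.computeEnabled, s.elasticLoadBalancingEnabled, s.blockStorageEnabled];
  return flags.every((f) => f === flags[0]);
}

console.log(isConsistent(settingsFor(true)));  // true
console.log(isConsistent(settingsFor(false))); // true
console.log(isConsistent({ ...settingsFor(true), blockStorageEnabled: false })); // false
```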

Comment on lines 1136 to 1147
if (autoModeEnabled && (props.defaultCapacity !== undefined || props.defaultCapacityInstance !== undefined)) {
throw new Error('Cannot specify defaultCapacity or defaultCapacityInstance when using Auto Mode. Auto Mode manages compute resources automatically.');
}

// Node pool values are case-sensitive and must be general-purpose and/or system
if (props.compute?.nodePools) {
const validNodePools = ['general-purpose', 'system'];
const invalidPools = props.compute.nodePools.filter(pool => !validNodePools.includes(pool));
if (invalidPools.length > 0) {
throw new Error(`Invalid node pool values: ${invalidPools.join(', ')}. Valid values are: ${validNodePools.join(', ')}`);
}
}
Contributor

We can move these to if (autoModeEnabled) below

Contributor Author

DONE

pahud and others added 3 commits February 23, 2025 11:28
@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 7e0215d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

4 participants