Karpenter will occasionally provision nodes that are way too large #7254
Comments
For additional reference, here is an EC2 fleet request from when this problem occurred. I don't think it shows any new information, but I thought I would drop it here. It does show again that only a single instance type is included in the request, which I believe to be the core problem.
{
"eventVersion": "1.10",
"userIdentity": {
"type": "AssumedRole",
"principalId": "AROA47CRYUGV7O6HGTQFK:1729504785943908908",
"arn": "arn:aws:sts::891377197483:assumed-role/karpenter-20240405181041887100000008/1729504785943908908",
"accountId": "891377197483",
"accessKeyId": "ASIA47CRYUGVSFKCURM4",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "AROA47CRYUGV7O6HGTQFK",
"arn": "arn:aws:iam::891377197483:role/karpenter-20240405181041887100000008",
"accountId": "891377197483",
"userName": "karpenter-20240405181041887100000008"
},
"webIdFederationData": {
"federatedProvider": "arn:aws:iam::891377197483:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/83063DDB274B2A04B6A7DC29DCB1740E",
"attributes": {}
},
"attributes": {
"creationDate": "2024-10-21T09:59:45Z",
"mfaAuthenticated": "false"
}
}
},
"eventTime": "2024-10-21T10:20:05Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "CreateFleet",
"awsRegion": "us-east-2",
"sourceIPAddress": "18.218.214.155",
"userAgent": "aws-sdk-go/1.55.5 (go1.22.5; linux; arm64) karpenter.sh-1.0.1",
"requestParameters": {
"CreateFleetRequest": {
"TargetCapacitySpecification": {
"DefaultTargetCapacityType": "spot",
"TotalTargetCapacity": 1
},
"Type": "instant",
"SpotOptions": {
"AllocationStrategy": "price-capacity-optimized"
},
"LaunchTemplateConfigs": {
"LaunchTemplateSpecification": {
"LaunchTemplateName": "karpenter.k8s.aws/12914966771093031275",
"Version": "$Latest"
},
"Overrides": {
"ImageId": "ami-0ce6ab0ef12b0b54c",
"AvailabilityZone": "us-east-2b",
"tag": 1,
"SubnetId": "subnet-046ef1097dc37648a",
"InstanceType": "r7iz.metal-16xl"
},
"tag": 1
},
"TagSpecification": [
{
"ResourceType": "instance",
"tag": 1,
"Tag": [
{
"Value": "owned",
"tag": 1,
"Key": "kubernetes.io/cluster/production-primary"
},
{
"Value": "burstable-85b5e108",
"tag": 2,
"Key": "karpenter.sh/nodepool"
},
{
"Value": "production-primary",
"tag": 3,
"Key": "eks:eks-cluster-name"
},
{
"Value": "burstable-f500363d",
"tag": 4,
"Key": "karpenter.k8s.aws/ec2nodeclass"
}
]
},
{
"ResourceType": "volume",
"tag": 2,
"Tag": [
{
"Value": "owned",
"tag": 1,
"Key": "kubernetes.io/cluster/production-primary"
},
{
"Value": "burstable-85b5e108",
"tag": 2,
"Key": "karpenter.sh/nodepool"
},
{
"Value": "production-primary",
"tag": 3,
"Key": "eks:eks-cluster-name"
},
{
"Value": "burstable-f500363d",
"tag": 4,
"Key": "karpenter.k8s.aws/ec2nodeclass"
}
]
},
{
"ResourceType": "fleet",
"tag": 3,
"Tag": [
{
"Value": "burstable-f500363d",
"tag": 1,
"Key": "karpenter.k8s.aws/ec2nodeclass"
},
{
"Value": "owned",
"tag": 2,
"Key": "kubernetes.io/cluster/production-primary"
},
{
"Value": "burstable-85b5e108",
"tag": 3,
"Key": "karpenter.sh/nodepool"
},
{
"Value": "production-primary",
"tag": 4,
"Key": "eks:eks-cluster-name"
}
]
}
]
}
},
"responseElements": {
"CreateFleetResponse": {
"fleetInstanceSet": {
"item": {
"lifecycle": "spot",
"instanceIds": {
"item": "i-031bc83e78cd9423d"
},
"instanceType": "r7iz.metal-16xl",
"launchTemplateAndOverrides": {
"overrides": {
"subnetId": "subnet-046ef1097dc37648a",
"imageId": "ami-0ce6ab0ef12b0b54c",
"instanceType": "r7iz.metal-16xl",
"availabilityZone": "us-east-2b"
},
"launchTemplateSpecification": {
"launchTemplateId": "lt-0a86406a76d5b08be",
"version": 1
}
}
}
},
"xmlns": "http://ec2.amazonaws.com/doc/2016-11-15/",
"requestId": "6c26f26a-d723-4d6b-91ea-f996742ac34b",
"fleetId": "fleet-03bfdd35-440e-cc8f-a6b8-a9025e0ae254",
"errorSet": ""
}
},
"requestID": "6c26f26a-d723-4d6b-91ea-f996742ac34b",
"eventID": "7dd21712-ee0d-4266-ad55-bc2ffcf46f1d",
"readOnly": false,
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "891377197483",
"eventCategory": "Management",
"tlsDetails": {
"tlsVersion": "TLSv1.3",
"cipherSuite": "TLS_AES_128_GCM_SHA256",
"clientProvidedHostHeader": "ec2.us-east-2.amazonaws.com"
}
}
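To check other CreateFleet events for the same symptom, the distinct instance types offered in the request can be pulled out of the CloudTrail record. The following is a minimal sketch, not part of Karpenter: the helper names `as_list` and `fleet_instance_types` are hypothetical, and it assumes that requests with several launch template configs or overrides appear as JSON arrays where the event above shows a single bare object.

```python
def as_list(value):
    """Normalize a field that may be a bare object (as in the event above)
    or, by assumption, a list when there are several entries."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def fleet_instance_types(event):
    """Return the sorted distinct instance types offered in a CreateFleet event."""
    request = event["requestParameters"]["CreateFleetRequest"]
    types = set()
    for config in as_list(request.get("LaunchTemplateConfigs")):
        for override in as_list(config.get("Overrides")):
            if "InstanceType" in override:
                types.add(override["InstanceType"])
    return sorted(types)


# Trimmed-down copy of the request parameters from the event above.
event = {
    "requestParameters": {
        "CreateFleetRequest": {
            "LaunchTemplateConfigs": {
                "Overrides": {
                    "AvailabilityZone": "us-east-2b",
                    "InstanceType": "r7iz.metal-16xl",
                }
            }
        }
    }
}

types = fleet_instance_types(event)
if len(types) == 1:
    print("only one instance type offered:", types[0])
```

A request that offers only one type leaves EC2 no smaller alternative to choose from, which matches the behavior described in this issue.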
Immediately before the CreateFleet request above, this DescribeLaunchTemplates call failed:
{
"eventVersion": "1.10",
"userIdentity": {
"type": "AssumedRole",
"principalId": "AROA47CRYUGV7O6HGTQFK:1729504785943908908",
"arn": "arn:aws:sts::891377197483:assumed-role/karpenter-20240405181041887100000008/1729504785943908908",
"accountId": "891377197483",
"accessKeyId": "ASIA47CRYUGVSFKCURM4",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "AROA47CRYUGV7O6HGTQFK",
"arn": "arn:aws:iam::891377197483:role/karpenter-20240405181041887100000008",
"accountId": "891377197483",
"userName": "karpenter-20240405181041887100000008"
},
"webIdFederationData": {
"federatedProvider": "arn:aws:iam::891377197483:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/83063DDB274B2A04B6A7DC29DCB1740E",
"attributes": {}
},
"attributes": {
"creationDate": "2024-10-21T09:59:45Z",
"mfaAuthenticated": "false"
}
}
},
"eventTime": "2024-10-21T10:20:04Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "DescribeLaunchTemplates",
"awsRegion": "us-east-2",
"sourceIPAddress": "18.218.214.155",
"userAgent": "aws-sdk-go/1.55.5 (go1.22.5; linux; arm64) karpenter.sh-1.0.1",
"errorCode": "Client.InvalidLaunchTemplateName.NotFoundException",
"errorMessage": "At least one of the launch templates specified in the request does not exist.",
"requestParameters": {
"DescribeLaunchTemplatesRequest": {
"LaunchTemplateName": {
"tag": 1,
"content": "karpenter.k8s.aws/12914966771093031275"
}
}
},
"responseElements": null,
"requestID": "a02a278f-e6de-4c1b-bca7-738050942304",
"eventID": "7888c311-84c1-4829-85c9-7d5bf7c50398",
"readOnly": true,
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "891377197483",
"eventCategory": "Management",
"tlsDetails": {
"tlsVersion": "TLSv1.3",
"cipherSuite": "TLS_AES_128_GCM_SHA256",
"clientProvidedHostHeader": "ec2.us-east-2.amazonaws.com"
}
}
Found the problem in the core Karpenter library. Not specific to AWS. Moving the conversation there.
Description
Observed Behavior:
Occasionally, Karpenter will provision a node that is far, far above what is being requested.
For example, notice the provisioned node below is 10x larger than what is being requested. Moreover, the generated nodeclaim only has a single entry for instance-types. That is despite the NodePool (manifest below) having many instance types that would fit the scheduling request (which it normally does).
Expected Behavior:
When a set of pods is pending and needs a new node, the generated node claim includes all applicable instance-types and an appropriately sized node is created. This normally works correctly and generates logs as follows:
Reproduction Steps (Please include YAML):
It is unclear to me how to reproduce. I have tried all the obvious things and am not able to reliably re-trigger the behavior (it seems to occur somewhat randomly):
I have also verified that the pods do not have any scheduling constraints that would limit them to a single instance type.
In fact, which particular type is chosen for instance-types seems somewhat random. Sometimes it is appropriately sized, sometimes it is 10x too large, sometimes it is 100x too large. The instance families also differ. However, what is consistent is that the node claim is (a) created by the provisioner controller and (b) generated with just a single type rather than the full expected set.
After the node is created, Karpenter will usually disrupt it shortly afterwards and replace it with a smaller node. However, PDBs have sometimes prevented this, which is how we noticed that this behavior was occurring.
Additionally, all of the NodePools where we have observed this behavior allow spot instances, but I do not know if that is relevant (all of our NodePools are spot-enabled).
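One way to spot affected node claims after the fact is to scan them for an instance-type requirement that lists only one value. This is a rough sketch rather than an official tool: the function name `single_type_nodeclaims` and the sample NodeClaim names are hypothetical, and it assumes the input has the shape of `kubectl get nodeclaims -o json` output with the instance types under the `node.kubernetes.io/instance-type` requirement key.

```python
INSTANCE_TYPE_KEY = "node.kubernetes.io/instance-type"


def single_type_nodeclaims(nodeclaim_list):
    """Return (name, instance_type) pairs for NodeClaims whose instance-type
    requirement allows exactly one value."""
    flagged = []
    for item in nodeclaim_list.get("items", []):
        for req in item.get("spec", {}).get("requirements", []):
            if (
                req.get("key") == INSTANCE_TYPE_KEY
                and req.get("operator") == "In"
                and len(req.get("values", [])) == 1
            ):
                flagged.append((item["metadata"]["name"], req["values"][0]))
    return flagged


# Hypothetical sample shaped like `kubectl get nodeclaims -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "burstable-aaaaa"},
            "spec": {"requirements": [
                {"key": INSTANCE_TYPE_KEY, "operator": "In",
                 "values": ["r7iz.metal-16xl"]},
            ]},
        },
        {
            "metadata": {"name": "burstable-bbbbb"},
            "spec": {"requirements": [
                {"key": INSTANCE_TYPE_KEY, "operator": "In",
                 "values": ["t3.medium", "t3a.medium", "t3.large"]},
            ]},
        },
    ]
}

for name, instance_type in single_type_nodeclaims(sample):
    print(f"{name}: restricted to {instance_type}")
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the kubectl output piped in. A single-value requirement is not wrong in itself (a pod may legitimately pin one type), so hits still need to be compared against the pods' scheduling constraints.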
Versions:
Karpenter version: 1.0.1
Kubernetes version (kubectl version): v1.29.8-eks-a737599