Cluster creation fails with an error, security group and subnet for an instance belong to different networks #3399
Comments
/triage accepted
AFAIK this issue has never been observed in e2e tests without ClusterClass, so it may be triggered by or related to the inner workings of ClusterClass. If so, I will reduce the priority accordingly.
This is the first time I've seen the error, and hence I agree that it is ClusterClass-related.
This was such a fascinating and difficult issue to debug.
Observations
Debugging
For a failed instance creation, below is the input sent to the AWS API.
{
"instancesSet": {
"items": [
{
"imageId": "ami-093e132cf8ec45d77",
"minCount": 1,
"maxCount": 1,
"keyName": "cluster-api-provider-aws-sigs-k8s-io"
}
]
},
"groupSet": {
"items": [
{
"groupId": "sg-07c3eb751181ac0ab"
},
{
"groupId": "sg-05683bb88ffba846b"
},
{
"groupId": "sg-08f3c5c87413f9212"
}
]
},
"userData": "<sensitiveDataRemoved>",
"instanceType": "t3.large",
"blockDeviceMapping": {},
"monitoring": {
"enabled": false
},
"disableApiTermination": false,
"disableApiStop": false,
"clientToken": "96DAC283-22A0-4195-A496-78DAA918244B",
"iamInstanceProfile": {
"name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
},
"tagSpecificationSet": {
"items": [
{
"resourceType": "instance",
"tags": [
{
"key": "MachineName",
"value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
},
{
"key": "Name",
"value": "cluster-qmul89-control-plane-n9994-2lrrt"
},
{
"key": "kubernetes.io/cluster/cluster-qmul89",
"value": "owned"
},
{
"key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
"value": "owned"
},
{
"key": "sigs.k8s.io/cluster-api-provider-aws/role",
"value": "control-plane"
}
]
}
]
}
}
Compare it with the input for a successful case:
{
"instancesSet": {
"items": [
{
"imageId": "ami-093e132cf8ec45d77",
"minCount": 1,
"maxCount": 1,
"keyName": "cluster-api-provider-aws-sigs-k8s-io"
}
]
},
"groupSet": {
"items": [
{
"groupId": "sg-07c3eb751181ac0ab"
},
{
"groupId": "sg-05683bb88ffba846b"
},
{
"groupId": "sg-08f3c5c87413f9212"
}
]
},
"userData": "<sensitiveDataRemoved>",
"instanceType": "t3.large",
"blockDeviceMapping": {},
"monitoring": {
"enabled": false
},
"subnetId": "subnet-04069978047301fce",
"disableApiTermination": false,
"disableApiStop": false,
"clientToken": "0DD45959-4F4F-442C-9C8A-24D6B49239DA",
"iamInstanceProfile": {
"name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
},
"tagSpecificationSet": {
"items": [
{
"resourceType": "instance",
"tags": [
{
"key": "MachineName",
"value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
},
{
"key": "Name",
"value": "cluster-qmul89-control-plane-n9994-2lrrt"
},
{
"key": "kubernetes.io/cluster/cluster-qmul89",
"value": "owned"
},
{
"key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
"value": "owned"
},
{
"key": "sigs.k8s.io/cluster-api-provider-aws/role",
"value": "control-plane"
}
]
}
]
}
}
The difference is that the failed case doesn't have subnetId.
Root Cause Analysis
This happens because of an already known issue, "capi-controller-manager continuously patches AWSCluster object when using ClusterClass" (kubernetes-sigs/cluster-api#6320). The AWSCluster subnet spec oscillates between two states with ClusterClass.
State 1, subnets populated with IDs:
network:
  ...
  subnets:
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.0.0/24
    id: subnet-04069978047301fce
    isPublic: false
    routeTableId: rtb-06e5b16760a136a9b
    tags:
      Name: cluster-qmul89-subnet-private-us-west-1a
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/internal-elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: private
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.1.0/24
    id: subnet-057e208911a7100a9
    isPublic: true
    natGatewayId: nat-02b99bb47ed11bab0
    routeTableId: rtb-0c1181c7a47238747
    tags:
      Name: cluster-qmul89-subnet-public-us-west-1a
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: public
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.2.0/24
    id: subnet-0d987044191d6131a
    isPublic: false
    routeTableId: rtb-0c19e5639177973ae
    tags:
      Name: cluster-qmul89-subnet-private-us-west-1c
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/internal-elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: private
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.3.0/24
    id: subnet-006c42a116e38379a
    isPublic: true
    natGatewayId: nat-018176214822b0de8
    routeTableId: rtb-03e0196d18896750b
    tags:
      Name: cluster-qmul89-subnet-public-us-west-1c
      kubernetes.io/cluster/cluster-qmul89: shared
      kubernetes.io/role/elb: "1"
      sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
      sigs.k8s.io/cluster-api-provider-aws/role: public
State 2, subnets without IDs:
network:
  subnets:
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.0.0/24
  - availabilityZone: us-west-1a
    cidrBlock: 10.0.1.0/24
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.2.0/24
  - availabilityZone: us-west-1c
    cidrBlock: 10.0.3.0/24
Instance creation fails when the AWSCluster spec's subnets are in the second state, where the subnets are listed but have no IDs.
Fixes
While the long-term solution is to wait for the fix to kubernetes-sigs/cluster-api#6320, we can improve CAPA's subnet-finding logic, which assumes subnets have non-empty IDs (which has been the case until now).
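The kind of guard described under Fixes could look roughly like the sketch below. It is written in Python for brevity and is not CAPA's actual Go code; `SubnetSpec`, `SubnetNotReadyError`, and `pick_private_subnet_id` are hypothetical names introduced only for illustration. The idea is to refuse to build a request when no candidate subnet has an ID yet, so the caller can retry later instead of sending a request without subnetId.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubnetSpec:
    """Hypothetical, simplified view of one entry in network.subnets."""
    availability_zone: str
    cidr_block: str
    id: Optional[str] = None   # absent in the second (ID-less) state
    is_public: bool = False


class SubnetNotReadyError(Exception):
    """Raised when no usable subnet has an ID yet; the caller should retry."""


def pick_private_subnet_id(subnets: List[SubnetSpec], az: Optional[str] = None) -> str:
    """Return the ID of the first private subnet (optionally in a given AZ).

    Subnets with an empty or missing ID are never selected. If nothing is
    left, raise instead of returning an empty string.
    """
    candidates = [
        s for s in subnets
        if not s.is_public and (az is None or s.availability_zone == az)
    ]
    ready = [s for s in candidates if s.id]
    if not ready:
        raise SubnetNotReadyError(
            f"{len(candidates)} candidate subnet(s) found, none with an ID"
        )
    return ready[0].id


# The two states from the AWSCluster spec above, reduced to the relevant fields.
STATE_1 = [
    SubnetSpec("us-west-1a", "10.0.0.0/24", "subnet-04069978047301fce", False),
    SubnetSpec("us-west-1a", "10.0.1.0/24", "subnet-057e208911a7100a9", True),
    SubnetSpec("us-west-1c", "10.0.2.0/24", "subnet-0d987044191d6131a", False),
    SubnetSpec("us-west-1c", "10.0.3.0/24", "subnet-006c42a116e38379a", True),
]
STATE_2 = [
    SubnetSpec("us-west-1a", "10.0.0.0/24"),
    SubnetSpec("us-west-1a", "10.0.1.0/24"),
    SubnetSpec("us-west-1c", "10.0.2.0/24"),
    SubnetSpec("us-west-1c", "10.0.3.0/24"),
]
```

With `STATE_1` the function returns the subnet that appears in the successful request; with `STATE_2` it raises, which a controller could translate into a requeue rather than an instance launch.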
This should have been fixed with SSA support in CAPA.
/kind bug
What steps did you take and what happened:
Creating a cluster using a ClusterClass fails, and the log shows that instance creation failed with the following error:
failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks
AWSCluster
AWSMachine
While sg-0b2785eae128cccad belongs to a CAPA-created VPC, subnet-8b13d7d6 belongs to the default VPC in the region. Note that subnet-8b13d7d6 is not referenced in the AWSCluster or AWSMachine spec.
What did you expect to happen:
Cluster creation is successful.
Anything else you would like to add:
The same issue was reported by a coworker using a different ClusterClass. While he is using CAPA v1.2.0, I am using the main branch.
Also, this issue doesn't happen all the time. I've created clusters multiple times and saw the issue only a few times.
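The error is consistent with a launch request that carries no subnet at all: as far as I know, EC2 then falls back to a default subnet in the default VPC, while the security groups still belong to the cluster's VPC. A minimal sketch of a pre-flight check for that mismatch follows. The function name, the lookup dictionaries, and the VPC IDs are hypothetical and are not part of CAPA or the AWS SDK; a real check would have to look these values up from AWS.

```python
from typing import Dict, List, Optional


def find_network_mismatch(
    group_ids: List[str],
    subnet_id: Optional[str],
    sg_vpc: Dict[str, str],
    subnet_vpc: Dict[str, str],
    default_subnet_id: str,
) -> Optional[str]:
    """Return an error message if any security group lives in a different
    VPC than the subnet the instance would land in, else None.

    When subnet_id is missing, the default subnet is assumed to be used.
    """
    effective_subnet = subnet_id or default_subnet_id
    vpc = subnet_vpc[effective_subnet]
    for gid in group_ids:
        if sg_vpc[gid] != vpc:
            return (
                f"Security group {gid} and subnet {effective_subnet} "
                "belong to different networks"
            )
    return None


# Hypothetical VPC IDs; the security group and subnet IDs come from the report.
SG_VPC = {"sg-0b2785eae128cccad": "vpc-capa"}
SUBNET_VPC = {
    "subnet-8b13d7d6": "vpc-default",   # default-VPC subnet from the error
    "subnet-capa-private": "vpc-capa",  # hypothetical CAPA-created subnet
}
```

Calling it with `subnet_id=None` and `default_subnet_id="subnet-8b13d7d6"` reproduces the shape of the reported message, while passing the CAPA-created subnet returns `None`.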