Cluster creation fails with an error, security group and subnet for an instance belong to different networks #3399

Closed
Tracked by #3530
pydctw opened this issue Apr 8, 2022 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@pydctw
Contributor

pydctw commented Apr 8, 2022

/kind bug

What steps did you take and what happened:
Creating a cluster from a ClusterClass fails. The log shows that instance creation failed with the error, failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks

E0408 17:14:02.788518       1 awsmachine_controller.go:497]  "msg"="unable to create instance" "error"="failed to create AWSMachine instance: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks.\n\tstatus code: 400, request id: 3b28054f-c5e8-439c-bac3-0dda24431a27" 

AWSCluster

spec:
  network:
    subnets:
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.0.0/24
      id: subnet-0176a425f63781f71
      isPublic: false
      routeTableId: rtb-08275750c99fb2f3a
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.1.0/24
      id: subnet-0464f24a3d364523f
      isPublic: true
      natGatewayId: nat-0d185215367997610
      routeTableId: rtb-051ced5fd65ae6600
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.2.0/24
      id: subnet-07023ae2d872062bf
      isPublic: false
      routeTableId: rtb-001434b0c17f5b0f4
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.3.0/24
      id: subnet-00b92bd396eef0bf2
      isPublic: true
      natGatewayId: nat-0baa238a24de3b142
      routeTableId: rtb-0e3643d5f8b441ed9
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public

AWSMachine

spec:
  ami: {}
  cloudInit:
    secureSecretsBackend: secrets-manager
  iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
  instanceID: i-0ed41df7645f74b06
  instanceType: t3.large
  providerID: aws:///us-west-2b/i-0ed41df7645f74b06
  sshKeyName: cluster-api-provider-aws-sigs-k8s-io

sg-0b2785eae128cccad belongs to a CAPA-created VPC, whereas subnet-8b13d7d6 belongs to the default VPC in the region. Note that subnet-8b13d7d6 is not referenced in either the AWSCluster or the AWSMachine spec.
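To check which VPC each resource belongs to, the two IDs first have to be pulled out of the error text. The following is a minimal sketch of that step; extractIDs and idPattern are hypothetical helpers written for this illustration and are not part of CAPA.

```go
package main

import (
	"fmt"
	"regexp"
)

// idPattern matches the security group and subnet IDs in the EC2 error text
// quoted in this issue. The exact wording is assumed to stay as shown above.
var idPattern = regexp.MustCompile(
	`Security group (sg-[0-9a-f]+) and subnet (subnet-[0-9a-f]+) belong to different networks`)

// extractIDs is a hypothetical helper: it returns the security group ID and
// the subnet ID found in msg, or ok=false when the message does not match.
func extractIDs(msg string) (sg, subnet string, ok bool) {
	m := idPattern.FindStringSubmatch(msg)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	msg := "failed to run instance: InvalidParameter: Security group " +
		"sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks."
	sg, subnet, ok := extractIDs(msg)
	fmt.Println(sg, subnet, ok)
}
```

The extracted IDs can then be passed to aws ec2 describe-security-groups --group-ids and aws ec2 describe-subnets --subnet-ids, whose output includes the VpcId of each resource.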

What did you expect to happen:
Cluster creation is successful.

Anything else you would like to add:
The same issue was reported by a coworker using a different ClusterClass. He is using CAPA v1.2.0, while I am using the main branch.

Also, the issue is intermittent. I have created clusters many times and seen it only a few times.

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 8, 2022
@sedefsavas
Contributor

/triage accepted
/priority critical-urgent
/milestone v1.5.1

@k8s-ci-robot
Contributor

@sedefsavas: The provided milestone is not valid for this repository. Milestones in this repository: [Backlog, V1.5.1, v0.6.10, v0.7.4, v1.5.0, v1.x, v2.x]

Use /milestone clear to clear the milestone.


@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Apr 8, 2022
@sedefsavas sedefsavas added this to the V1.5.1 milestone Apr 8, 2022
@sedefsavas
Contributor

sedefsavas commented Apr 8, 2022

AFAIK this issue has never been observed in e2e tests without ClusterClass, so it may be triggered by, or related to, the inner workings of ClusterClass.

If so, I will reduce the priority accordingly.

@pydctw
Contributor Author

pydctw commented Apr 11, 2022

This is the first time I've seen the error and hence agree that it is ClusterClass related.

@pydctw
Contributor Author

pydctw commented Apr 13, 2022

This was such a fascinating and difficult issue to debug.

Observations

  • The issue happens randomly. Instance creation can fail for the 1st, 2nd, or 3rd control plane (CP) machine.
  • Cluster creation succeeds most of the time and fails with this error only occasionally.
  • A cluster that failed an e2e test because of a timeout while waiting for the control plane eventually created the instance and became ready.

Debugging

For a failed instance creation, below is the input sent to the AWS API.

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "96DAC283-22A0-4195-A496-78DAA918244B",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

Compare it with the input for a successful case:

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "subnetId": "subnet-04069978047301fce",
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "0DD45959-4F4F-442C-9C8A-24D6B49239DA",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

The difference is that the failed request has no subnetId, which leaves AWS to choose a subnet for the instance itself, in this case a subnet in the default VPC.
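The comparison of the two request payloads can be automated. The sketch below lists the top-level keys that appear in a successful request but not in a failed one; missingKeys is a hypothetical helper for this illustration, and the payloads in main are shortened versions of the ones captured above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// missingKeys returns, in sorted order, the top-level JSON keys that are
// present in reference but absent from candidate.
func missingKeys(candidate, reference []byte) ([]string, error) {
	var c, r map[string]json.RawMessage
	if err := json.Unmarshal(candidate, &c); err != nil {
		return nil, fmt.Errorf("parse candidate: %w", err)
	}
	if err := json.Unmarshal(reference, &r); err != nil {
		return nil, fmt.Errorf("parse reference: %w", err)
	}
	var missing []string
	for k := range r {
		if _, ok := c[k]; !ok {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing, nil
}

func main() {
	// Shortened payloads: only a few of the captured fields are kept.
	failed := []byte(`{"instanceType": "t3.large", "monitoring": {"enabled": false}}`)
	succeeded := []byte(`{"instanceType": "t3.large", "monitoring": {"enabled": false},
		"subnetId": "subnet-04069978047301fce"}`)

	missing, err := missingKeys(failed, succeeded)
	if err != nil {
		panic(err)
	}
	fmt.Println(missing) // prints [subnetId]
}
```

Fields that legitimately differ between requests, such as clientToken, are present in both payloads and therefore do not show up in the result.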

Root Cause Analysis

This happens because of an already known issue, capi-controller-manager continuously patches AWSCluster object when using ClusterClass (kubernetes-sigs/cluster-api#6320).

With ClusterClass, the AWSCluster subnet spec oscillates between two states.

  • After CAPA patched
  network:
    ...
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
      id: subnet-04069978047301fce
      isPublic: false
      routeTableId: rtb-06e5b16760a136a9b
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
      id: subnet-057e208911a7100a9
      isPublic: true
      natGatewayId: nat-02b99bb47ed11bab0
      routeTableId: rtb-0c1181c7a47238747
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
      id: subnet-0d987044191d6131a
      isPublic: false
      routeTableId: rtb-0c19e5639177973ae
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24
      id: subnet-006c42a116e38379a
      isPublic: true
      natGatewayId: nat-018176214822b0de8
      routeTableId: rtb-03e0196d18896750b
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
  • After CAPI patched
  network:
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24

Instance creation fails when the AWSCluster spec's subnets are in the 2nd state, where the subnets are listed but have no IDs.
The subnet ID is therefore empty here: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/ec2/instances.go#L340

Fixes

While the long-term solution is to wait for the fix to kubernetes-sigs/cluster-api#6320, in the meantime we can improve CAPA's subnet-finding logic, which assumes that subnets have non-empty IDs (which had always been the case until now).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@pydctw
Contributor Author

pydctw commented Jul 12, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 10, 2022
@pydctw
Contributor Author

pydctw commented Nov 10, 2022

This should have been fixed with SSA support in CAPA.

@pydctw pydctw closed this as completed Nov 10, 2022

4 participants