subnet destroy fails immediately after cluster destroy unless delay added #2779

Closed
ocofaigh opened this issue Jun 24, 2021 · 8 comments
Labels
service/Kubernetes Service, service/VPC Infrastructure

Comments

@ocofaigh
Contributor

When destroying a VPC Gen2 OpenShift cluster, the provider reports success before the environment has finished cleaning up. If a subnet is destroyed straight after the cluster, some of the cluster's network resources have not yet been removed, so the subnet destroy fails because network resources are still attached. The temporary workaround is to put a pause between the cluster and subnet steps on destroy (we needed a 10-minute wait in popular regions like us-south, but au-syd only needed ~2 minutes); see Important Factoids below for the time_sleep workaround we use.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform IBM Provider Version

$ terraform -v
Terraform v0.15.3
on darwin_amd64

Affected Resource(s)

  • ibm_container_vpc_cluster
  • ibm_is_subnet

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

##############################################################################
# Versions + Providers
##############################################################################

terraform {
  required_providers {
    ibm = {
      source  = "ibm-cloud/ibm"
      version = ">= 1.26.0"
    }
  }
  required_version = ">= 0.15"
}

provider "ibm" {
  ibmcloud_api_key = var.ibmcloud_api_key
}

##############################################################################
# Variables
##############################################################################

variable "ibmcloud_api_key" {
  type        = string
  description = "The IBM Cloud api token"
}

##############################################################################
# Locals
##############################################################################

locals {
  prefix = "test-base-oc-vpc-module"
  region = "us-south"
  cidr_bases = {
    private = "172.29.0.0/21",
    transit = "172.30.0.0/21",
    edge    = "172.31.0.0/21"
  }
  cidr_blocks = ["10.10.10.0/24", "10.10.11.0/24", "10.10.12.0/24"]
  default_pool = element([
    for pool in var.worker_pools :
    pool if pool.pool_name == "default"
  ], 0)
  other_pools = [
    for pool in var.worker_pools :
    pool if pool.pool_name != "default"
  ]
  kube_version     = "${var.ocp_version}_openshift"
  cos_name         = var.use_existing_cos == true || (var.use_existing_cos == false && var.cos_name != null) ? var.cos_name : "${var.cluster_name}_cos"
  cos_location     = "global"
  cos_plan         = "standard"
  storage_class    = "standard"
  cos_instance_crn = var.use_existing_cos != false ? var.existing_cos_id : ibm_resource_instance.cos_instance[0].id
  # Validation approach based on https://stackoverflow.com/a/66682419
  validate_condition = var.use_existing_cos == true && var.existing_cos_id == null
  validate_msg       = "A value for 'existing_cos_id' variable must be passed when 'use_existing_cos = true'"
  validate_check = regex(
    "^${local.validate_msg}$",
    (!local.validate_condition
      ? local.validate_msg
  : ""))

}

##############################################################################
# Resource Group
##############################################################################

resource "ibm_resource_group" "test_resource_group" {
  name     = "${local.prefix}-resource-group"
  quota_id = null
}

##############################################################################
# VPC
##############################################################################

resource "ibm_is_vpc" "test_vpc" {

  depends_on     = [ibm_resource_group.test_resource_group]
  name           = "${local.prefix}-vpc"
  resource_group = ibm_resource_group.test_resource_group.id
}

##############################################################################
# Address Prefix
##############################################################################

resource "ibm_is_vpc_address_prefix" "subnet_prefix" {
  depends_on = [ibm_is_vpc.test_vpc]
  count      = 3
  name       = "${keys(local.cidr_bases)[count.index]}-prefix-zone-${(count.index % 3) + 1}"
  zone       = "${local.region}-${(count.index % 3) + 1}"
  vpc        = ibm_is_vpc.test_vpc.id
  cidr       = element(local.cidr_blocks, count.index)
}

##############################################################################
# Subnets
##############################################################################

resource "ibm_is_subnet" "subnet" {
  depends_on      = [ibm_is_vpc_address_prefix.subnet_prefix]
  count           = 3
  name            = "${local.prefix}-${keys(local.cidr_bases)[count.index]}-subnet"
  vpc             = ibm_is_vpc.test_vpc.id
  resource_group  = ibm_resource_group.test_resource_group.id
  zone            = "${local.region}-${(count.index % 3) + 1}"
  ipv4_cidr_block = length(local.cidr_blocks) > 0 ? element(ibm_is_vpc_address_prefix.subnet_prefix.*.cidr, count.index) : null
}

##############################################################################
# COS
##############################################################################

resource "ibm_resource_instance" "cos_instance" {
  count = var.use_existing_cos ? 0 : 1

  name              = local.cos_name
  resource_group_id = var.resource_group_id
  service           = "cloud-object-storage"
  plan              = local.cos_plan
  location          = local.cos_location
}

##############################################################################
# Cluster
##############################################################################

resource "ibm_container_vpc_cluster" "cluster" {
  name                            = var.cluster_name
  vpc_id                          = var.vpc_id
  kube_version                    = local.kube_version
  flavor                          = local.default_pool.machine_type
  entitlement                     = var.ocp_entitlement
  cos_instance_crn                = local.cos_instance_crn
  worker_count                    = local.default_pool.workers_per_zone
  resource_group_id               = var.resource_group_id
  wait_till                       = var.cluster_ready_when
  force_delete_storage            = var.force_delete_storage
  disable_public_service_endpoint = var.disable_public_endpoint

  // default workers are mapped to the subnets that are "private"
  dynamic "zones" {
    for_each = [
      for subnet in data.ibm_is_subnets.all_subnets.subnets :
      subnet if length(regexall(".+-${local.default_pool.subnet_prefix}-.+", subnet.name)) > 0 && subnet.vpc == var.vpc_id
    ]
    content {
      subnet_id = zones.value.id
      name      = zones.value.zone
    }
  }
}

##############################################################################
# Worker Pools
##############################################################################

resource "ibm_container_vpc_worker_pool" "pool" {
  for_each          = { for pool in local.other_pools : pool.pool_name => pool }
  vpc_id            = var.vpc_id
  resource_group_id = var.resource_group_id
  cluster           = ibm_container_vpc_cluster.cluster.id
  worker_pool_name  = each.value.pool_name
  flavor            = each.value.machine_type
  worker_count      = each.value.workers_per_zone

  dynamic "zones" {
    for_each = [
      for subnet in data.ibm_is_subnets.all_subnets.subnets :
      subnet if length(regexall(".+-${each.value.subnet_prefix}-.+", subnet.name)) > 0 && subnet.vpc == var.vpc_id
    ]
    content {
      subnet_id = zones.value.id
      name      = zones.value.zone
    }
  }
}

Debug Output

Panic Output

Expected Behavior

The destroy of the cluster should not report success while network resources created for the cluster are still attached to the subnets, since the subnets cannot be destroyed until those resources are gone.

Actual Behavior

Destroying the subnets that the cluster had been using failed because they were still attached to resources on the back end.

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

Important Factoids

Workaround we are using:

resource "time_sleep" "wait_600_seconds" {
  depends_on = [ibm_is_subnet.subnet]

  destroy_duration = "600s"
}
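For the pause to land between the cluster destroy and the subnet destroys, the cluster has to sit above the time_sleep resource in the dependency graph: on destroy, Terraform then removes the cluster, waits out destroy_duration, and only afterwards removes the subnets. Below is a minimal, self-contained sketch of that ordering using null_resource stand-ins for the real cluster and subnet resources (the stand-ins and the depends_on wiring are illustrative assumptions, not taken verbatim from our module):

# Sketch of the destroy-ordering pattern. null and time are the
# hashicorp/null and hashicorp/time providers and resolve automatically
# on terraform init.

resource "null_resource" "subnet" {
  # stand-in for ibm_is_subnet.subnet
}

resource "time_sleep" "wait_600_seconds" {
  depends_on = [null_resource.subnet]

  destroy_duration = "600s"
}

resource "null_resource" "cluster" {
  # stand-in for ibm_container_vpc_cluster.cluster
  depends_on = [time_sleep.wait_600_seconds]
}

With this shape, terraform destroy tears down the "cluster" resource first, then blocks for 600 seconds in time_sleep.wait_600_seconds, and only then deletes the "subnet" resource, which gives the back end time to finish removing the cluster's network resources.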

References

  • #0000
@kavya498 added the service/Kubernetes Service and service/VPC Infrastructure labels on Jun 25, 2021
@kavya498
Collaborator

Can we get the error log of the subnet destroy?

@ocofaigh
Contributor Author

@kavya498 Here you go:

│ Error: Error Deleting Subnet : Cannot delete the subnet while it is in use by one or more network interfaces. Please delete the network interfaces or their associated servers and retry: [  instances:0727-d8b55863-1000-49f6-b9ea-569559e8ac77-wsxq2  ].
│ {
│     "StatusCode": 409,
│     "Headers": {
│         "Cache-Control": [
│             "max-age=0, no-cache, no-store, must-revalidate"
│         ],
│         "Cf-Cache-Status": [
│             "DYNAMIC"
│         ],
│         "Cf-Ray": [
│             "66772949ec455956-IAD"
│         ],
│         "Cf-Request-Id": [
│             "0afe58223200005956ef351000000001"
│         ],
│         "Content-Length": [
│             "435"
│         ],
│         "Content-Type": [
│             "application/json; charset=utf-8"
│         ],
│         "Date": [
│             "Wed, 30 Jun 2021 11:47:46 GMT"
│         ],
│         "Expect-Ct": [
│             "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\""
│         ],
│         "Expires": [
│             "-1"
│         ],
│         "Pragma": [
│             "no-cache"
│         ],
│         "Server": [
│             "cloudflare"
│         ],
│         "Strict-Transport-Security": [
│             "max-age=31536000; includeSubDomains"
│         ],
│         "Vary": [
│             "Accept-Encoding"
│         ],
│         "X-Content-Type-Options": [
│             "nosniff"
│         ],
│         "X-Request-Id": [
│             "cf980841-1169-4092-988c-8e054d90053d"
│         ],
│         "X-Xss-Protection": [
│             "1; mode=block"
│         ]
│     },
│     "Result": {
│         "errors": [
│             {
│                 "code": "subnet_in_use_network_interface_exists",
│                 "message": "Cannot delete the subnet while it is in use by one or more network interfaces. Please delete the network interfaces or their associated servers and retry: [  instances:0727-d8b55863-1000-49f6-b9ea-569559e8ac77-wsxq2  ].",
│                 "target": {
│                     "name": "id",
│                     "type": "parameter",
│                     "value": "0727-8e74166f-503e-4b47-96ac-fa54fc440b61"
│                 }
│             }
│         ],
│         "trace": "cf980841-1169-4092-988c-8e054d90053d"
│     },
│     "RawResult": null
│ }
│ 
│ 
│ 
╵

@astha-jain
Contributor

@deepaksibm FYI

@ocofaigh
Contributor Author

ocofaigh commented Jul 14, 2021

@kavya498 Has anyone looked into this? I think it happens when you create an OpenShift VPC Gen2 cluster and then destroy it and its subnets straight away. The VPC load balancer that is auto-created by the ingress running on the cluster is still being provisioned, so by the time the subnet destroy is attempted the cluster is gone but the load balancer is still in a creating state, and the subnet therefore cannot be deleted.

@deepaksibm
Contributor

Hi @ocofaigh, we are working on this issue and will roll out a possible fix soon. We will keep you posted.

@ocofaigh
Contributor Author

ocofaigh commented Aug 4, 2021

@deepaksibm I see #2895 was merged. Can you confirm which version of the IBM provider it is in?

@dnwe
Contributor

dnwe commented Aug 4, 2021

Also note that in the merged PR, the err returned from the new retry func (if it did not succeed and returned one) is never checked (https://github.com/IBM-Cloud/terraform-provider-ibm/pull/2895/files#diff-8aa3d2a6377c1a49177482cb1ef79891182de2ee27fc46b2e75baf097868bc28R623); it is ignored and then overwritten by the subsequent isWaitForSubnetDeleted call.

@kavya498
Collaborator

kavya498 commented Oct 6, 2021

Available in 1.30.0.
Closing this issue.
Thanks.
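Assuming the fix landed in 1.30.0 as stated above, consumers only need to raise the provider constraint in their configuration. A minimal sketch of that version bump, applied to the required_providers block from the reproduction:

terraform {
  required_providers {
    ibm = {
      source = "ibm-cloud/ibm"
      # 1.30.0 is the release reported above to contain the subnet-delete retry fix
      version = ">= 1.30.0"
    }
  }
  required_version = ">= 0.15"
}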
