subnet destroy fails immediately after cluster destroy unless delay added #2779

Closed
ocofaigh opened this issue Jun 24, 2021 · 8 comments
Labels
service/Kubernetes Service, service/VPC Infrastructure

Comments

@ocofaigh
Contributor

When destroying a VPC Gen2 OpenShift cluster, the provider reports success before the environment has finished cleaning up. If a subnet is destroyed straight after the cluster, some of the cluster's network resources have not yet been removed, so the subnet destroy fails because network resources are still attached. The temporary workaround is to put a pause between the cluster and subnet steps on destroy (we needed a 10-minute wait in popular regions like us-south, but au-syd only needed ~2 minutes); see Important Factoids below for the time_sleep workaround we use.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform IBM Provider Version

$ terraform -v
Terraform v0.15.3
on darwin_amd64

Affected Resource(s)

  • ibm_container_vpc_cluster
  • ibm_is_subnet

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

##############################################################################
# Versions + Providers
##############################################################################

terraform {
  required_providers {
    ibm = {
      source  = "ibm-cloud/ibm"
      version = ">= 1.26.0"
    }
  }
  required_version = ">= 0.15"
}

provider "ibm" {
  ibmcloud_api_key = var.ibmcloud_api_key
}

##############################################################################
# Variables
##############################################################################

variable "ibmcloud_api_key" {
  type        = string
  description = "The IBM Cloud api token"
}

##############################################################################
# Locals
##############################################################################

locals {
  prefix = "test-base-oc-vpc-module"
  region = "us-south"
  cidr_bases = {
    private = "172.29.0.0/21",
    transit = "172.30.0.0/21",
    edge    = "172.31.0.0/21"
  }
  cidr_blocks = ["10.10.10.0/24", "10.10.11.0/24", "10.10.12.0/24"]
  default_pool = element([
    for pool in var.worker_pools :
    pool if pool.pool_name == "default"
  ], 0)
  other_pools = [
    for pool in var.worker_pools :
    pool if pool.pool_name != "default"
  ]
  kube_version     = "${var.ocp_version}_openshift"
  cos_name         = var.use_existing_cos == true || (var.use_existing_cos == false && var.cos_name != null) ? var.cos_name : "${var.cluster_name}_cos"
  cos_location     = "global"
  cos_plan         = "standard"
  storage_class    = "standard"
  cos_instance_crn = var.use_existing_cos != false ? var.existing_cos_id : ibm_resource_instance.cos_instance[0].id
  # Validation approach based on https://stackoverflow.com/a/66682419
  validate_condition = var.use_existing_cos == true && var.existing_cos_id == null
  validate_msg       = "A value for 'existing_cos_id' variable must be passed when 'use_existing_cos = true'"
  validate_check = regex(
    "^${local.validate_msg}$",
    (!local.validate_condition
      ? local.validate_msg
  : ""))

}

##############################################################################
# Resource Group
##############################################################################

resource "ibm_resource_group" "test_resource_group" {
  name     = "${local.prefix}-resource-group"
  quota_id = null
}

##############################################################################
# VPC
##############################################################################

resource "ibm_is_vpc" "test_vpc" {

  depends_on     = [ibm_resource_group.test_resource_group]
  name           = "${local.prefix}-vpc"
  resource_group = ibm_resource_group.test_resource_group.id
}

##############################################################################
# Address Prefix
##############################################################################

resource "ibm_is_vpc_address_prefix" "subnet_prefix" {
  depends_on = [ibm_is_vpc.test_vpc]
  count      = 3
  name       = "${keys(local.cidr_bases)[count.index]}-prefix-zone-${(count.index % 3) + 1}"
  zone       = "${local.region}-${(count.index % 3) + 1}"
  vpc        = ibm_is_vpc.test_vpc.id
  cidr       = element(local.cidr_blocks, count.index)
}

##############################################################################
# Subnets
##############################################################################

resource "ibm_is_subnet" "subnet" {
  depends_on      = [ibm_is_vpc_address_prefix.subnet_prefix]
  count           = 3
  name            = "${local.prefix}-${keys(local.cidr_bases)[count.index]}-subnet"
  vpc             = ibm_is_vpc.test_vpc.id
  resource_group  = ibm_resource_group.test_resource_group.id
  zone            = "${local.region}-${(count.index % 3) + 1}"
  ipv4_cidr_block = length(local.cidr_blocks) > 0 ? element(ibm_is_vpc_address_prefix.subnet_prefix.*.cidr, count.index) : null
}

##############################################################################
# COS
##############################################################################

resource "ibm_resource_instance" "cos_instance" {
  count = var.use_existing_cos ? 0 : 1

  name              = local.cos_name
  resource_group_id = var.resource_group_id
  service           = "cloud-object-storage"
  plan              = local.cos_plan
  location          = local.cos_location
}

##############################################################################
# Cluster
##############################################################################

resource "ibm_container_vpc_cluster" "cluster" {
  name                            = var.cluster_name
  vpc_id                          = var.vpc_id
  kube_version                    = local.kube_version
  flavor                          = local.default_pool.machine_type
  entitlement                     = var.ocp_entitlement
  cos_instance_crn                = local.cos_instance_crn
  worker_count                    = local.default_pool.workers_per_zone
  resource_group_id               = var.resource_group_id
  wait_till                       = var.cluster_ready_when
  force_delete_storage            = var.force_delete_storage
  disable_public_service_endpoint = var.disable_public_endpoint

  // default workers are mapped to the subnets that are "private"
  dynamic "zones" {
    for_each = [
      for subnet in data.ibm_is_subnets.all_subnets.subnets :
      subnet if length(regexall(".+-${local.default_pool.subnet_prefix}-.+", subnet.name)) > 0 && subnet.vpc == var.vpc_id
    ]
    content {
      subnet_id = zones.value.id
      name      = zones.value.zone
    }
  }
}

##############################################################################
# Worker Pools
##############################################################################

resource "ibm_container_vpc_worker_pool" "pool" {
  for_each          = { for pool in local.other_pools : pool.pool_name => pool }
  vpc_id            = var.vpc_id
  resource_group_id = var.resource_group_id
  cluster           = ibm_container_vpc_cluster.cluster.id
  worker_pool_name  = each.value.pool_name
  flavor            = each.value.machine_type
  worker_count      = each.value.workers_per_zone

  dynamic "zones" {
    for_each = [
      for subnet in data.ibm_is_subnets.all_subnets.subnets :
      subnet if length(regexall(".+-${each.value.subnet_prefix}-.+", subnet.name)) > 0 && subnet.vpc == var.vpc_id
    ]
    content {
      subnet_id = zones.value.id
      name      = zones.value.zone
    }
  }
}

Debug Output

Panic Output

Expected Behavior

The destroy of the cluster should not report success while network resources created for the cluster are still attached to the subnets, since the subnets cannot be destroyed until those resources are gone.

Actual Behavior

Destroying the subnets that the cluster had been using failed because they were still attached to resources on the back end.

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

Important Factoids

Workaround we are using:

resource "time_sleep" "wait_600_seconds" {
  depends_on = [ibm_is_subnet.subnet]

  destroy_duration = "600s"
}
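For the pause to land between the cluster destroy and the subnet destroys, the cluster has to sit above the time_sleep resource in the dependency graph: on destroy, Terraform then removes the cluster, waits out destroy_duration, and only afterwards removes the subnets. Below is a minimal, self-contained sketch of that ordering using null_resource stand-ins for the real cluster and subnet resources (the stand-ins and the depends_on wiring are illustrative assumptions, not taken verbatim from our module):

# Sketch of the destroy-ordering pattern. null and time are the
# hashicorp/null and hashicorp/time providers and resolve automatically
# on terraform init.

resource "null_resource" "subnet" {
  # stand-in for ibm_is_subnet.subnet
}

resource "time_sleep" "wait_600_seconds" {
  depends_on = [null_resource.subnet]

  destroy_duration = "600s"
}

resource "null_resource" "cluster" {
  # stand-in for ibm_container_vpc_cluster.cluster
  depends_on = [time_sleep.wait_600_seconds]
}

With this shape, terraform destroy tears down the "cluster" resource first, then blocks for 600 seconds in time_sleep.wait_600_seconds, and only then deletes the "subnet" resource, which gives the back end time to finish removing the cluster's network resources.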

References

  • #0000
@kavya498 added the service/Kubernetes Service and service/VPC Infrastructure labels on Jun 25, 2021
@kavya498
Collaborator

Can we get the error log of the subnet destroy?

@ocofaigh
Contributor Author

@kavya498 Here you go:

│ Error: Error Deleting Subnet : Cannot delete the subnet while it is in use by one or more network interfaces. Please delete the network interfaces or their associated servers and retry: [  instances:0727-d8b55863-1000-49f6-b9ea-569559e8ac77-wsxq2  ].
│ {
│     "StatusCode": 409,
│     "Headers": {
│         "Cache-Control": [
│             "max-age=0, no-cache, no-store, must-revalidate"
│         ],
│         "Cf-Cache-Status": [
│             "DYNAMIC"
│         ],
│         "Cf-Ray": [
│             "66772949ec455956-IAD"
│         ],
│         "Cf-Request-Id": [
│             "0afe58223200005956ef351000000001"
│         ],
│         "Content-Length": [
│             "435"
│         ],
│         "Content-Type": [
│             "application/json; charset=utf-8"
│         ],
│         "Date": [
│             "Wed, 30 Jun 2021 11:47:46 GMT"
│         ],
│         "Expect-Ct": [
│             "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\""
│         ],
│         "Expires": [
│             "-1"
│         ],
│         "Pragma": [
│             "no-cache"
│         ],
│         "Server": [
│             "cloudflare"
│         ],
│         "Strict-Transport-Security": [
│             "max-age=31536000; includeSubDomains"
│         ],
│         "Vary": [
│             "Accept-Encoding"
│         ],
│         "X-Content-Type-Options": [
│             "nosniff"
│         ],
│         "X-Request-Id": [
│             "cf980841-1169-4092-988c-8e054d90053d"
│         ],
│         "X-Xss-Protection": [
│             "1; mode=block"
│         ]
│     },
│     "Result": {
│         "errors": [
│             {
│                 "code": "subnet_in_use_network_interface_exists",
│                 "message": "Cannot delete the subnet while it is in use by one or more network interfaces. Please delete the network interfaces or their associated servers and retry: [  instances:0727-d8b55863-1000-49f6-b9ea-569559e8ac77-wsxq2  ].",
│                 "target": {
│                     "name": "id",
│                     "type": "parameter",
│                     "value": "0727-8e74166f-503e-4b47-96ac-fa54fc440b61"
│                 }
│             }
│         ],
│         "trace": "cf980841-1169-4092-988c-8e054d90053d"
│     },
│     "RawResult": null
│ }
│ 
│ 
│ 
╵

@astha-jain
Contributor

@deepaksibm FYI

@ocofaigh
Contributor Author

ocofaigh commented Jul 14, 2021

@kavya498 Has anyone looked into this? I think it happens when you create an OpenShift VPC Gen2 cluster and then destroy it and its subnets straight away. The VPC load balancer that is auto-created by the ingress running on the cluster is still being provisioned, so by the time the subnet destroy is attempted the cluster is gone but the load balancer is still in a creating state, and the subnet therefore cannot be deleted.

@deepaksibm
Contributor

Hi @ocofaigh, we are working on this issue and will roll out a possible fix soon. We will keep you posted.

@ocofaigh
Contributor Author

ocofaigh commented Aug 4, 2021

@deepaksibm I see #2895 was merged. Can you confirm which version of the IBM provider it is in?

@dnwe
Contributor

dnwe commented Aug 4, 2021

Also note that in the merged PR, the err returned from the new retry func (if it did not succeed and returned one) is never checked (https://github.com/IBM-Cloud/terraform-provider-ibm/pull/2895/files#diff-8aa3d2a6377c1a49177482cb1ef79891182de2ee27fc46b2e75baf097868bc28R623); it is ignored and then overwritten by the subsequent isWaitForSubnetDeleted call.

@kavya498
Collaborator

kavya498 commented Oct 6, 2021

Available in 1.30.0.
Closing this issue.
Thanks.
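Assuming the fix landed in 1.30.0 as stated above, consumers only need to raise the provider constraint in their configuration. A minimal sketch of that version bump, applied to the required_providers block from the reproduction:

terraform {
  required_providers {
    ibm = {
      source = "ibm-cloud/ibm"
      # 1.30.0 is the release reported above to contain the subnet-delete retry fix
      version = ">= 1.30.0"
    }
  }
  required_version = ">= 0.15"
}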
