
Destroying Security Groups Takes Forever with Attached SG #265

Closed
hashibot opened this issue Jun 13, 2017 · 43 comments · Fixed by #30114
Assignees
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. upstream Addresses functionality related to the cloud provider.
Milestone

Comments

@hashibot

This issue was originally opened by @kojiromike as hashicorp/terraform#8772. It was migrated here as part of the provider split. The original body of the issue is below.


Hi there,

Thank you for opening an issue. Please note that we try to keep the Terraform issue tracker reserved for bug reports and feature requests. For general usage questions, please see: https://www.terraform.io/community.html.

Terraform Version

Terraform v0.7.3

Affected Resource(s)

Please list the resources as a list, for example:

  • aws_security_group

Expected Behavior

Terraform should always update aws_instance security group attachments before trying to destroy an existing security group.

Actual Behavior

Terraform tried to destroy security groups that were attached to an instance. The apply ran for 5 minutes before timing out. Then, Terraform attempted to continue applying by creating new security groups, which failed because the old security groups still existed.

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

  1. Create a security group and attach it to an instance. Plan and apply.
  2. Change something so that sg gets recreated. (Taint that security group, perhaps?)
  3. Plan and apply.
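For reference, the steps above can be sketched with a minimal configuration (resource names and the AMI ID are illustrative placeholders, not from the original report):

```hcl
# Minimal repro sketch: a security group attached to an instance.
resource "aws_security_group" "repro" {
  name   = "repro-sg"
  vpc_id = aws_vpc.main.id
}

resource "aws_instance" "repro" {
  ami                    = "ami-00000000" # placeholder
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.repro.id]
}
```

After the first apply, running terraform taint aws_security_group.repro (or renaming the group) and applying again should reproduce the hang: AWS refuses to delete a group that is still attached to the instance's ENI.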
@hashibot hashibot added the bug Addresses a defect in current functionality. label Jun 13, 2017
@camerondavison

I think this is probably an ordering thing? Maybe if the SG is tainted and needs to be destroyed, then it needs to taint anything that is using it. AWS will not allow you to destroy a security group that has instances associated with it.

@geoffreywiseman

Is there any way it can detect this condition? I had this issue with a security group I had imported; I hadn't put the instance attachment into Terraform yet, so Terraform hadn't been told the security group was attached. I wanted to rename the group, and it basically started tearing down the group and got stuck. Rather than getting stuck, I would prefer Terraform to simply fail when it plans to tear down a group that is attached to an instance, or at least fail when it tries and AWS prevents the deletion.

@GroovyCarrot

This was easily resolved for me by just commenting out the security group ID on the instance (which is also Terraform-managed) to encourage Terraform to detach it first, and then adding it back in later.
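A sketch of that temporary edit, with one assumption added: since an ENI must keep at least one security group, the VPC's default group is swapped in rather than leaving the list empty (resource names are illustrative):

```hcl
data "aws_security_group" "default" {
  name   = "default"
  vpc_id = aws_vpc.main.id # illustrative reference
}

resource "aws_instance" "app" {
  # ... other config ...

  # Temporarily detach the managed SG so Terraform can destroy it,
  # then restore the original line and apply again:
  # vpc_security_group_ids = [aws_security_group.app.id]
  vpc_security_group_ids = [data.aws_security_group.default.id]
}
```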

I feel like there's a graceful way this could be managed, but it probably requires one resource understanding how it interacts with another, beyond Terraform's ordering based on interpolation; at least where these quirks occur. For example, the aws_security_group resource would know that when it is being renamed/destroyed, it needs to check any aws_instance resources for a matching entry in vpc_security_group_ids in order to detach itself first.

It at least certainly needs to fail for situations where the attachment is not terraform managed, rather than hanging indefinitely.

@Jonnymcc
Contributor

Jonnymcc commented Jan 3, 2018

It would be nice if the security_group resource could detect what it is attached to and then detach itself from those resources, based on some flag on the security_group resource. Kind of like a resource-specific lifecycle action.

@radeksimko radeksimko added the service/ec2 Issues and PRs that pertain to the ec2 service. label Jan 25, 2018
@twall

twall commented Apr 25, 2018

It is really unexpected to not have any dependency management for security groups, especially since something as simple as a rename requires the resource to be re-created.

@boeboe

boeboe commented May 23, 2018

Any update on this issue?

@lazzarello

I was able to work around this on a smaller deployment (2 instances) by manually detaching the security groups. It seems like there might be something in the docs about using aws_network_interface_sg_attachment in different contexts.

@DanielMarquard

Same problem here. I'm seeing the issue with spot instance requests, specifically.

@Botono

Botono commented Mar 1, 2019

Seeing this issue today with security groups attached to Lambda functions.

@chillerby

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.
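A minimal sketch of that workaround (resource names are illustrative):

```hcl
resource "aws_security_group" "example" {
  # Omitting `name` lets Terraform generate one, which avoids the
  # duplicate-name error while the old and new groups briefly coexist.
  vpc_id = aws_vpc.main.id

  lifecycle {
    create_before_destroy = true
  }
}
```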

@jeohist
Contributor

jeohist commented Mar 11, 2019

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.

This works until you are at the security group limit per network interface (i.e. if the limit is 5 security groups per network interface and 5 security groups are attached, it will try to attach a 6th security group and fail).

@ryudice

ryudice commented Jun 18, 2019

Experiencing this issue also... 2 years after being reported...

@Dzhuneyt
Contributor

Dzhuneyt commented Aug 7, 2019

I don't understand how this wasn't addressed yet, given the wide usage of Security Groups on AWS.

@babbottscott

babbottscott commented Aug 13, 2019

Would love to see SG assignment be a separate resource. At the least, though, it seems reasonable to have SG destruction detach before destroying. I suppose, though, that the SG to be destroyed may have been assigned to an instance that is not in this run's state.

@eduardohki

eduardohki commented Sep 25, 2019

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.

@chillerby In your case, were the rules nested in aws_security_group, or were they defined in separate aws_security_group_rule resources?

I used the former, and the lifecycle entry didn't make a difference...

@gabrielerzinger

gabrielerzinger commented Dec 5, 2019

Any news on this?

Edit: I could make it work with @chillerby's suggestion; I had to add the lifecycle block to the whole chain using the SG. Thanks!

@ghost

ghost commented Feb 20, 2020

Bump. This is quite a frustrating issue.

@stuartabramshumphries

Still an issue; there is no way to get Terraform to recognise that a security group's name is being changed without it hanging.

@stuartabramshumphries

I needed to create a new security group and attach it to the instances; then, once Terraform had finished and I had checked that the old SG was no longer referenced, I could remove the old SG from the config and it ran again fine.
Adding the lifecycle block to the chain didn't seem to work, though...

@ahathaway

Such a terrible issue to not have any traction.

@ado120

ado120 commented Apr 1, 2020

Same issue for me as well. This is a huge one, any fixes?

@abeluck

abeluck commented Apr 6, 2020

Related (closed and locked) issue: hashicorp/terraform#8617

This is quite unfortunate.

@gaurish

gaurish commented Apr 15, 2020

I am still seeing this issue on

Terraform v0.12.24

  • provider.aws v2.57.0

@korobass

I can confirm the issue occurs for me as well. When I run terraform destroy, security groups are not detached from EKS cluster network interfaces, even if there are other sgs attached to them.

terraform version:
Terraform v0.12.20
provider.aws 2.7.0

@MicahCarrick

MicahCarrick commented May 17, 2020

I came across this issue recently and thought I would offer my solution. By adding this lifecycle block:

lifecycle { create_before_destroy = true }

within the security group resource, Terraform will create the new security group and attach it to the instance before trying to destroy the existing security group. By doing this, the existing security group is detached and can be destroyed without issue.

This will also not work if you explicitly assign a name to the security group, as you cannot have duplicate names.

You can append random digits from the random_id resource to the name of the security group which will then work with the create_before_destroy workaround.
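A sketch of that combination; the keepers block is an assumption here, used to force a fresh suffix (and thus a non-colliding name) whenever the attribute that triggers replacement changes:

```hcl
resource "random_id" "sg_suffix" {
  byte_length = 4

  # Assumed: tie the suffix's lifetime to whatever forces the SG
  # to be replaced, so a new name is generated alongside it.
  keepers = {
    vpc_id = aws_vpc.main.id
  }
}

resource "aws_security_group" "example" {
  name   = "my-sg-${random_id.sg_suffix.hex}"
  vpc_id = aws_vpc.main.id

  lifecycle {
    create_before_destroy = true
  }
}
```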

@Roxyrob

Roxyrob commented Jul 17, 2020

Hi @MicahCarrick

I applied the lifecycle solution (seen in #2445). It worked for me even with an explicit security group name defined (without a random part).

I don't know how, but Terraform seems to handle the name conflict.

@arcreative

arcreative commented Oct 25, 2020

lifecycle { create_before_destroy = true } does indeed break for named SGs, but I've confirmed that you can make the aforementioned change and just comment out your names; then, when the apply is complete, uncomment your names and reapply.

It's a bit annoying to do it in 2 steps, but it seems unavoidable due to AWS' name-collision error... unless the provider were to detect and handle the name collision using multiple steps.

@JCarrascoF

Still happening...
I hope this could be solved (without using the lifecycle option) by making create-before-destroy the default SG strategy.

Thanks to everyone who proposed the lifecycle workaround. It works.

@magic69

magic69 commented Jun 8, 2021

Really? What is wrong with HashiCorp lately? It started with the weird upgrade process that breaks everything, and now this 7-year-old bug?

@kingindanord

Still happening...

@awilkins

This works until you are at the security group limit per network interface (i.e. if the limit is 5 security groups per network interface and 5 security groups are attached, it will try to attach a 6th security group and fail).

See also #13593 - it's also impossible to remove the last security group from an ENI, and the "default" group is not a good stopgap for this transition because it's too permissive (by default).

@GeekEast

Still causing frustration.

@regmicmahesh

Still facing the same problem.

@rirze

rirze commented Nov 16, 2021

Looking further into this issue is super depressing -- from what I can tell, this issue could be resolved by hashicorp/terraform#16065, but that has been open since 2017.

As for my case, I can't recreate an SG because an existing SG (defined in my templates) references it. For some reason, Terraform keeps trying to delete it naively without traversing the dependency graph and deleting dependent objects in order. One would assume that an infrastructure-as-code tool would understand how to do this basic operation by default.

@pspot2

pspot2 commented Dec 23, 2021

terraform is 🗑️

What would be a better alternative?

@pspot2

pspot2 commented Dec 27, 2021

Hi all,

here is another workaround that doesn't require any manual actions. The trade-off is a bit more ugliness in your code.

As with many other Terraform shortcomings, the lifeboat in this case is local provisioners (again). In a nutshell:

  1. Add a destroy-time local-exec provisioner to your SG resource. This provisioner runs an AWS CLI command to detach the SG from your downstream resource (VPC endpoint, EC2 instance, etc.) and replace it with the default SG before the destroy logic executes. The catch is that this command needs to know the ID/name of your downstream resource. As a destroy-time provisioner does not allow referencing external resources, you'll need to add a couple of tags to your SG resource that identify your downstream resource and can be read by the AWS CLI. As this destroy-time provisioner runs both for replace and for pure destroy operations, it is important to ignore its errors by setting on_failure = continue.

  2. Create a new null resource and add a create-time local-exec provisioner to it. This provisioner runs an AWS CLI command to remove the default SG from your downstream resource; it is the cleanup action for the destroy-time provisioner described above. To execute this action only when the IDs of the attached SGs have changed, add a triggers block as shown in the example. As the action of this null resource is not always required (specifically, when creating the resources from scratch), it is important to ignore its errors by setting on_failure = continue. Note: the reason for creating a separate null resource instead of embedding this provisioner in the downstream resource is that the provisioner needs to run after the downstream resource has been created.

Here is a full code example. My downstream resource in this example is a VPC endpoint. Your mileage may vary.

data "aws_security_group" "default" {
    name   = "default"
    vpc_id = var.vpc_id
}

resource "aws_security_group" "private_link_sg" {
    name        = "my_beautiful_sg"
    vpc_id      = var.vpc_id

    # The downstream resource must have at least one SG attached, therefore we
    # attach the default SG of the VPC temporarily and remove it later on
    provisioner "local-exec" {
        when    = destroy
        command = <<DOC
            ENDPOINT_ID=`aws ec2 describe-vpc-endpoints --filters "Name=tag:Name,Values=${self.tags.workaround1}" --query "VpcEndpoints[0].VpcEndpointId" --output text` &&
            aws ec2 modify-vpc-endpoint --vpc-endpoint-id $${ENDPOINT_ID} --add-security-group-ids ${self.tags.workaround2} --remove-security-group-ids ${self.id}
        DOC
        on_failure = continue
    }

    tags = {
        # Destroy-stage provisioners do not allow referencing anything except
        # internal attributes (self, count and each.key). The required values
        # must therefore be added to the tags upfront, which can then be
        # referenced using 'self'.
        workaround1 = var.tags_vpc_endpoint.Name
        workaround2 = data.aws_security_group.default.id
    }
}

resource "aws_vpc_endpoint" "private_link" {
    # here goes the downstream resource...
    ...
    security_group_ids  = [aws_security_group.private_link_sg.id]
    ...
}

resource "null_resource" "remove_default_sg" {
    provisioner "local-exec" {
        command = <<DOC
            aws ec2 modify-vpc-endpoint --vpc-endpoint-id ${aws_vpc_endpoint.private_link.id} --remove-security-group-ids ${data.aws_security_group.default.id}
        DOC
        on_failure = continue
    }

    triggers = {
        rerun_upon_change_of = join(",", aws_vpc_endpoint.private_link.security_group_ids)
    }
}

@Etiene

Etiene commented Nov 4, 2022

@pspot2

here is another workaround that doesn't require any manual actions. The trade-off is a bit more ugliness in your code.

Thanks for suggesting using local-exec, this was the only thing that fixed it for me! 🎉 It reduced my terraform destroy time from over 30 minutes to less than 5 minutes!

I was dealing with lingering ENIs created automatically by a lambda function, which were attached to the SG. Some slight variations for fixing this in my case included:

  • The null resource and swapping in the default SG weren't needed, because a Lambda function isn't required to always have one
  • The destroy provisioner needed to be in the lambda resource, not in the SG
  • In the destroy provisioner command, I used aws lambda update-function-configuration to clear the VPC config from the function. Doing this before deleting the function immediately marks the ENIs for automatic deletion.
  • I used aws ec2 describe-network-interfaces to check the ENIs' status and sleep while they still exist or their status is in-use. I didn't need to force-delete the ENIs, because AWS deletes them automatically as soon as they become available. This waits before the rest of terraform destroy can go on, and then the function and the SG get destroyed just fine. This step is necessary because if the function is destroyed before the ENIs are done being cleaned up, they still take as long to be destroyed as before.

@srakshe

srakshe commented Nov 8, 2022

@Etiene

Could you post the code sample for the benefit of others?

@antoniogonca95

If you were wondering what the code @Etiene described could look like, here's an example I implemented to fix the exact same issue. I added this to my aws_lambda_function resource:

  # This block prevents the lambda's destruction from getting stuck on lingering ENIs
  # It clears the vpc config, and waits until the ENIs have been automatically deleted
  # Before deleting the function + security group
  provisioner "local-exec" {
    when    = destroy
    command = <<DOC
        aws lambda update-function-configuration --function-name ${self.tags.function_name} --vpc-config '{"SubnetIds": [], "SecurityGroupIds": []}' &&
        while result=$(aws ec2 describe-network-interfaces --output text --filters '[{"Name": "group-id", "Values": ["${self.tags.security_group_id}"] }, {"Name": "status", "Values": ["in-use", "available"]}]'); test "$result" != ""; do
          sleep 10;
        done
    DOC
  }

  tags = {
    # Destroy-stage provisioners do not allow referencing anything except
    # internal attributes (self, count and each.key). The required values
    # must therefore be added to the tags upfront, which can then be
    # referenced using 'self'.
    security_group_id = aws_security_group.this.id
    function_name = local.lambda_name
  }

Hope this helps

@YakDriver
Member

YakDriver commented Mar 9, 2023

TL;DR

The challenges with security group dependencies cannot be fully fixed, but there are things that can help.

Intro

Thank you very much for your comments and patience on this issue over the years! 🎉

After 6 years, we are going to bring this to a resolution. Unfortunately, resolution does not mean "solution"; I'll get into that below. While we cannot fully solve all the challenges, there have been improvements over time, and we can make more.

"Unfixable"

Making security groups work seamlessly with other resources in all the ways practitioners expect cannot be entirely fixed as Terraform and AWS work today.

The challenges with security groups are shared by other resources with dependency relationships. However, security groups may be trickier than most. Here are some reasons why security groups pose challenges that cannot be entirely fixed. Hopefully these help explain why this issue has been open for over 6 years.

Inherent limits

The challenges are inherent to the way security groups work. Consider this configuration:

resource "aws_security_group" "test" {
  vpc_id      = aws_vpc.test.id
  name        = "mysg"
}

resource "aws_instance" "test" {
  # ... other config ...
  vpc_security_group_ids = [aws_security_group.test.id]
}

In Terraform, the aws_instance.test resource refers to the security group, aws_security_group.test. In other words, to Terraform, the instance depends on the security group. This works well because the security group is created first, before the instance, so they can be correctly associated.

However, if you want to recreate the security group, AWS prevents deletion because the security group has a dependent: the instance!

Within Terraform, the aws_security_group.test has no knowledge of aws_instance.test. Managing this would require 2-way (cyclic) dependencies. Terraform is based on a Directed Acyclic Graph (DAG), which has many advantages, but doesn't allow for modeling this type of 2-way dependency.

Sheer numbers

While it might be possible to fudge things in the code for a single complication, the number of resources that reference security groups prevents one-off fudges. Currently, there are 101 resources with an argument that refers to security groups. Adding any kind of specialized logic to aws_security_group to manage 2-way dependencies for this many resources would be difficult, error prone, and unmaintainable.

With 2-way dependency information, what is the fix?

Even assuming we were going to employ any and all means to make this work, what would that even look like?

Take, for example, the situation where you have a security group associated with an EC2 instance. To be able to delete the security group, Terraform would need to modify the EC2 instance to substitute another security group -- but which one? We wouldn't want the security group resource modifying the EC2 instance, because that would introduce drift, and the substitute group wouldn't necessarily be under Terraform's management. This violates Terraform's design principles.

Certainly, when something as simple as a rename causes the security group to be forcibly replaced, it would be convenient to somehow handle the association problem. At the moment, AWS has not given the world an easy way to do this.

Partial fixes and improvements

replace_triggered_by

The new replace_triggered_by element of the lifecycle meta-argument can solve this problem without provisioners. However, the big downside is that it will recreate the resource associated with the security group. For cattle infrastructure, this may be acceptable.

Adding to the example from earlier, we have this:

resource "aws_security_group" "test" {
  vpc_id      = aws_vpc.test.id
  name        = "mysg"
}

resource "aws_instance" "test" {
  # ... other config ...
  vpc_security_group_ids = [aws_security_group.test.id]

  lifecycle {
    replace_triggered_by = [
      aws_security_group.test.name
    ]
  }
}

If you want to try this out, here's how you can:

  1. terraform apply
  2. Rename the security group to mysg2 and apply again
  3. It will work, but it will replace the EC2 instance as well.

Shorter timeout

If you are seeing long delays before security groups report that they have dependent objects, such as an EC2 instance, we recommend using a shorter timeout. Note that the default delete timeout is currently 15m. We are making changes to ensure the timeout you set is better respected. For example, to fail faster in a dependency situation, consider lowering the delete timeout to 2-5 minutes.

resource "aws_security_group" "test" {
  vpc_id      = aws_vpc.test.id
  name        = "mysg"

  timeouts {
    delete = "2m"
  }
}

Documentation

The PR that closes this issue adds significant documentation about why the problem occurs and adds examples, including ones using replace_triggered_by and provisioners.

@github-actions

This functionality has been released in v4.60.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 23, 2023