r/security_group: Add option to forcefully revoke rules before deletion #2074

catsby · 2017-10-26T21:29:09Z

Add a new revoke_rules_on_delete option for aws_security_group, which instructs the resource to delete its attached ingress and egress rules before attempting to delete the security group itself. Normally this isn’t required, but some AWS services accept a Security Group as an input and apply rules to it outside of Terraform’s influence. Specifically, the EMR service automatically applies rules to security groups used as EMR Managed Security Groups, and re-applies those rules if they are removed via the API, web console, etc. See Amazon EMR–Managed Security Groups for more information.

revoke_rules_on_delete is optional, with a default of false, so this extra operation is opt-in as it shouldn’t normally be needed.
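
As a minimal sketch of opting in (the resource name, group name, and vpc_id variable are placeholders for illustration, not taken from this PR):

```hcl
# Hypothetical placeholder input; supply your own VPC ID.
variable "vpc_id" {}

resource "aws_security_group" "example" {
  name   = "example-sg"
  vpc_id = "${var.vpc_id}"

  # Opt in: revoke the group's attached ingress and egress rules
  # before attempting to delete the group itself. Defaults to false.
  revoke_rules_on_delete = true
}
```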

This PR contains several things to support this feature:

- new revoke_rules_on_delete attribute, and documentation
- tests for new revoke_rules_on_delete attribute:
  - TestAccAWSSecurityGroup_forceRevokeRules_true
  - TestAccAWSSecurityGroup_forceRevokeRules_false
- state migration for Security Groups, to include this new default, and test
- support for timeouts on Security Groups: only delete at this time
- Test sweeper for VPCs and SecurityGroups; ended up not needing these, but I had already written them, and would like to build on them later
- Updated docs for emr_cluster for using revoke_rules_on_delete with any security groups used in emr_cluster.emr_managed_master_security_group or emr_cluster.emr_managed_slave_security_group

This PR is a patch for issues like #1454 where users cannot destroy an environment that has an EMR cluster in it. The events there are like so:

- configuration has:
  - master and slave security groups
  - emr_cluster with emr_managed_master_security_group and emr_managed_slave_security_group, interpolated from the above master and slave groups, respectively
- Terraform creates resources successfully
- EMR service injects rules into master and slave, creating a cyclic dependency; master depends on slave and vice versa. You cannot delete either without first revoking the rules that create the dependency, which Terraform has no authority over, because Terraform sees them as computed attributes of the two respective groups
- Terraform destroy can successfully destroy the cluster
- Because of the cyclic dependency, Terraform cannot destroy the Security Groups (neither could the web console or CLI, unless rules are revoked first)
- Terraform times out in the destroy, unable to delete the Security Groups or any resource that would be deleted after them.

With revoke_rules_on_delete on the master and slave groups, the EMR Cluster destroys successfully, then the rules are revoked and the groups destroy successfully.
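
A sketch of that setup, assuming the group IDs are passed through the cluster's ec2_attributes block. All names, the variables, and the release label are hypothetical placeholders, and the cluster definition is abbreviated (instance settings and other arguments a real cluster needs are omitted), so treat it as an illustration and not a complete configuration:

```hcl
# Hypothetical placeholder inputs.
variable "vpc_id" {}
variable "emr_service_role" {}
variable "emr_instance_profile" {}

resource "aws_security_group" "emr_master" {
  name   = "emr-master"
  vpc_id = "${var.vpc_id}"

  # Revoke rules (including those injected by EMR) before deletion.
  revoke_rules_on_delete = true
}

resource "aws_security_group" "emr_slave" {
  name   = "emr-slave"
  vpc_id = "${var.vpc_id}"

  revoke_rules_on_delete = true
}

resource "aws_emr_cluster" "example" {
  name          = "example-cluster"
  release_label = "emr-5.9.0" # placeholder release
  service_role  = "${var.emr_service_role}"

  # Instance types, applications, and other settings omitted for brevity.

  ec2_attributes {
    instance_profile                  = "${var.emr_instance_profile}"
    emr_managed_master_security_group = "${aws_security_group.emr_master.id}"
    emr_managed_slave_security_group  = "${aws_security_group.emr_slave.id}"
  }
}
```

Because the cluster references both groups, Terraform destroys the cluster before the groups; each group then revokes its remaining rules, which removes the master/slave cross-references before the groups themselves are deleted.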

Couldn’t users just specify the necessary rules with aws_security_group_rule resources, so Terraform could revoke them?

No; the EMR service applies these rules itself. If a user specifies these rules and they are created before the cluster, the EMR service will likely silently fail to add them, as they are already there. When destroying the environment, Terraform revokes the rules and destroys the cluster in parallel (or likely does, no guarantee). There is no dependency between them; the cluster depends on the groups, and the rules depend on the groups. After the rules are revoked by Terraform, EMR re-applies them. In my testing it takes ~5 minutes to destroy an EMR cluster, and it seems that even after the deletion API call is made, the EMR service is still re-applying those rules. Terraform revokes them, but EMR restores them, and we’re stuck in the same situation. With revoke_rules_on_delete, by contrast, the EMR cluster is destroyed first, after which the EMR service no longer re-applies the rules if they are removed; the rules themselves remain, so revoke_rules_on_delete removes them and then the groups are destroyed successfully.

catsby · 2017-10-26T21:39:15Z

Test results:

==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws -v -run=TestAccAWSSecurityGroup -timeout 120m
=== RUN   TestAccAWSSecurityGroup_importBasic
--- PASS: TestAccAWSSecurityGroup_importBasic (25.17s)
=== RUN   TestAccAWSSecurityGroup_importIpv6
--- PASS: TestAccAWSSecurityGroup_importIpv6 (25.00s)
=== RUN   TestAccAWSSecurityGroup_importSelf
--- PASS: TestAccAWSSecurityGroup_importSelf (27.08s)
=== RUN   TestAccAWSSecurityGroup_importSourceSecurityGroup
--- PASS: TestAccAWSSecurityGroup_importSourceSecurityGroup (26.26s)
=== RUN   TestAccAWSSecurityGroup_importIPRangeAndSecurityGroupWithSameRules
--- PASS: TestAccAWSSecurityGroup_importIPRangeAndSecurityGroupWithSameRules (29.07s)
=== RUN   TestAccAWSSecurityGroup_importIPRangesWithSameRules
--- PASS: TestAccAWSSecurityGroup_importIPRangesWithSameRules (26.85s)
=== RUN   TestAccAWSSecurityGroup_importPrefixList
--- PASS: TestAccAWSSecurityGroup_importPrefixList (28.75s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_VPC
--- PASS: TestAccAWSSecurityGroupRule_Ingress_VPC (25.79s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_Protocol
--- PASS: TestAccAWSSecurityGroupRule_Ingress_Protocol (23.72s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_Ipv6
--- PASS: TestAccAWSSecurityGroupRule_Ingress_Ipv6 (36.70s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_Classic
--- PASS: TestAccAWSSecurityGroupRule_Ingress_Classic (12.75s)
=== RUN   TestAccAWSSecurityGroupRule_MultiIngress
--- PASS: TestAccAWSSecurityGroupRule_MultiIngress (16.87s)
=== RUN   TestAccAWSSecurityGroupRule_Egress
--- PASS: TestAccAWSSecurityGroupRule_Egress (14.29s)
=== RUN   TestAccAWSSecurityGroupRule_SelfReference
--- PASS: TestAccAWSSecurityGroupRule_SelfReference (24.52s)
=== RUN   TestAccAWSSecurityGroupRule_ExpectInvalidTypeError
--- PASS: TestAccAWSSecurityGroupRule_ExpectInvalidTypeError (1.02s)
=== RUN   TestAccAWSSecurityGroupRule_ExpectInvalidCIDR
--- PASS: TestAccAWSSecurityGroupRule_ExpectInvalidCIDR (1.38s)
=== RUN   TestAccAWSSecurityGroupRule_PartialMatching_basic
--- PASS: TestAccAWSSecurityGroupRule_PartialMatching_basic (26.75s)
=== RUN   TestAccAWSSecurityGroupRule_PartialMatching_Source
--- PASS: TestAccAWSSecurityGroupRule_PartialMatching_Source (28.68s)
=== RUN   TestAccAWSSecurityGroupRule_Issue5310
--- PASS: TestAccAWSSecurityGroupRule_Issue5310 (12.36s)
=== RUN   TestAccAWSSecurityGroupRule_Race
--- PASS: TestAccAWSSecurityGroupRule_Race (218.04s)
=== RUN   TestAccAWSSecurityGroupRule_SelfSource
--- PASS: TestAccAWSSecurityGroupRule_SelfSource (25.16s)
=== RUN   TestAccAWSSecurityGroupRule_PrefixListEgress
--- PASS: TestAccAWSSecurityGroupRule_PrefixListEgress (26.01s)
=== RUN   TestAccAWSSecurityGroupRule_IngressDescription
--- PASS: TestAccAWSSecurityGroupRule_IngressDescription (16.05s)
=== RUN   TestAccAWSSecurityGroupRule_EgressDescription
--- PASS: TestAccAWSSecurityGroupRule_EgressDescription (15.17s)
=== RUN   TestAccAWSSecurityGroupRule_IngressDescription_updates
--- PASS: TestAccAWSSecurityGroupRule_IngressDescription_updates (22.60s)
=== RUN   TestAccAWSSecurityGroupRule_EgressDescription_updates
--- PASS: TestAccAWSSecurityGroupRule_EgressDescription_updates (23.21s)
=== RUN   TestAccAWSSecurityGroup_basic
--- PASS: TestAccAWSSecurityGroup_basic (22.21s)
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_true
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_true (89.02s)
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_false
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_false (58.12s)
=== RUN   TestAccAWSSecurityGroup_basicRuleDescription
--- PASS: TestAccAWSSecurityGroup_basicRuleDescription (24.07s)
=== RUN   TestAccAWSSecurityGroup_ipv6
--- PASS: TestAccAWSSecurityGroup_ipv6 (22.90s)
=== RUN   TestAccAWSSecurityGroup_tagsCreatedFirst
--- PASS: TestAccAWSSecurityGroup_tagsCreatedFirst (16.56s)
=== RUN   TestAccAWSSecurityGroup_namePrefix
--- PASS: TestAccAWSSecurityGroup_namePrefix (12.46s)
=== RUN   TestAccAWSSecurityGroup_self
--- PASS: TestAccAWSSecurityGroup_self (22.51s)
=== RUN   TestAccAWSSecurityGroup_vpc
--- PASS: TestAccAWSSecurityGroup_vpc (24.04s)
=== RUN   TestAccAWSSecurityGroup_vpcNegOneIngress
--- PASS: TestAccAWSSecurityGroup_vpcNegOneIngress (21.89s)
=== RUN   TestAccAWSSecurityGroup_vpcProtoNumIngress
--- PASS: TestAccAWSSecurityGroup_vpcProtoNumIngress (21.61s)
=== RUN   TestAccAWSSecurityGroup_MultiIngress
--- PASS: TestAccAWSSecurityGroup_MultiIngress (26.98s)
=== RUN   TestAccAWSSecurityGroup_Change
--- PASS: TestAccAWSSecurityGroup_Change (39.18s)
=== RUN   TestAccAWSSecurityGroup_ChangeRuleDescription
--- PASS: TestAccAWSSecurityGroup_ChangeRuleDescription (52.75s)
=== RUN   TestAccAWSSecurityGroup_generatedName
--- PASS: TestAccAWSSecurityGroup_generatedName (22.24s)
=== RUN   TestAccAWSSecurityGroup_DefaultEgress_VPC
--- PASS: TestAccAWSSecurityGroup_DefaultEgress_VPC (22.10s)
=== RUN   TestAccAWSSecurityGroup_DefaultEgress_Classic
--- PASS: TestAccAWSSecurityGroup_DefaultEgress_Classic (10.09s)
=== RUN   TestAccAWSSecurityGroup_drift
--- PASS: TestAccAWSSecurityGroup_drift (13.85s)
=== RUN   TestAccAWSSecurityGroup_drift_complex
--- PASS: TestAccAWSSecurityGroup_drift_complex (27.76s)
=== RUN   TestAccAWSSecurityGroup_invalidCIDRBlock
--- PASS: TestAccAWSSecurityGroup_invalidCIDRBlock (1.12s)
=== RUN   TestAccAWSSecurityGroup_tags
--- PASS: TestAccAWSSecurityGroup_tags (36.42s)
=== RUN   TestAccAWSSecurityGroup_CIDRandGroups
--- PASS: TestAccAWSSecurityGroup_CIDRandGroups (26.44s)
=== RUN   TestAccAWSSecurityGroup_ingressWithCidrAndSGs
--- PASS: TestAccAWSSecurityGroup_ingressWithCidrAndSGs (26.23s)
=== RUN   TestAccAWSSecurityGroup_ingressWithCidrAndSGs_classic
--- PASS: TestAccAWSSecurityGroup_ingressWithCidrAndSGs_classic (13.61s)
=== RUN   TestAccAWSSecurityGroup_egressWithPrefixList
--- PASS: TestAccAWSSecurityGroup_egressWithPrefixList (26.85s)
=== RUN   TestAccAWSSecurityGroup_ipv4andipv6Egress
--- PASS: TestAccAWSSecurityGroup_ipv4andipv6Egress (22.74s)
=== RUN   TestAccAWSSecurityGroup_failWithDiffMismatch
--- PASS: TestAccAWSSecurityGroup_failWithDiffMismatch (26.40s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       1489.436s

radeksimko

Thanks for debugging this! 👍

Just for transparency - as discussed via Slack on Friday, there's no way to identify rules created by EMR, so there's no better way to approach this.

My only two questions:

Do we really need the customizable timeout? It suggests that to work around the EMR problem user not only needs to set revoke_rules_on_delete but also figure out if the default timeout is sufficient. I think that for operation like this Terraform should have sufficiently high default timeout that nobody needs to tune it. Customizable timeouts are IMO useful for things like higher instance/disk sizes, where it's obvious that the user is doing something unusual (spinning up unusually big instance) and that will naturally take more time than usual.
Are those VPC sweepers actually going to work at this point? I thought we discussed earlier that VPC should come at last as one cannot delete a VPC before deleting all the resources within (subnets, IGWs, route tables, instances, etc.).

catsby · 2017-10-30T14:12:32Z

Hey @radeksimko !

Do we really need the customizable timeout? It suggests that to work around the EMR problem user not only needs to set revoke_rules_on_delete

The timeouts added to aws_security_group are basically unrelated. I added them to improve testing time because the normal timeout was 5 minutes to hit this race/block condition. I generally don't see harm in adding timeouts to resources, but if you strongly oppose I can remove it 😄

Are those VPC sweepers actually going to work at this point?

Yes. At this point they are scoped so that they would only ever destroy a VPC created in this test, to my knowledge. You are correct that destroying VPCs in the context of sweepers is hard, but I'd like to get it started. You see here that we depend on sweeping SecurityGroups, also tightly scoped.

I thought we discussed earlier that VPC should come at last

My concern is that "at last" will come way too late. I'm starting small, destroying a set of VPCs that are "known" to only have a few conflicting things, in a certain scenario. And even then, maybe not 100%, but I want to start small. I'm ok removing them if you'd like. Also as I mentioned:

Test sweeper for VPCs and SecurityGroups; ended up not needing these, but I had already written them, and would like to build on them later

Originally I started with an acceptance test that would fail, and I wanted leaked resources, to fully reproduce what was happening. I wrote sweepers to clean up the mess I left. I misunderstood ExpectError in the tests and so couldn't do it that way, so I ended up redoing the tests so that we didn't end up with leaks. I left the sweepers in and thought we should start small with VPC sweepers, if we could. Maybe they aren't very useful as they are, but they lay a foundation... or so I thought 😄

catsby · 2017-10-30T14:45:43Z

In ec868d3 I removed the timeouts support on aws_security_group. It's unrelated to the fix here and was discussed internally as best to omit for now.

==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws -v -run=TestAccAWSSecurityGroup_forceRevokeRules_ -timeout 120m
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_true
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_true (381.81s)
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_false
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_false (353.18s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       735.036s

radeksimko

LGTM, assuming green Travis.

copumpkin · 2018-08-01T02:37:43Z

@catsby trying to figure out if your change to the documentation here is correct. The revoke_rules_on_delete attribute is a great addition, but the previous advice to add depends_on still seems necessary because otherwise deletion can race with EMR which might try to recreate the rules as terraform deletes them.

Or at least, all three of my EMR SGs have revoke_rules_on_delete = true and deletion still chugs along forever.

Edit: I guess depends_on doesn't help here either.

copumpkin · 2018-08-01T11:58:55Z

With revoke_rules_on_delete on the master and slave groups, the EMR Cluster destroys successfully, then the rules are revoked and the groups destroy successfully.

Oh, for what it's worth, the issue for me is that terraform gets stuck deleting the service SG for their ENI in private subnets, which probably means I should have a depends_on for that. This stuff is so easy to screw up 😦

copumpkin · 2018-08-01T12:13:10Z

Sorry for the noise, I described my more complete issue in #5413

ghost · 2020-04-04T17:27:06Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

catsby added 3 commits October 26, 2017 13:31

initial work for force revoking rules, basic vpc and sg sweeper

9d28e8a

add docs, migration

73ed79a

huh

80291d7

catsby requested a review from radeksimko October 26, 2017 21:29

catsby mentioned this pull request Oct 26, 2017

Cannot destroy security groups if they've been attached to an EMR cluster #1454

Closed

radeksimko added the bug Addresses a defect in current functionality. label Oct 27, 2017

radeksimko added this to the v1.2.0 milestone Oct 27, 2017

radeksimko reviewed Oct 30, 2017

remove timeout for security groups

ec868d3

radeksimko approved these changes Oct 30, 2017

catsby merged commit ddb1d58 into master Oct 30, 2017

catsby deleted the f-sg-force-revoke branch October 30, 2017 14:58

catsby mentioned this pull request Nov 16, 2017

aws_security_group: DependencyViolation: resource sg-XXX has a dependent object #1671

Closed

copumpkin mentioned this pull request Aug 1, 2018

Security group woes launching EMR into a private subnet #5413

Closed

ghost locked and limited conversation to collaborators Apr 4, 2020
