Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r/security_group: Add option to forcefully revoke rules before deletion #2074

Merged
merged 4 commits into from
Oct 30, 2017

Conversation

catsby
Copy link
Contributor

@catsby catsby commented Oct 26, 2017

Add new revoke_rules_on_delete option for aws_security_group, which instructs the resource to delete it’s attached ingress and egress rules before attempting to delete the security group itself. Normally this isn’t required but there are some AWS services that may accept a Security Group as an input and apply rules to it outside of Terraform’s influence. Specifically, the EMR service will automatically apply rules to security groups used for the EMR Managed Security Groups, and the service will also re-apply those rules if they are removed by the API, web console, et. al. See Amazon EMR–Managed Security Groups for more information about the EMR managed security groups, specifically.

revoke_rules_on_delete is optional, with a default of false, so this extra operation is opt-in as it shouldn’t normally be needed.

This PR contains several things to this support this feature:

  • new revoke_rules_on_delete attribute, and documentation
  • tests for new revoke_rules_on_delete attribute:
    • TestAccAWSSecurityGroup_forceRevokeRules_true
    • TestAccAWSSecurityGroup_forceRevokeRules_false
  • state migration for Security Groups, to include this new default, and test
  • support for timeouts on Security Groups: only delete at this time
  • Test sweeper for VPCs and SecurityGroups; ended up not needing these, but I had already written them, and would like to build on them later
  • Updated docs for emr_cluster for using revoke_rules_on_delete with any security groups used in emr_cluster.emr_managed_master_security_group or emr_cluster.emr_managed_slave_security_group

This PR is a patch for issues like #1454 where users cannot destroy an environment that has an EMR cluster in it. The events there are like so:

  • configuration has:
    • master and slave security groups
    • emr_cluster with emr_managed_master_security_group and emr_managed_slave_security_group, interpolated from the above master and slave groups, respectively
  • Terraform creates resources successfully
  • EMR service injects rules into master and slave, creating a cyclic dependency; master depends on slave and visa-versa. You cannot delete either without first revoking the rules that create the dependency, which Terraform has no authority over, because Terraform sees them as computed attributes of the two respective groups
  • Terraform destroy can successfully destroy the cluster
  • Because of the cyclic dependency, Terraform cannot destroy the Security Groups (neither could the web or CLI, unless rules are revoked first)
  • Terraform times out in the destroy, unable to delete the Security Groups or any resource that would be deleted after.

With revoke_rules_on_delete on the master and slave groups, the EMR Cluster destroys successfully, then the rules are revoked and the groups destroy successfully.


Couldn’t users just specify the necessary rules with aws_security_group_rule resources, so Terraform could revoke them?

No; the EMR Service applies these rules itself. If a user specifies these rules and they are created before the cluster, the EMR service will likely silently fail to add those rules as they are already there. When destroying the environment, Terraform revokes the rules and the Cluster in parallel (or likely does, no guarantee). There is no dependency there; the cluster depends on the groups, the rules depend on the groups. After the rules are revoked by Terraform, EMR re-applies them. In my testing it takes ~5 minutes to destroy an EMR cluster, and it seems that even after the deletion API call is made, the EMR Service is still re-applying those rules. Terraform revokes them, but EMR restores them, and we’re stuck in the same situation. In this scenario, with revoke_rules_on_delete, the EMR cluster destroys and the EMR service no longer attempts to re-apply those rules if they are removed, but they remain, so revoke_rules_on_delete removes them first and then we destroy the groups successfully.

@catsby
Copy link
Contributor Author

catsby commented Oct 26, 2017

Test results:

==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws -v -run=TestAccAWSSecurityGroup -timeout 120m
=== RUN   TestAccAWSSecurityGroup_importBasic
--- PASS: TestAccAWSSecurityGroup_importBasic (25.17s)
=== RUN   TestAccAWSSecurityGroup_importIpv6
--- PASS: TestAccAWSSecurityGroup_importIpv6 (25.00s)
=== RUN   TestAccAWSSecurityGroup_importSelf
--- PASS: TestAccAWSSecurityGroup_importSelf (27.08s)
=== RUN   TestAccAWSSecurityGroup_importSourceSecurityGroup
--- PASS: TestAccAWSSecurityGroup_importSourceSecurityGroup (26.26s)
=== RUN   TestAccAWSSecurityGroup_importIPRangeAndSecurityGroupWithSameRules
--- PASS: TestAccAWSSecurityGroup_importIPRangeAndSecurityGroupWithSameRules (29.07s)
=== RUN   TestAccAWSSecurityGroup_importIPRangesWithSameRules
--- PASS: TestAccAWSSecurityGroup_importIPRangesWithSameRules (26.85s)
=== RUN   TestAccAWSSecurityGroup_importPrefixList
--- PASS: TestAccAWSSecurityGroup_importPrefixList (28.75s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_VPC
--- PASS: TestAccAWSSecurityGroupRule_Ingress_VPC (25.79s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_Protocol
--- PASS: TestAccAWSSecurityGroupRule_Ingress_Protocol (23.72s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_Ipv6
--- PASS: TestAccAWSSecurityGroupRule_Ingress_Ipv6 (36.70s)
=== RUN   TestAccAWSSecurityGroupRule_Ingress_Classic
--- PASS: TestAccAWSSecurityGroupRule_Ingress_Classic (12.75s)
=== RUN   TestAccAWSSecurityGroupRule_MultiIngress
--- PASS: TestAccAWSSecurityGroupRule_MultiIngress (16.87s)
=== RUN   TestAccAWSSecurityGroupRule_Egress
--- PASS: TestAccAWSSecurityGroupRule_Egress (14.29s)
=== RUN   TestAccAWSSecurityGroupRule_SelfReference
--- PASS: TestAccAWSSecurityGroupRule_SelfReference (24.52s)
=== RUN   TestAccAWSSecurityGroupRule_ExpectInvalidTypeError
--- PASS: TestAccAWSSecurityGroupRule_ExpectInvalidTypeError (1.02s)
=== RUN   TestAccAWSSecurityGroupRule_ExpectInvalidCIDR
--- PASS: TestAccAWSSecurityGroupRule_ExpectInvalidCIDR (1.38s)
=== RUN   TestAccAWSSecurityGroupRule_PartialMatching_basic
--- PASS: TestAccAWSSecurityGroupRule_PartialMatching_basic (26.75s)
=== RUN   TestAccAWSSecurityGroupRule_PartialMatching_Source
--- PASS: TestAccAWSSecurityGroupRule_PartialMatching_Source (28.68s)
=== RUN   TestAccAWSSecurityGroupRule_Issue5310
--- PASS: TestAccAWSSecurityGroupRule_Issue5310 (12.36s)
=== RUN   TestAccAWSSecurityGroupRule_Race
--- PASS: TestAccAWSSecurityGroupRule_Race (218.04s)
=== RUN   TestAccAWSSecurityGroupRule_SelfSource
--- PASS: TestAccAWSSecurityGroupRule_SelfSource (25.16s)
=== RUN   TestAccAWSSecurityGroupRule_PrefixListEgress
--- PASS: TestAccAWSSecurityGroupRule_PrefixListEgress (26.01s)
=== RUN   TestAccAWSSecurityGroupRule_IngressDescription
--- PASS: TestAccAWSSecurityGroupRule_IngressDescription (16.05s)
=== RUN   TestAccAWSSecurityGroupRule_EgressDescription
--- PASS: TestAccAWSSecurityGroupRule_EgressDescription (15.17s)
=== RUN   TestAccAWSSecurityGroupRule_IngressDescription_updates
--- PASS: TestAccAWSSecurityGroupRule_IngressDescription_updates (22.60s)
=== RUN   TestAccAWSSecurityGroupRule_EgressDescription_updates
--- PASS: TestAccAWSSecurityGroupRule_EgressDescription_updates (23.21s)
=== RUN   TestAccAWSSecurityGroup_basic
--- PASS: TestAccAWSSecurityGroup_basic (22.21s)
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_true
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_true (89.02s)
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_false
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_false (58.12s)
=== RUN   TestAccAWSSecurityGroup_basicRuleDescription
--- PASS: TestAccAWSSecurityGroup_basicRuleDescription (24.07s)
=== RUN   TestAccAWSSecurityGroup_ipv6
--- PASS: TestAccAWSSecurityGroup_ipv6 (22.90s)
=== RUN   TestAccAWSSecurityGroup_tagsCreatedFirst
--- PASS: TestAccAWSSecurityGroup_tagsCreatedFirst (16.56s)
=== RUN   TestAccAWSSecurityGroup_namePrefix
--- PASS: TestAccAWSSecurityGroup_namePrefix (12.46s)
=== RUN   TestAccAWSSecurityGroup_self
--- PASS: TestAccAWSSecurityGroup_self (22.51s)
=== RUN   TestAccAWSSecurityGroup_vpc
--- PASS: TestAccAWSSecurityGroup_vpc (24.04s)
=== RUN   TestAccAWSSecurityGroup_vpcNegOneIngress
--- PASS: TestAccAWSSecurityGroup_vpcNegOneIngress (21.89s)
=== RUN   TestAccAWSSecurityGroup_vpcProtoNumIngress
--- PASS: TestAccAWSSecurityGroup_vpcProtoNumIngress (21.61s)
=== RUN   TestAccAWSSecurityGroup_MultiIngress
--- PASS: TestAccAWSSecurityGroup_MultiIngress (26.98s)
=== RUN   TestAccAWSSecurityGroup_Change
--- PASS: TestAccAWSSecurityGroup_Change (39.18s)
=== RUN   TestAccAWSSecurityGroup_ChangeRuleDescription
--- PASS: TestAccAWSSecurityGroup_ChangeRuleDescription (52.75s)
=== RUN   TestAccAWSSecurityGroup_generatedName
--- PASS: TestAccAWSSecurityGroup_generatedName (22.24s)
=== RUN   TestAccAWSSecurityGroup_DefaultEgress_VPC
--- PASS: TestAccAWSSecurityGroup_DefaultEgress_VPC (22.10s)
=== RUN   TestAccAWSSecurityGroup_DefaultEgress_Classic
--- PASS: TestAccAWSSecurityGroup_DefaultEgress_Classic (10.09s)
=== RUN   TestAccAWSSecurityGroup_drift
--- PASS: TestAccAWSSecurityGroup_drift (13.85s)
=== RUN   TestAccAWSSecurityGroup_drift_complex
--- PASS: TestAccAWSSecurityGroup_drift_complex (27.76s)
=== RUN   TestAccAWSSecurityGroup_invalidCIDRBlock
--- PASS: TestAccAWSSecurityGroup_invalidCIDRBlock (1.12s)
=== RUN   TestAccAWSSecurityGroup_tags
--- PASS: TestAccAWSSecurityGroup_tags (36.42s)
=== RUN   TestAccAWSSecurityGroup_CIDRandGroups
--- PASS: TestAccAWSSecurityGroup_CIDRandGroups (26.44s)
=== RUN   TestAccAWSSecurityGroup_ingressWithCidrAndSGs
--- PASS: TestAccAWSSecurityGroup_ingressWithCidrAndSGs (26.23s)
=== RUN   TestAccAWSSecurityGroup_ingressWithCidrAndSGs_classic
--- PASS: TestAccAWSSecurityGroup_ingressWithCidrAndSGs_classic (13.61s)
=== RUN   TestAccAWSSecurityGroup_egressWithPrefixList
--- PASS: TestAccAWSSecurityGroup_egressWithPrefixList (26.85s)
=== RUN   TestAccAWSSecurityGroup_ipv4andipv6Egress
--- PASS: TestAccAWSSecurityGroup_ipv4andipv6Egress (22.74s)
=== RUN   TestAccAWSSecurityGroup_failWithDiffMismatch
--- PASS: TestAccAWSSecurityGroup_failWithDiffMismatch (26.40s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       1489.436s

@radeksimko radeksimko added the bug Addresses a defect in current functionality. label Oct 27, 2017
@radeksimko radeksimko added this to the v1.2.0 milestone Oct 27, 2017
Copy link
Member

@radeksimko radeksimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for debugging this! 👍

Just for transparency - as discussed via Slack on Friday, there's no way to identify rules created by EMR, so there's no better way to approach this.

My only two questions:

  • Do we really need the customizable timeout? It suggests that to work around the EMR problem user not only needs to set revoke_rules_on_delete but also figure out if the default timeout is sufficient. I think that for operation like this Terraform should have sufficiently high default timeout that nobody needs to tune it. Customizable timeouts are IMO useful for things like higher instance/disk sizes, where it's obvious that the user is doing something unusual (spinning up unusually big instance) and that will naturally take more time than usual.

  • Are those VPC sweepers actually going to work at this point? I thought we discussed earlier that VPC should come at last as one cannot delete a VPC before deleting all the resources within (subnets, IGWs, route tables, instances, etc.).

@catsby
Copy link
Contributor Author

catsby commented Oct 30, 2017

Hey @radeksimko !

Do we really need the customizable timeout? It suggests that to work around the EMR problem user not only needs to set revoke_rules_on_delete

The timeouts added to aws_security_group are basically unrelated. I added them to improve testing time because the normal timeout was 5 minutes to hit this race/block condition. I generally don't see harm in adding timeouts to resources, but if you strongly oppose I can remove it 😄

Are those VPC sweepers actually going to work at this point?

Yes. At this point they are scoped so that they would only ever destroy a VPC created in this test, to my knowledge. You are correct that destroying VPCs in the context of sweepers is hard, but I'd like to get it started. You see here that we depend on sweeping SecurityGroups, also tightly scoped.

I thought we discussed earlier that VPC should come at last

My concern is that "at last" will come way to late. I'm starting small, destroying a set of VPCs that are "known" to only have a few conflicting things, in a certain scenario. And even then, maybe not 100%, but I want to start small. I'm ok removing them if you'd like. Also as I mentioned:

Test sweeper for VPCs and SecurityGroups; ended up not needing these, but I had already written them, and would like to build on them later

Originally I started with an acceptance test that would fail, and I wanted leaked resources, to full reproduce what was happening. I wrote sweepers to clean up the mess I left. I misunderstood ExpectError in the tests and so couldn't do it that way, so ended up redoing the tests so that we didn't end up with leaks. I left the sweepers in and thought we should start small with VPC sweepers, if we could. Maybe they aren't very useful as they are, but they lay a foundation.. or so I though 😄

@catsby
Copy link
Contributor Author

catsby commented Oct 30, 2017

In ec868d3 I removed the timeouts support on aws_security_group. It's unrelated to the fix here and was discussed internally as best to omit for now.

==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws -v -run=TestAccAWSSecurityGroup_forceRevokeRules_ -timeout 120m
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_true
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_true (381.81s)
=== RUN   TestAccAWSSecurityGroup_forceRevokeRules_false
--- PASS: TestAccAWSSecurityGroup_forceRevokeRules_false (353.18s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       735.036s

Copy link
Member

@radeksimko radeksimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, assuming green Travis.

@copumpkin
Copy link

copumpkin commented Aug 1, 2018

@catsby trying to figure out if your change to the documentation here is correct. The revoke_rules_on_delete attribute is a great addition, but the previous advice to add depends_on still seems necessary because otherwise deletion can race with EMR which might try to recreate the rules as terraform deletes them.

Or at least, all three of my EMR SGs have revoke_rules_on_delete = true and deletion still chugs along forever.

Edit: I guess depends_on doesn't help here either.

@copumpkin
Copy link

With revoke_rules_on_delete on the master and slave groups, the EMR Cluster destroys successfully, then the rules are revoked and the groups destroy successfully.

Oh, for what it's worth, the issue for me is that terraform gets stuck deleting the service SG for their ENI in private subnets, which probably means I should have a depends_on for that. This stuff is so easy to screw up 😦

@copumpkin
Copy link

Sorry for the noise, I described my more complete issue in #5413

@ghost
Copy link

ghost commented Apr 4, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 4, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants