Revert "provider/aws: Fix dependency violation with subnets and security groups" #1368
Conversation
Hey @bitglue – thanks for the contribution. I'm sorry 3d80057 caused you so much trouble. We've found the pattern of retrying a deletion in the event of a dependency violation to be helpful in a handful of places, this one specifically. We felt it was a more fault-tolerant approach that minimized the amount of manual intervention required. Of course, perhaps I was too zealous in introducing it here, which we can discuss if that's the case. Was the error swallowed up completely? E.g., Terraform made no mention to you of a dependency violation? If so, would it be acceptable if we ensured that error bubbles up, and perhaps used a shorter wait window (1, maybe 2 minutes)?
I'd also share some relevant experience from Puppet here. Puppet has autorequire, where it tries to add implicit dependencies based on rules specific to each resource type. The trouble is, I'm never sure whether a particular thing will autorequire or not. In theory, I can look at the DAG or read the documentation to confirm that a thing does autorequire like I hope it does, but in practice that's more work than just typing the dependency. Besides, if it's not plainly obvious from reading the code how something will behave, I won't do it; the trouble I have in figuring it out is also a problem for everyone who reads the code in the future. Consequently, I just never knowingly used autorequire. Terraform is a lot better than Puppet in this respect because it automatically adds dependencies based on what variables I reference. This covers at least 95% of cases. It makes logical sense and it's explicit without being verbose. Puppet doesn't do that: if you reference a variable that isn't defined yet because you had the dependencies wrong, it just helpfully interpolates an empty string for you. 😠 As such, most of the reason why Puppet needs autorequire is already addressed by Terraform's implicit-dependency-on-interpolation logic.
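To make that concrete, here is a minimal sketch of the implicit-dependency-on-interpolation behavior described above (resource names and CIDR blocks are hypothetical, not taken from this thread): because the subnet's configuration interpolates the VPC's id, Terraform orders the two resources correctly with no explicit dependency declaration.

```hcl
# Hypothetical example: the ${aws_vpc.main.id} interpolation is what
# gives Terraform the implicit dependency between subnet and VPC, so
# the VPC is created first and destroyed last.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "frontend" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}
```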
@catsby Yeah, the error is swallowed completely. See my previous comment, which I was writing while you were writing yours. Maybe if the wait window were something like 5 seconds, it would be acceptablish. But really, I just don't like this kind of behavior. There are a lot of Terraform errors that could be corrected by retrying, but that doesn't make retrying the right behavior. Explicit is better than implicit. See the principle of least astonishment and DWIM.
Hey @pmoust - thanks for this feedback. The Puppet comparison is helpful context. I think you're absolutely right that non-obvious implicit behavior often leads to confusion and sadness, and Terraform's style is definitely to keep things as explicit and obvious as possible. Perhaps this means that when Terraform providers gain knowledge of implicit dependencies, they should simply use it to generate validation errors pointing out to the user how the configuration needs to be corrected. We'll see! Ultimately the change in question here is purely a stopgap, because Terraform core does not expose the proper tools for the provider to express the correct behavior. So the question is "what is the best behavior in the interim?" Give us a chance to talk it over and we'll get back to you soon.
Force-pushed from 026350b to 1bab291
I've rebased this onto the current master. I had to resolve some conflicts due to this function moving to the non-vendored AWS SDK. @phinze, any thoughts on this issue? I continue to encounter it multiple times per week (when I'm not running my own build, which includes this PR), and it's extremely annoying. Even though I know what the problem is now, I'm faced with 2 options:
Further compounding this issue is #922, which means the thing I'm reverting here doesn't even work as intended. If the subnet and the instances (or whatever) in it are defined in a module, then Terraform walks the destroy graph backwards, so you can wait forever and the instances in the subnet still won't be deleted. Terraform won't even try to delete them, because it thinks it needs to delete the subnet before it can delete the instances.
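For concreteness, here is a hypothetical module layout in which the #922 behavior bites (all names and values are illustrative, not taken from this thread): both resources live in the same module and the instance references the subnet, yet on destroy the graph is walked in the wrong order.

```hcl
# modules/app/main.tf -- hypothetical fragment
variable "vpc_id" {}

resource "aws_subnet" "app" {
  vpc_id     = "${var.vpc_id}"
  cidr_block = "10.0.2.0/24"
}

# The subnet_id interpolation means the instance should be destroyed
# before the subnet; per #922, the module destroy ordering is reversed.
resource "aws_instance" "app" {
  ami           = "ami-123456"
  instance_type = "m3.medium"
  subnet_id     = "${aws_subnet.app.id}"
}
```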
Hey @bitglue – I've submitted #1455 as a proposal to address your pain while still satisfying others who benefit from this retry logic. Personally, I'd like to apologize for causing you this grief! Obviously I only meant to improve things 😄. Regarding DWIM, I thought I was… A user asked Terraform to delete a thing or things, and it did. The implementation of how it does so shouldn't be so important, or so I believe(d). That said, completely hiding the true error (which was unintentional) under a 5 minute timeout was both bad and incorrect, so I apologize for the lost time and frustration. Please check out #1455 and let me know what you think, and thanks for your continued contributions to Terraform 😄
Force-pushed from ec38c86 to 81019d2
Revert "provider/aws: Fix dependency violation with subnets and security groups"

This reverts commit 3d80057.

Rationale: There is no guarantee that waiting will resolve the DependencyViolation. It's possible, perhaps likely, that the thing preventing the subnet deletion was created by some other means -- by another user, by a dynamic application, in another manifest, or that it was created by this manifest but for some other reason failed to be deleted.

In these cases, the retry behavior makes the user wait 5 minutes only to receive the misleading error:

    Error deleting subnet: timeout while waiting for state to become 'destroyed'

The obvious response to this is to try again, which yields another 5 minutes of waiting.

The previous behavior was to fail quickly with an error which exactly described the problem. While this is mildly annoying, it's at least apparent what is happening.

The situation the original commit was intended to address could be more elegantly addressed by:

- running Terraform a second time, or
- explicitly declaring dependencies via the `depends_on` attribute (see the sketch below).
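As an illustration of that second workaround, a hedged sketch (resource names and values are hypothetical): `depends_on` makes the ordering explicit even when no interpolation links the two resources, so Terraform destroys the dependent resource before attempting to delete the one it names.

```hcl
# Hypothetical example: nothing in the instance interpolates the subnet,
# so without depends_on Terraform would see no ordering between them.
# With it, the instance is destroyed before the subnet is deleted.
resource "aws_subnet" "frontend" {
  vpc_id     = "vpc-123456"
  cidr_block = "10.0.1.0/24"
}

resource "aws_instance" "worker" {
  ami           = "ami-123456"
  instance_type = "m3.medium"
  depends_on    = ["aws_subnet.frontend"]
}
```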
Hey @bitglue Thanks for offering this pull request, but we've decided to decline. While I admit that simply retrying over and over is not the perfect solution, I'm still convinced that this limited retrying is the correct behavior at this time. We are hoping that recent enhancements to modules can help alleviate the pain you've felt here, and we're still looking for the right solution overall, but for now retrying for a reasonable amount of time is the desired behavior. We've found that in more situations than not, a simple retry (or two) will address the eventually-consistent errors we've experienced with AWS. I agree with you, however, that 5 minutes is too long. I've lowered that to 2 minutes in commit 2724153. In my experience, this type of dependency violation typically clears up in under 1 minute. I understand that some situations, like the one you describe, will never clear up, but we still felt 2 minutes was a reasonable amount of time. Thank you again.
We also hit the same problem: there is an error, but it never surfaces. I'm now going to fork Terraform and probably use a custom implementation until we find a better solution. But I don't understand why no error is returned even though one clearly occurs. Am I missing something here?
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.