
Revert "provider/aws: Fix dependency violation with subnets and security groups" #1368

Closed
wants to merge 3 commits

Conversation

@bitglue commented Apr 2, 2015

This reverts commit 3d80057.

Rationale: There is no guarantee that waiting will resolve the
DependencyViolation. It's possible, perhaps likely, that the thing
preventing the subnet deletion was created by some other means -- by
another user, by a dynamic application, in another manifest, or that it
was created by this manifest but for some other reason failed to be
deleted.

In these cases, the retry behavior makes the user wait 5 minutes only to
receive the misleading error:

Error deleting subnet: timeout while waiting for state to become
'destroyed'

The obvious response to this is to try again, which yields another 5
minutes of waiting.

The previous behavior was to fail quickly with an error which exactly
described the problem. While this is mildly annoying, it's at least
apparent what is happening.

The situation the original commit was intended to address could be more
elegantly addressed by:

  • running Terraform a second time, or
  • explicitly declaring dependencies via the depends_on attribute (see the sketch below).
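
For illustration, a minimal sketch of the second option. The resource names, AMI, and CIDR blocks are hypothetical, and this uses the 0.x-era list form of depends_on:

```
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "example" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}

resource "aws_instance" "app" {
  ami           = "ami-12345678"
  instance_type = "t2.micro"

  # No attribute of the subnet is interpolated here, so Terraform cannot
  # infer the ordering on its own. The explicit dependency guarantees the
  # instance is destroyed before the subnet, avoiding the
  # DependencyViolation in the first place.
  depends_on = ["aws_subnet.example"]
}
```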

@catsby (Contributor) commented Apr 2, 2015

Hey @bitglue – thanks for the contribution. I'm sorry 3d80057 caused you so much trouble.

We've found the pattern of retrying a deletion in the event of a dependency violation to be helpful in a handful of places, this one specifically. We felt it was a more fault-tolerant approach that minimized the number of manual depends_on relationships users had to declare and that the graph had to manage.

Of course, perhaps I was too zealous to introduce that here, which we can discuss if that's the case.

Was the error swallowed up completely, i.e. did Terraform make no mention to you of a dependency violation? If so, do you think that bubbling up that error, perhaps with a shorter wait window (1, maybe 2 minutes), would be acceptable?

@bitglue (Author) commented Apr 2, 2015

I'll also share some relevant experience from Puppet here. Puppet has autorequire, where it tries to add implicit dependencies based on rules specific to each resource.

The trouble is, I'm never sure whether a particular thing will autorequire or not. In theory, I can look at the DAG or read the documentation to confirm that a thing autorequires the way I hope it does, but in practice that's more work than just typing the dependency. Besides, if it's not plainly obvious from reading the code how something will behave, I won't do it. The trouble I have figuring it out is also a problem for everyone who reads the code in the future.

Consequently, I just never used autorequire knowingly.

Terraform is a lot better than Puppet in this respect because it automatically adds dependencies based on what variables I reference. This covers at least 95% of cases. It makes logical sense and it's explicit without being verbose. Puppet doesn't do that...if you reference a variable that isn't defined yet because you had the dependencies wrong, it just helpfully interpolates an empty string for you. 😠 As such, most of the reason why Puppet needs autorequire is really addressed by Terraform's implicit-dependency-on-interpolation logic.
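
To make that concrete, a minimal sketch of the interpolation-driven implicit dependency (again with hypothetical names and values, and assuming an aws_vpc.main resource like the one in the earlier sketch):

```
resource "aws_subnet" "example" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}

resource "aws_instance" "app" {
  ami           = "ami-12345678"
  instance_type = "t2.micro"

  # Referencing aws_subnet.example.id is all it takes: Terraform adds
  # the dependency edge automatically, so no depends_on is needed and
  # destroys happen in the right order.
  subnet_id = "${aws_subnet.example.id}"
}
```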

@bitglue (Author) commented Apr 2, 2015

@catsby Yeah, the error is swallowed completely. See my previous comment, which I was writing while you were writing yours.

Maybe if the wait window was like 5 seconds, it would be acceptablish. But really, I just don't like this kind of behavior. There are a lot of Terraform errors that can be corrected by retrying them, but that's not the right behavior. Explicit is better than implicit. See principle of least astonishment and DWIM.

@phinze (Contributor) commented Apr 2, 2015

Hey @bitglue - thanks for this feedback.

The Puppet autorequire anecdote is especially useful, since we've been talking internally about how to support provider-defined implicit dependencies, which is something we'd like to do eventually.

I think you're absolutely right that non-obvious implicit behavior often leads to confusion and sadness, and Terraform's style is definitely to keep things as explicit and obvious as possible. Perhaps this means that when Terraform providers gain knowledge of implicit dependencies, they will simply use it to generate validation errors pointing out to the user how the configuration needs to be corrected. We'll see!

Ultimately the change in question here is purely a stopgap, because Terraform core does not expose the proper tools to the provider to be able to express the correct behavior. So the question is "what is the best behavior in the interim?"

Give us a chance to talk it over and we'll get back to you soon.

@bitglue force-pushed the delete_subnets branch 2 times, most recently from 026350b to 1bab291, on April 9, 2015
@bitglue (Author) commented Apr 9, 2015

I've rebased this onto the current master. Had to resolve some conflicts due to this function moving to the non-vendored AWS SDK.

@phinze any thoughts on this issue? I continue to encounter it multiple times per week (when I'm not running my build which includes this PR), and it's extremely annoying. Even though I know what the problem is now, I'm faced with 2 options:

  1. wait helplessly for five minutes, then manually go into the web console to figure out what the problem is
  2. give it a double ^C and risk hosing the state file, per #694 (entire zombie deployments can be caused by external failures).

Further compounding this issue is #922, which means the thing I'm reverting here doesn't even work as intended. If the subnet and the instances (or whatever) in the subnet are defined in a module, then Terraform walks the destroy graph backwards, so you can wait forever and the instances in the subnet still won't be deleted. Terraform won't even try to delete them, because it thinks it needs to delete the subnet before it can delete the instances.

@catsby (Contributor) commented Apr 9, 2015

Hey @bitglue – I've submitted #1455 as a proposal to address your pain while still satisfying others who benefit from this retry logic.

Personally, I'd like to apologize for causing you this grief! Obviously I only meant to improve things 😄.

Regarding DWIM, I thought I was… A user asked Terraform to delete a thing or things, and it did. The implementation of how it does so shouldn't be so important, or so I believe(d). That said, completely hiding the true error, which was unintentional, behind a 5 minute timeout was both bad and incorrect, so I apologize for the lost time and frustration.

Please check out #1455 and let me know what you think, and thanks for your continued contributions to Terraform 😄

@catsby (Contributor) commented May 5, 2015

Hey @bitglue

Thanks for offering this pull request, but we’ve decided to decline. While I admit that simply retrying over and over is not the perfect solution, I’m still convinced that this limited retrying is the correct behavior at this time.

We are hoping that recent enhancements to modules can help alleviate the pain you've felt here, and we're still looking for the right solution overall, but for now retrying for a reasonable amount of time is the desired behavior. We've found that, more often than not, a simple retry (or two) will address the eventually-consistent errors we've experienced with AWS.

I agree with you, however, that 5 minutes is too long. I've lowered that to 2 minutes in commit 2724153. In my experience, this type of dependency violation typically clears up in under 1 minute. I understand that some situations, like the one you describe, will never clear up, but we still felt 2 minutes was a reasonable amount of time.

Thank you again.

@catsby closed this May 5, 2015
@bitglue (Author) commented May 5, 2015

@catsby If that's the case, you might want to look at #1732.

@fatih (Contributor) commented Oct 12, 2015

We also hit the same problem. There is a DependencyViolation error, but Terraform keeps trying to detach the internet gateway for 5 minutes and just times out. The reason, of course, is that there are other resources, not defined in Terraform, that use the resources here (in our case the internet_gateway).

I'm now going to fork Terraform and will probably use a custom implementation until we find a better solution. But I don't understand why no error is returned, even though one clearly occurred. Am I missing something here?
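
To make this failure mode concrete, a minimal hypothetical configuration; the resource actually blocking the detach (say, an EIP or ENI created by another tool or user) never appears in it, which is why no amount of retrying can help:

```
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Terraform manages the gateway, but something created outside this
# configuration still uses it. On destroy, DetachInternetGateway keeps
# returning DependencyViolation until that external resource is gone --
# and Terraform has no way to arrange that.
resource "aws_internet_gateway" "gw" {
  vpc_id = "${aws_vpc.main.id}"
}
```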

@ghost commented Apr 30, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators Apr 30, 2020