Revert "provider/aws: Fix dependency violation with subnets and security groups" #1368
Conversation
Hey @bitglue – thanks for the contribution. I'm sorry 3d80057 caused you so much trouble. We've found the pattern of retrying a deletion in the event of a dependency violation to be helpful in a handful of places, this one specifically. We felt it was a more fault-tolerant approach that minimized the amount of manual intervention required. Of course, perhaps I was too zealous in introducing it here, which we can discuss if that's the case. Was the error swallowed up completely? E.g., Terraform made no mention to you of a dependency violation? If so, would it be acceptable if we ensured that error bubbles up, and perhaps used a shorter wait window (1, maybe 2 minutes)?
I'd also share some relevant experience from Puppet here. Puppet has autorequire, where it tries to add implicit dependencies based on rules specific to each resource type. The trouble is, I'm never sure whether a particular thing will autorequire or not. In theory, I can look at the DAG or read the documentation to confirm that a thing does autorequire like I hope it does, but in practice that's more work than just typing the dependency. Besides, if it's not plainly obvious from reading the code how something will behave, I won't do it; the trouble I have in figuring it out is also a problem for everyone who reads the code in the future. Consequently, I just never knowingly used autorequire. Terraform is a lot better than Puppet in this respect because it automatically adds dependencies based on what variables I reference. This covers at least 95% of cases. It makes logical sense and it's explicit without being verbose. Puppet doesn't do that: if you reference a variable that isn't defined yet because you had the dependencies wrong, it just helpfully interpolates an empty string for you. 😠 As such, most of the reason why Puppet needs autorequire is already addressed by Terraform's implicit-dependency-on-interpolation logic.
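To make that concrete, here is a minimal sketch of the implicit-dependency-on-interpolation behavior described above (resource names and CIDR blocks are hypothetical, not taken from this thread): because the subnet's configuration interpolates the VPC's id, Terraform orders the two resources correctly with no explicit dependency declaration.

```hcl
# Hypothetical example: the ${aws_vpc.main.id} interpolation is what
# gives Terraform the implicit dependency between subnet and VPC, so
# the VPC is created first and destroyed last.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "frontend" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}
```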
@catsby Yeah, the error is swallowed completely. See my previous comment, which I was writing while you were writing yours. Maybe if the wait window were something like 5 seconds, it would be acceptablish. But really, I just don't like this kind of behavior. There are a lot of Terraform errors that could be corrected by retrying, but that doesn't make retrying the right behavior. Explicit is better than implicit. See the principle of least astonishment and DWIM.
Hey @pmoust - thanks for this feedback. The Puppet comparison is helpful context. I think you're absolutely right that non-obvious implicit behavior often leads to confusion and sadness, and Terraform's style is definitely to keep things as explicit and obvious as possible. Perhaps this means that when Terraform providers gain knowledge of implicit dependencies, they should simply use it to generate validation errors pointing out to the user how the configuration needs to be corrected. We'll see! Ultimately the change in question here is purely a stopgap, because Terraform core does not expose the proper tools for the provider to express the correct behavior. So the question is "what is the best behavior in the interim?" Give us a chance to talk it over and we'll get back to you soon.
Force-pushed from 026350b to 1bab291
I've rebased this onto the current master. I had to resolve some conflicts due to this function moving to the non-vendored AWS SDK. @phinze, any thoughts on this issue? I continue to encounter it multiple times per week (when I'm not running my own build, which includes this PR), and it's extremely annoying. Even though I know what the problem is now, I'm faced with 2 options:
Further compounding this issue is #922, which means the thing I'm reverting here doesn't even work as intended. If the subnet and the instances (or whatever) in it are defined in a module, then Terraform walks the destroy graph backwards, so you can wait forever and the instances in the subnet still won't be deleted. Terraform won't even try to delete them, because it thinks it needs to delete the subnet before it can delete the instances.
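For concreteness, here is a hypothetical module layout in which the #922 behavior bites (all names and values are illustrative, not taken from this thread): both resources live in the same module and the instance references the subnet, yet on destroy the graph is walked in the wrong order.

```hcl
# modules/app/main.tf -- hypothetical fragment
variable "vpc_id" {}

resource "aws_subnet" "app" {
  vpc_id     = "${var.vpc_id}"
  cidr_block = "10.0.2.0/24"
}

# The subnet_id interpolation means the instance should be destroyed
# before the subnet; per #922, the module destroy ordering is reversed.
resource "aws_instance" "app" {
  ami           = "ami-123456"
  instance_type = "m3.medium"
  subnet_id     = "${aws_subnet.app.id}"
}
```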
Hey @bitglue – I've submitted #1455 as a proposal to address your pain while still satisfying others who benefit from this retry logic. Personally, I'd like to apologize for causing you this grief! Obviously I only meant to improve things 😄. Regarding DWIM, I thought I was… A user asked Terraform to delete a thing or things, and it did. The implementation of how it does so shouldn't be so important, or so I believe(d). That said, completely hiding the true error (which was unintentional) under a 5 minute timeout was both bad and incorrect, so I apologize for the lost time and frustration. Please check out #1455 and let me know what you think, and thanks for your continued contributions to Terraform 😄
Force-pushed from ec38c86 to 81019d2
Revert "provider/aws: Fix dependency violation with subnets and security groups"

This reverts commit 3d80057.

Rationale: There is no guarantee that waiting will resolve the DependencyViolation. It's possible, perhaps likely, that the thing preventing the subnet deletion was created by some other means -- by another user, by a dynamic application, in another manifest, or that it was created by this manifest but for some other reason failed to be deleted.

In these cases, the retry behavior makes the user wait 5 minutes only to receive the misleading error:

    Error deleting subnet: timeout while waiting for state to become 'destroyed'

The obvious response to this is to try again, which yields another 5 minutes of waiting.

The previous behavior was to fail quickly with an error which exactly described the problem. While this is mildly annoying, it's at least apparent what is happening.

The situation the original commit was intended to address could be more elegantly addressed by:

- running Terraform a second time, or
- explicitly declaring dependencies via the `depends_on` attribute (see the sketch below).
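As an illustration of that second workaround, a hedged sketch (resource names and values are hypothetical): `depends_on` makes the ordering explicit even when no interpolation links the two resources, so Terraform destroys the dependent resource before attempting to delete the one it names.

```hcl
# Hypothetical example: nothing in the instance interpolates the subnet,
# so without depends_on Terraform would see no ordering between them.
# With it, the instance is destroyed before the subnet is deleted.
resource "aws_subnet" "frontend" {
  vpc_id     = "vpc-123456"
  cidr_block = "10.0.1.0/24"
}

resource "aws_instance" "worker" {
  ami           = "ami-123456"
  instance_type = "m3.medium"
  depends_on    = ["aws_subnet.frontend"]
}
```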
Hey @bitglue Thanks for offering this pull request, but we've decided to decline. While I admit that simply retrying over and over is not the perfect solution, I'm still convinced that this limited retrying is the correct behavior at this time. We are hoping that recent enhancements to modules can help alleviate the pain you've felt here, and we're still looking for the right solution overall, but for now retrying for a reasonable amount of time is the desired behavior. We've found that in more situations than not, a simple retry (or two) will address the eventually-consistent errors we've experienced with AWS. I agree with you, however, that 5 minutes is too long. I've lowered that to 2 minutes in commit 2724153. In my experience, this type of dependency violation typically clears up in under 1 minute. I understand that some situations, like the one you describe, will never clear up, but we still felt 2 minutes was a reasonable amount of time. Thank you again.
We also hit the same problem: there is an error, but it never surfaces. I'm now going to fork Terraform and probably use a custom implementation until we find a better solution. But I don't understand why no error is returned even though one clearly occurs. Am I missing something here?
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.