Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate create before destroy behavior to dependencies #31316

Closed
Nuru opened this issue Jun 24, 2022 · 6 comments
Closed

Propagate create before destroy behavior to dependencies #31316

Nuru opened this issue Jun 24, 2022 · 6 comments
Labels
duplicate issue closed because another issue already tracks this problem enhancement

Comments

@Nuru
Copy link

Nuru commented Jun 24, 2022

Resources that depend on other resources should follow the lifecycle of the key resource when it is replaced.

Current Terraform Version

Terraform v1.2.0

Use-cases

I want to provision an AWS Security Group and I want to be able to modify it without a service outage. To support this, I set the lifecycle create_before_destroy option to true. This means that when a new security group needs to replace an old one (which can be triggered by simply changing its name), the new one will be ready and can be associated with other resources before the old one is destroyed.

However, a security group, in the AWS provider v4.0 model, does not directly have rules. (Use of inline rules is not recommended for many reasons, see References below.) Instead, rules are created for a security group by the aws_security_group_rule resource. A security group will not function as intended until all of its rules are created. This creates a problem.

I cannot set create_before_destroy to true on the aws_security_group_rule resources because of hashicorp/terraform-provider-aws#25173. Likely even if that bug were fixed, there would be other problems (see issues referenced in that issue) making it unwise to set create_before_destroy to true on the aws_security_group_rule resources as a general proposition, because of the difficulty in ensuring that the new rules do not duplicate the existing rules, perhaps as a consequence of changing positions in a list and using count. The issue here is avoiding duplicate rules in an existing security group.

So, if I make changes to the security group rules, the old ones are deleted before the new ones are created. This does create a brief outage, but the only fix for this would be for the provider to see that it is deleting and recreating identical resources and to move them instead. That would be a great feature, but seems too much to ask, so I will accept this limitation. (On the other hand, I will open a a feature request and or suggested bug fix for hashicorp/terraform-provider-aws#25173 and a bunch of related bugs if you think it will receive serious consideration. Just point the way.)

However, if I make a change to the security group itself, currently this is the order of operations:

  1. Destroy the existing security group rules
  2. Create the new security group
  3. Create the new security group rules
  4. Destroy the deposed security group

Attempted Solutions

I tried using a null_resource that depended on the rules and had create_before_destroy = true, triggered by the security group ID, but that caused the problematic create_before_destroy behavior on the rules even when the security group ID did not change. In fact, just the declaration of the null_resource with depends_on triggered the behavior, even if the resource was not created because of count = 0. (Is this a bug?)

Click to reveal configuration

Changing the value of var.cidr in this example causes "create replacement and then destroy" behavior for aws_security_group_rule.example, even though no changes were planned for the security group or the null resource. I would like Terraform to destroy the rule first, since the security group ID is not changing.

resource "aws_security_group" "cbd" {
  vpc_id = aws_vpc.example.id

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group_rule" "example" {
  security_group_id = aws_security_group.cbd.id
  type              = "egress"
  from_port         = 0
  to_port           = 0 # [sic] from and to port ignored when protocol is "-1", warning if not zero
  protocol          = "-1"
  description       = "Allow ingress from VPC"
  cidr_blocks       = [var.sg_cidr]

  lifecycle {
    create_before_destroy = false
  }
}

resource "null_resource" "rules" {
  count = 0
  # Replacement of the security group requires re-provisioning
  triggers = {
    sg_id = aws_security_group.cbd.id
  }

  depends_on = [aws_security_group_rule.example]

  lifecycle {
    create_before_destroy = true
  }
}

Proposal

While I do not expect Terraform to know the intimate details of the resource dependencies and requirements, in this particular case, Terraform does know that the rules are being replaced because the security group ID is changing (plan flags this as forcing replacement). In this particular situation, I would like Terraform to see that the resource is being replaced because of a resource that has create before destroy behavior, and for this particular plan, implement the same behavior for the dependent resource. So the order of operations would be:

  1. Create the new security group
  2. Create the new security group rules
  3. Destroy the deposed security group rules
  4. Destroy the deposed security group

It would be fine with me if this were configurable via lifecycle or dependency (or other) blocks. The key factor here is that for rules, the create before destroy behavior is desirable and only practical when the security group they depend on is changing. So I would like that behavior to be inherited from the security group's lifecycle when that is the trigger for replacing them.

References

@jbardin
Copy link
Member

jbardin commented Jun 27, 2022

Hi @Nuru,

Because the null_resource in your example uses create_before_destroy, it forces the aws_security_group_rule to use create_before_destroy too. There is no reason to ever set create_before_destroy to false, since it cannot be overridden - it's false by default and only forced to true when it is required to prevent an impossible ordering of operations, as in your example.

The overall ordering of operations is statically defined by the structure of the configuration, and must be consistent for all cases. This is especially important when changes to the configuration remove entire resources, and the stored dependency ordering must align with the remaining config to be completed correctly. The ability for providers to influence this ordering has been brought up numerous times (see #31309 for the most recent conversation), but there have not yet been any satisfactory ideas for implementation.

Coming back to the specific example here, I'm not sure I see anything here we can fix with any ordering, as most of the problems with the behavior are the result of the provider's implementation, and theoretically changing the ordering in Terraform only accounts for one facet of this behavior. The fact that the security group rules can be separated from the security group is mostly an artifact of limitation in the legacy SDK. Seeing how there is a co-dependency between these two resources which are actually part of the same logical resource, it is probably necessary to use the embedded version of the rules to get the behavior you desire.

Aside from the specifics of the example here, can the option you are asking for be described as the following?

  • If an aws_security_group_rule is being replaced on its own, use create_before_destroy=false
  • If all aws_security_group_rule instances are being replaced along with the aws_security_group, then use create_before_destroy=true

While that does seem superficially possible in this limited example, it runs into the same issues mentioned earlier. Altering the order also changes the required ordering of the configuration upstream from the aws_security_group_rule, which means that operations which were already planned before the aws_security_group_rule was planned may no longer be valid, and may also conflict with the planned removal of resources no longer in the configuration. While there probably isn't much "upstream" from a aws_security_group_rule, we have to remember that all rules set by Terraform must apply equally to all possible resource types and combinations. While we could make more complex rules like "you can only conditionally use create_before_destroy if there are no upstream resources", these place additional burden on users to understand and use correctly for a very limited use case which appears to only be necessary to work around a bug in a particular provider.

Since this conditional ordering isn't something that fits well within the existing architecture of Terraform, it's not likely something we can implement. I'm not sure what the AWS provider developers have planned in the future for these resource types, but if they have any questions about the how the implementation would interact with Terraform we encourage them to reach out for advice.

@jbardin jbardin added the waiting-response An issue/pull request is waiting for a response from the community label Jun 27, 2022
@apparentlymart
Copy link
Contributor

From reading hashicorp/terraform-provider-aws#25173 it does seem to me like the crux of this is that it's only valid to create_before_destroy a security group rule when it's the security group ID that's changing, because security group rules don't each have their own explicit identifier and so instead the AWS provider identifies them based on their entire content.

If we try to create_before_destroy when we're not changing the security group ID then we can potentially see conflicts trying to create the same rule twice, but if the security group ID is changing then the new rule cannot possibly conflict with the old one because the uniqueness constraint only applies within a single security group.

With that said, I share @jbardin's uncertainty about to what extent Terraform Core can support this, because Terraform Core doesn't really know what a security group or a security group rule are and so relies entirely on the AWS provider to define that behavior.

This does make me think of #22094, which was a general idea about allowing providers to give Terraform Core more information about the dynamic relationships between objects, although I've not yet thought deeply about whether that sort of "global ID" strategy could help in this case. This could potentially be an example of the "containment relationship" I described over there: the security group rules are each contained within their security group, and so perhaps in that case Terraform Core could notice that it's planning to replace the security group and know that the existing security group rule will get deleted as a side-effect of deleting the security group. Could Terraform Core therefore infer that the new security group rule is somehow independent of the old one and apply a different dependency ordering to it? Maybe... but these things tend to be pretty subtle and it's often hard to devise one rule that works for everything, and so if we went this strategy we may wish to devise a different sort of relationship that is more specifically aimed at this problem.

@Nuru
Copy link
Author

Nuru commented Jun 27, 2022

@jbardin , @apparentlymart

Thank you both for your prompt and thorough replies. You have obviously thought about resource creation/destruction ordering in a lot more depth than I have, and I defer to your expertise as to why the order is statically determined and not subject to influence by count = 0. Although the reasoning behind the requirement that "stored dependency ordering must align with the remaining config to be completed correctly" eludes me at the moment, I will not ask you to elaborate on it.

(Side note: I set create_before_destroy = false in the example code for emphasis, not because I thought it had any practical effect.)

The fact that setting create_before_destroy on null_resource forces the aws_security_group_rule to use create_before_destroy too was surprising to me, but after thinking it through I see how that is a requirement: you cannot create the null_resource until after all its dependencies are created.

Aside from the specifics of the example here, can the option you are asking for be described as the following?

* If an `aws_security_group_rule` is being replaced on its own, use `create_before_destroy=false`

* If all `aws_security_group_rule` instances are being replaced along with the `aws_security_group`, 
  then use `create_before_destroy=true`

That's pretty close. The first point is completely accurate. I would restate the second point:

  • If a new security group is being created before the old one is destroyed, the new rules that depend on it should also be created before the old ones are destroyed.

The "containment relationship" that @apparentlymart mentions is at least a great starting point for further elaboration. With AWS Provider version 4.0 the provider has moved toward providing more and more separate resources that are, in fact, part of a containing resource. The security group rules being contained by the security group is one example. The breaking changes in v4.0.0 regarding S3 provide several additional examples.

Perhaps this is something that has to be done by the provider; for sure it would be something nice for the provider to do. Until then, could we create a variant of depends_on or triggers such as contained_by or identified_by that drives the desired behavior?

resource "aws_security_group_rule" "example" {
  contained_by = aws_security_group.cbd
  # or
  identified_by = aws_security_group.cbd.id
...

I don't know the internals of Terraform nearly well enough to suggest an implementation, but conceptually it could be thought of as meaning that the resource has the security group ID as part of the resource address. You would not need to use the actual ID, you could just use some stand-in that changes when and only when the security group resource is replaced.

@jbardin
Copy link
Member

jbardin commented Jun 28, 2022

Thanks @Nuru,

Since this is already linked to #22094, I think we can continue to use that issue to track any development along the same lines. Regardless of the configuration syntax one might use, or if Terraform can determine these relationships automatically, we first need to determine what a suitable provider protocol might look like, and how this new lifecycle could work within the architecture of Terraform.

Thanks!

@jbardin jbardin closed this as completed Jun 28, 2022
@Nuru
Copy link
Author

Nuru commented Jun 28, 2022

In case it helps further the discussion, or helps someone else in a similar situation, the workaround I have chosen is to force the creation of a new security group (using create before destroy) whenever any of the security group rules change. Still tricky to make that happen without running into problems at plan time, but so far it seems to be working.

@crw crw added duplicate issue closed because another issue already tracks this problem and removed waiting-response An issue/pull request is waiting for a response from the community new new issue not yet triaged labels Jun 29, 2022
@github-actions
Copy link
Contributor

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
duplicate issue closed because another issue already tracks this problem enhancement
Projects
None yet
Development

No branches or pull requests

4 participants