
thinking: dependency graphs are hard, so just don't do them #1732

Closed
bitglue opened this issue Apr 29, 2015 · 16 comments

Comments

@bitglue

bitglue commented Apr 29, 2015

The resource dependency graph in Terraform is problematic.

It's hard for users to get right (#934), which leads to kludges (#1252) which create new problems (#1368).

It's hard for developers to get right (#922).

For every issue filed there are a lot more. For example, it seems the dependency information exists only in the configuration, but not in the state file. Thus, when resources are removed from the configuration, a plan is made to delete them, but not necessarily in the right order.

Also, the code to handle all this is really complex.

So I propose an alternate solution: instead of trying to build a DAG and traversing it, use a much simpler randomized algorithm (sketched in code after the list below): for each thing that needs to be created, destroyed, or updated, arbitrarily pick one and try doing it. If it doesn't work, just try again later. The process terminates when either:

  • all pending operations have completed (success)
  • an iteration across all operations has occurred with none of them completing (failure)
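
A minimal sketch of that loop, in the same Python-flavored pseudocode used later in this thread; the names (attempt, DONE, NOT_READY, apply_all) are illustrative, not real Terraform internals:

# Sketch only: "attempt" is assumed to perform one pending create / destroy /
# update, returning DONE on success, NOT_READY if prerequisites aren't
# satisfied yet, and raising on a real failure.
DONE, NOT_READY = "done", "not_ready"

def apply_all(pending):
    while pending:
        progressed = False
        remaining = []
        for operation in pending:
            result = operation.attempt()
            if result == DONE:
                progressed = True            # completed; drop from the queue
            elif result == NOT_READY:
                remaining.append(operation)  # try again on a later pass
        if not progressed:
            # A full pass over every pending operation completed nothing, so
            # further progress is impossible: report failure.
            raise RuntimeError("no pending operation could make progress")
        pending = remaining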

This is essentially the approach taken by #1252, except:

  • it knows when further progress is impossible, and terminates (instead of retrying until some timeout expires even when progress is impossible)
  • it solves the problem for all resources, and all operations (instead of for only some operations on some resources under some specific circumstances)

Of course the disadvantage of this approach is higher computational complexity, but given that most of Terraform's time is spent waiting for a remote API to respond, I think the difference in run time will be insignificant in practice. That seems like a very small cost for a change that would fix many bugs, vastly simplify the code base, and eliminate a very large class of possible configuration errors.

Furthermore, with some really simple heuristics, the number of failed attempts can be kept very small in practice. For example, we know that in AWS, VPCs must exist before subnets, and subnets before EC2 instances, and destroying them must happen in the opposite order.
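
As a sketch of such a heuristic (the resource types are real AWS concepts, but the priority table and helper are invented for illustration): providers could publish rough ordering hints so the first pass usually tries things in a workable order, while the retry loop above still corrects anything the hints miss.

# Hypothetical creation-order hints; destroy order would simply be reversed.
# Lower numbers are attempted first; unknown types go last and rely on retries.
CREATE_PRIORITY = {"aws_vpc": 0, "aws_subnet": 1, "aws_instance": 2}

def order_for_create(pending):
    return sorted(pending,
                  key=lambda op: CREATE_PRIORITY.get(op.resource_type,
                                                     len(CREATE_PRIORITY)))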

@mitchellh
Contributor

This is an interesting thought, but I'm against it.

The reason it's the source of all these bugs is simply that it's the only real component involved in most bugs. Also, as you said, it is a hard thing to get right. The power, I think, is that once we get it right 100% (and we're much, much closer; only a couple of known issues remain), it'll be really powerful. We've already met with a lot of companies that have actually tried this random approach, and for various reasons it didn't work for them. We get a lot of praise from these companies for trying to do it "the right way" (as they see it, and as we see it too).

Furthermore, just to address your points: we do store the graph in the state. The issue you point to is a bug with how modules are handled; it isn't a lack of information in the state, but simply a lack of verbosity in the graph itself. We've stored the graph in the state since v0.2, I believe.

But for practical reasons, I believe it is still the right approach: we're about to implement pre-destroy, pre-create, etc. provisioner hooks. Some provisioning depends on other nodes, so the order has to be:

  1. create resource A
  2. create resource B
  3. provision resource A
  4. provision resource B
  5. post-create resource A

The feature above can't be randomized: it has a very strict order, and it's hard to do this without a graph.

I'm not worried about computational complexity, for the same reasons you enumerate.

Additionally, CloudFormation's internals, from what I've heard, are also just heuristics and randomization, and it very often doesn't complete things at all (it just errors; you run it again and it works). Terraform very rarely has this issue, now or in the past, and its behavior has always been fairly consistent. And this is thanks to the graph.

I think as we build more features into Terraform, the features we have planned out have a pretty core requirement on the graph.

@bitglue
Author

bitglue commented Apr 29, 2015

The feature above can't be randomized: it has a very strict order, and it's hard to do this without a graph.

A randomized algorithm for graph traversal is not the same as traversing a graph in random order. If you have an operation which can't happen until some other things have happened first, then you implement that operation as:

def do_the_thing():
  if the_things_that_must_happen_first_are_not_done_yet():
    return NotReadyYet

  really_do_the_thing()

The operations are still guaranteed to happen in the right order.

But what this allows you to do is distribute the knowledge of the "right order" among the implementations of the operations, rather than trying to have one thing which tries to build a comprehensive, error-free graph beforehand.

And that allows you to incorporate outside knowledge about what the "right order" is. AWS can tell you what the right order is. If you try to delete something in the wrong order, it returns a DependencyViolation. If you try to create something in the wrong order, it will give you an InvalidInstanceID.NotFound or similar. AWS knows a lot better than Terraform what the right order is. To be bug-free, Terraform, or the user, needs to know everything that AWS knows.

But that's impossible. The AWS API is eventually consistent, and some state like "does this instance exist?" has a different value depending on which node you end up asking, and at what time. You cannot distill it to a boolean value. They say:

To manage eventual consistency, you can do the following:

  • Confirm the state of the resource before you run a command to modify it. Run the appropriate Describe command using an exponential backoff algorithm to ensure that you allow enough time for the previous command to propagate through the system. To do this, run the Describe command repeatedly, starting with a couple of seconds of wait time, and increasing gradually up to five minutes of wait time.
  • Add wait time between subsequent commands, even if a Describe command returns an accurate response. Apply an exponential backoff algorithm starting with a couple of seconds of wait time, and increase gradually up to about five minutes of wait time.

Terraform currently does the first thing, sometimes, for some operations on some resources. However, it never does the second thing, as far as I've seen. I've run into the consequences: Terraform configurations which fail the first time but complete on a second run. And it's not a question of Terraform's graph being incorrect: it's a consequence of Terraform making assumptions about the AWS data model that are not true.

Because of eventual consistency, Terraform cannot know whether an operation has taken effect or not. You might call a Describe method and see that some instance was created, but there is no guarantee that the next API call you make will see the same state of the world. The only guarantee is that eventually it will. The only approach which is guaranteed to be correct is to keep trying things until they work, or until you decide to give up.

In other words, Terraform makes the assumption that it can know something like, "is this instance created or not?" But that's a fallacious assumption: in reality that question has many answers, depending on which host at Amazon you happen to ask.

I can find dozens of places where the AWS provider has had to deal with eventual consistency: just grep for WaitForState. I can find even more places where there is no correct handling of eventual consistency, and things happen to work by luck, until they don't. And after you've tracked down all these unlucky, hard-to-reproduce cases across all the resources and operations and wrapped the individual API calls in WaitForState, you've essentially implemented the algorithm I've proposed here.
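
For concreteness, the kind of per-call wrapper being described (exponential backoff around a Describe-style check, as in the AWS guidance quoted above) might look roughly like this; the names are hypothetical and this is not the real WaitForState code:

import time

def wait_until_visible(describe, timeout=300):
    """Poll a Describe-style call until eventual consistency catches up.

    `describe` stands in for whatever Describe call the resource needs; it
    should return True once the resource is visible.
    """
    wait = 2          # start with a couple of seconds, per the AWS guidance
    waited = 0
    while True:
        if describe():
            return
        if waited >= timeout:
            raise TimeoutError("resource never became visible")
        time.sleep(wait)
        waited += wait
        wait = min(wait * 2, 300)   # back off exponentially, cap at 5 minutes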

@mitchellh
Contributor

Everything you've described sounds more like a provider bug than a Terraform core issue (or even a graph issue). I don't think it makes sense to do a random retry on every graph node in Terraform because AWS needs it. I think the proper solution is to wrap retries around the resources in AWS. For example, Google's API is not eventually consistent, nor is Azure's; when they tell you something is done, it is done. The same is true of most of the other providers that TF implements.

AWS is the only one that has a pretty bad developer experience for this reason. I think the correct approach is to make it easier within the AWS provider to wrap the resources. WaitForState is one approach we've taken, but perhaps we can make it even easier by wrapping them at a higher level. One real issue is that you don't get idempotency for free with AWS; you have to just retry everything. For example, creating an AWS instance isn't one API call, it's a few. If the one in the middle messes up, you have to deal with that, and a blanket "just retry the whole create" won't fix it. It still requires the providers to be a bit smarter than that no matter what.
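
One way to read "wrapping them at a higher level" is a helper that retries each individual API step of a multi-call create, so the create resumes at whichever step tripped over eventual consistency. A hedged sketch with invented names (AWSError, retry_step, is_consistency_error), not the AWS provider's actual helpers:

import time

class AWSError(Exception):
    """Stand-in for whatever error type the AWS SDK raises."""

def is_consistency_error(err):
    # The error codes come from the examples earlier in this thread.
    return "DependencyViolation" in str(err) or ".NotFound" in str(err)

def retry_step(step, attempts=5):
    """Retry one API step only when the error looks like an AWS
    eventual-consistency or ordering problem; real failures propagate."""
    last_error = None
    for i in range(attempts):
        try:
            return step()
        except AWSError as err:
            if not is_consistency_error(err):
                raise                   # don't mask real failures
            last_error = err
            time.sleep(2 ** i)          # exponential backoff
    raise last_error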

@ravenac95

@mitchellh I haven't been able to dive into terraform internals, but is there not a retry loop that ensures that a specific resource gets created upon a failed step? I'm not sure if it's antithetical to the TF design philosophy, but it seems like a tiny UX enhancement to simply run the failing step a few times and after a certain point recognize that you're beating a dead horse. Granted, it's possible I'm not recognizing a scenario where you wouldn't want that to happen. In a case like that however, perhaps something like a noRetry flag on the resource definition could suffice.

At the end of the day, it's not the worst thing in the world to rerun TF; it would just be something beautiful to run TF and have it be robust enough to handle the eventual-consistency issues of something like AWS.

@mitchellh
Contributor

@ravenac95 Yeah we have the ability to do that. We don't retry a resource within a single run right now but I suppose we could. I believe the CRUD methods are supposed to be idempotent but let me check the docs.

@bitglue
Author

bitglue commented Apr 30, 2015

Everything you've described sounds more like a provider bug than a Terraform core issue (or even a graph issue). I don't think it makes sense to do a random retry to every graph node in Terraform because AWS needs it.

Well maybe so, but that's not the only reason I'm making this proposal. What about cases where there are implicit dependencies, such as in #934? AWS can't be the only provider where something like this can happen.

You could teach Terraform about every dependency every provider might have, like "you need to terminate all the EC2 instances before you can delete the internet gateway", and so on. That sounds like a lot of work to me. Or you can burden the user with specifying every dependency explicitly.

Or, you can take the naive approach from #1252 where an individual provider just retries an operation with no knowledge about the rest of the system, assuming, frequently incorrectly, that some other part of Terraform is eventually going to make the change necessary for progress.

Or, you can do this:

provider: I can't do my thing right now.
terraform: Noted. Let me get back to you.
terraform: OK, I was able to change some things that might have fixed your problem. Can you try again?
(repeat until either)
provider: Thanks! That worked!
(or)
provider: Nope, still can't do my thing.
terraform: Darn. I don't have anything else to try, so it's unlikely any further attempts to retry will succeed. I'll stop now, and pass your failure on to the user.

Again, I want to emphasize that this approach is not a random approach. You can still have deterministic ordering of operations, every bit as much as you do with the current DAG traversal. The functional difference is that instead of requiring the core of Terraform to know everything about the correct order of operations before it can even begin, the correct order of operations becomes a distributed problem to which providers can bring their specific knowledge. And it's a hell of a lot easier and simpler to implement to boot, which in practice means fewer bugs.
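
Concretely, under this model a provider's delete step would not loop on its own; it would just translate AWS's own ordering errors into the "not ready yet" signal and let core decide whether another pass is worth making. A sketch, reusing the DONE/NOT_READY convention from the earlier sketch (delete_internet_gateway and aws_delete_internet_gateway are stand-ins, not real functions):

def delete_internet_gateway(gateway_id):
    try:
        aws_delete_internet_gateway(gateway_id)  # stand-in for the real API call
    except Exception as err:
        # AWS already knows the correct order: DependencyViolation means some
        # dependent resource still exists, so core should retry this operation
        # after it has made progress elsewhere.
        if "DependencyViolation" in str(err):
            return NOT_READY
        raise
    return DONE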

@mitchellh
Contributor

We haven't thought about implicit dependencies yet, but we'll certainly have to cross that bridge. Because we haven't thought about them, I apologize that I can't really discuss them with any sound reasoning.

We'll keep this in mind.

@mitchellh
Contributor

Pressed enter too soon. :)

I think as the issue comes up further we'll keep it in mind. There is likely some hybrid approach here which is doable, but like I said, we haven't given it much thought. In response to the original issue above, though: we can't throw away the dependency graphs, or won't, however you choose to interpret it. From our perspective, they add a lot of value and correctness. There are edge cases, as you brought up, and I agree with you that the current implementation makes handling some of them difficult, but we'll think about them soon enough.

So I think we're in agreement that there are issues with implicit dependencies (I never disagreed with that), but I disagree with your original point that we should throw away the dag.

@knuckolls
Contributor

Fwiw, I think this is an interesting idea. That said, the dag is the single most valuable piece of terraform in my opinion.

@mitchellh
Contributor

@knuckolls Agreed, I think there is a hybrid approach that can take some ideas from here along with our dag to get the best of both worlds in the future.

@knuckolls
Contributor

👍

@bitglue
Author

bitglue commented Apr 30, 2015

I think this proposal is being fundamentally misunderstood. There is the DAG, and then there are the semantics that it provides. Is it actually the DAG that's valuable? Or is it the semantics it provides?

@mitchellh
Contributor

Perhaps I am misunderstanding.

I think both are valuable, but we get the semantics of a DAG from the DAG, and we get more from the DAG than just those semantics. There have been multiple features (most notably create-before-destroy) that Paul, Armon, and I have built or looked at and said, "well, this would be pretty much impossible without a graph abstraction."

I think this issue has been valuable, I think retry logic is important and of course we have to think about implicit dependencies better (but not in this issue, since we have another for that).

If you think the DAG should be gotten rid of, then I think seeing code would be better here than endless commenting: none of the tests in context_test.go rely on the graph; they're nearly black box. If you make a DAG-less Terraform that can pass the refresh, plan, apply, and apply-with-create-before-destroy tests, that would be interesting to see.

@mitchellh
Contributor

I'm going to silence this issue now, I don't think there is any more I can get out of it. Thanks for bringing it up, and let me know if you get that PR in. I'm sorry to do that to you, but the last few comments have been mostly a lot of words without a lot of progress being made on either side. I'd rather let the code speak.

@mitchellh
Contributor

(silence the issue for me, I'm no longer watching it. I won't lock the topic)

@ghost

ghost commented May 3, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators May 3, 2020