
Fix two serious problems when using links in Docker containers #2327

Merged · 3 commits · Jun 29, 2015

Conversation

@jefferai (Member)

  1. When linking to other containers, introduce a slight delay; this gives the Docker API time to get those containers running. Otherwise, when you try to start a container that links to them, the start command will fail, leading to an error.

  2. Links cause a container to be returned with more than one name. As a result, looking only at the first element of the container's names could cause the container not to be found, leading Terraform to remove it from state and attempt to recreate it.
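For reference, a minimal sketch of the second fix, assuming the fsouza/go-dockerclient library that Terraform's Docker provider is built on; the findContainer helper is hypothetical. Links cause a container to be listed under several names (e.g. "/db" plus an alias like "/app/db"), so the lookup has to scan all of Names rather than just Names[0]:

    import (
        "strings"

        docker "github.com/fsouza/go-dockerclient"
    )

    // Hypothetical helper: match a container by any of its names rather
    // than only Names[0], since links add extra entries to the slice.
    func findContainer(containers []docker.APIContainers, name string) (docker.APIContainers, bool) {
        for _, c := range containers {
            for _, n := range c.Names {
                // Docker reports names with a leading slash, e.g. "/db";
                // the container's own name still matches here even when
                // link aliases such as "/app/db" are also in the slice.
                if strings.TrimPrefix(n, "/") == name {
                    return c, true
                }
            }
        }
        return docker.APIContainers{}, false
    }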

@jefferai changed the title from "When linking to other containers, introduce a slight delay; this lets…" to "Fix two serious problems when using links in Docker containers" on Jun 12, 2015
@phinze (Contributor) commented Jun 12, 2015

A bit confused on the scenario necessitating (1) - if you're referencing other containers managed by Terraform, there will be a dependency drawn causing them to be started in order.

(2) LGTM

@jefferai (Member, Author)

@phinze Docker's API is asynchronous, so they're started in the correct order, but too quickly. The first container doesn't have time to get into the "running" state before the second one is started; when the second one starts and goes through its checks, it sees that the first container is not yet running and bails, saying that it can't link to a non-running container.

One possibility, rather than sleeping for a few seconds to avoid this condition altogether, would be to loop with very short delays until the start command succeeds. However, if the start command fails for some reason other than links, it would probably be preferable not to keep reissuing it over and over; I can see potential snowballing situations. That said, looping would make the delay shorter and a bit less arbitrary.
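A rough sketch of that retry alternative (not what was ultimately adopted), again assuming go-dockerclient; startWithRetry, the attempt count, and the delay are all illustrative:

    import (
        "fmt"
        "time"

        docker "github.com/fsouza/go-dockerclient"
    )

    // Hypothetical sketch of the retry idea: keep reissuing the start
    // command with short delays until it succeeds or we give up. The
    // downside discussed above: failures unrelated to links get retried too.
    func startWithRetry(client *docker.Client, id string, attempts int, delay time.Duration) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = client.StartContainer(id, nil); err == nil {
                return nil
            }
            time.Sleep(delay)
        }
        return fmt.Errorf("container %s failed to start after %d attempts: %s", id, attempts, err)
    }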

@jefferai (Member, Author)

@phinze Any update, or comments on the previous answer?

@mitchellh (Contributor)

@jefferai Is there any status API we can check for the container to be running? In other providers we use that. Otherwise, a sleep would be easiest I think, though it has the weak points that you brought up. If you feel confident you can make the start robust enough, then retrying that would be fine too.

@jefferai (Member, Author)

@mitchellh There is a status API. I'll refactor to check/sleep on the status; if enough time goes by without the linked-to container coming up (30 seconds, maybe?), then bail and attempt to clean up. Does that sound good?
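Roughly what that could look like against go-dockerclient's inspect API -- a sketch under those assumptions; waitForRunning, the 500ms poll interval, and the 30-second figure are illustrative:

    import (
        "fmt"
        "time"

        docker "github.com/fsouza/go-dockerclient"
    )

    // Hypothetical sketch: poll the container's status until it reports
    // State.Running, giving up after a timeout (e.g. 30s, per the discussion
    // above). The caller would then bail and attempt to clean up.
    func waitForRunning(client *docker.Client, id string, timeout time.Duration) error {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            container, err := client.InspectContainer(id)
            if err != nil {
                return fmt.Errorf("error inspecting container %s: %s", id, err)
            }
            if container.State.Running {
                return nil
            }
            time.Sleep(500 * time.Millisecond)
        }
        return fmt.Errorf("container %s did not enter the running state within %s", id, timeout)
    }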

@mitchellh (Contributor)

Yep!

@jefferai (Member, Author)

@mitchellh One note: this may mean containers bailing when the linked-to container needs to download a large number of Docker layers, which should generally be only a first-time problem. Either the timeout could be made configurable, or the user could simply re-run terraform in this instance, since by then the layers should already be downloaded.

@mitchellh (Contributor)

I'll defer to you for the best user experience... I wonder if we just artificially bump the timeout if the image didn't exist? Do we know that? I'm not sure. :( Does the status API not tell us if it is downloading images?

@jefferai (Member, Author)

As far as I know, the status API doesn't tell us if it is downloading images. The problem is that we may not know which layers the linked-to container needs to fetch -- so we could try to detect whether the linked-to container's stated image has been downloaded yet, and keep advancing the timeout until we see it downloaded, at which point we give the container X seconds to start up. More complicated, but probably better UX. I'll head down that path.
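A hedged sketch of that image-presence check with go-dockerclient (imageDownloaded is a hypothetical helper; InspectImage returns docker.ErrNoSuchImage when the image isn't local):

    import (
        docker "github.com/fsouza/go-dockerclient"
    )

    // Hypothetical sketch: check whether the linked-to container's stated
    // image is present locally. While it is still missing, the caller could
    // keep advancing the startup deadline; once it appears, give the
    // container a fixed window to reach the running state.
    func imageDownloaded(client *docker.Client, image string) (bool, error) {
        _, err := client.InspectImage(image)
        if err == docker.ErrNoSuchImage {
            return false, nil
        }
        if err != nil {
            return false, err
        }
        return true, nil
    }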

@phinze (Contributor) commented Jun 24, 2015

@jefferai in other resources, we've made it the responsibility of the Create to block until the resource is truly ready. So this means that the dependency should block on Create until it's running, and the dependent resource can safely assume all its dependencies are ready. This has worked nicely for us elsewhere - do you think it would work here as well?

@jefferai (Member, Author)

@phinze Yes -- see the discussion above with Mitchell. That's the plan. :-)

@phinze (Contributor) commented Jun 24, 2015

@jefferai cool - just wanted to make sure that we're polling on the tail end of create, not on the presence of the linked container

@jefferai (Member, Author)

@phinze The plan as I understand it:

  1. Look for the linked container. Note that you can specify it like:

    links = [
        "${docker_container.db.name}:db"
    ]

This should cause Terraform's dependency management to kick in, which should mean that creation of the dependent container does not happen until the dependency has been created.

Note that the problem here is about startup. Looking back at the conversation today, I realize I muddled things up a bit. Using the syntax above, the other container should have fully pulled its layers before the current container is started, because it needs to do so in order to start. This actually simplifies things.

Therefore, if at this point the linked container does not exist, bail -- something is wrong.

  2. If the container is not running, refresh its state every half second or so to check whether it has entered the running state.

  3. If it never enters the running state, bail -- we won't be able to start the current container either.

There are other possible states that the container can be in, but since a container can be started after a previous failure, the other states aren't really reliable. The one thing we can check in a semi-reliable fashion is whether the linked container has exited after we began checking its status -- anything before that could simply be the container not yet having successfully executed its start command (the very problem this change is trying to solve).
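Tying those steps together with the hypothetical helpers sketched earlier (findContainer and waitForRunning), the pre-start check for the dependent container might look roughly like this fragment; client, the "db" name, and the timeout are assumptions:

    // Hypothetical glue for the plan above, reusing the earlier sketches.
    containers, err := client.ListContainers(docker.ListContainersOptions{All: true})
    if err != nil {
        return err
    }
    linked, ok := findContainer(containers, "db")
    if !ok {
        // Step 1: the linked container doesn't exist at all -- bail.
        return fmt.Errorf("linked container %q not found", "db")
    }
    // Steps 2 and 3: poll every 500ms until running, bailing at the timeout.
    if err := waitForRunning(client, linked.ID, 30*time.Second); err != nil {
        return err
    }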

@phinze (Contributor) commented Jun 24, 2015

@jefferai my point is that the docker_container.db resource should never have completed until it was in a usable state, which allows the resource specifying it in links (let's call it docker_container.app) to simply assume it's in the proper state. The Terraform dependency drawn by the link means that docker_container.db must be visited before docker_container.app is handled, and if we make it the responsibility of each resource to block until it is ready to be used, no logic needs to be performed for handling links - they can be assumed ready, or error out if not.

@jefferai (Member, Author)

@phinze I see what you're saying, but there is one problem, which is that containers can start and then fail. So right now Terraform checks whether Docker's StartContainer command returns successfully; if it does, then Terraform also returns successfully. In between then and when a container linking to it starts, the container could enter the running state and then fail. This could be very rapid, or it could take some time, depending on a lot of factors.

So I could put in logic to wait for the container to get into the running state before Terraform returns from creating the container -- but it can still happen that the container runs and then exits, and so linking from the next container fails. Of course, we could just say that in that scenario we're going to punt and allow the Docker error to get through to the user, because the container you're trying to link to is in fact not running.

What do you think?

@phinze (Contributor) commented Jun 24, 2015

@jefferai Yep - that makes sense, but I think the general division of responsibilities still applies:

  • Create should make a best-effort to ensure that the resource has started successfully before it returns
  • Subsequent resources should rely on this behavior in Create, and assume that their dependencies are ready to be used

I think adhering to this pattern will keep the implementation simpler overall.

So if we'd like to handle "container starts and fails within N seconds" gracefully (which I could see being a valuable feature, perhaps exposing N as a tunable config value? see the sketch below), that's still something we'd tack onto the end of Create. If a container fails in the time between being created and being linked, that's a failure we expose immediately, informing the user that they can increase N or investigate their environment for the source of the failure.

How does that sound to you?
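If N were exposed as a tunable config value, it might look like a new optional attribute on the docker_container resource. This is purely illustrative -- no such attribute exists in this PR -- using Terraform's helper/schema package; the start_timeout name and default are invented:

    import "github.com/hashicorp/terraform/helper/schema"

    // Hypothetical attribute: how long Create waits for the container to
    // reach the running state before failing. Not part of this PR.
    var startTimeoutSchema = map[string]*schema.Schema{
        "start_timeout": &schema.Schema{
            Type:        schema.TypeInt, // seconds
            Optional:    true,
            Default:     15,
            Description: "Seconds to wait for the container to enter the running state",
        },
    }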

@jefferai (Member, Author)

Sounds good. I have a fix coded now; tomorrow I will test it out and push when I'm satisfied with it.

@mitchellh (Contributor)

Thanks @jefferai

@jefferai (Member, Author)

@phinze So I have changes that implement the behavior you suggested. I do want to point out that if the container comes up and then fails very rapidly, it will still have entered the "running" state, so downstream containers will think things are fine until they fail to actually start. But I agree that this is probably the correct behavior to code in; an enhancement on the starting side of the secondary container would be a good future improvement. I'll push this in just a moment.

I did notice that if a container Y links to container X, and container X is restarted (due, for instance, to a configuration change), container Y is likely to then fail and stop. Arguably container Y should be restarted in this instance, but it's definitely not a slam-dunk line of reasoning: if something you depend on fails and comes back up, you may want to, for example, examine the database for corruption before restarting an app that connects to it.

I think this pull request can be merged; future enhancements can come in future pull requests.

@jefferai added a commit (message excerpt): "…favor of attempting to detect if the initial container ever enters running state, and erroring out if not. It will re-check the container once every 500ms for 15 seconds total; future work could make that configurable."
@phinze (Contributor) commented Jun 29, 2015

LGTM! Thanks for the time and effort here @jefferai

phinze added a commit that referenced this pull request on Jun 29, 2015: "Fix two serious problems when using links in Docker containers"

@phinze merged commit b26df75 into hashicorp:master on Jun 29, 2015
@ghost commented May 1, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators on May 1, 2020