Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix two serious problems when using links in Docker containers #2327

Merged
merged 3 commits into from
Jun 29, 2015

Conversation

jefferai
Copy link
Member

  1. When linking to other containers, introduce a slight delay; this lets the Docker API get those containers running. Otherwise when you try to start a container linking to them, the start command will fail, leading to an error.

  2. Links cause there to be more than one name for a container to be returned. As a result, only looking at the first element of the container names could cause a container to not be found, leading Terraform to remove it from state and attempt to recreate it.

@jefferai jefferai changed the title When linking to other containers, introduce a slight delay; this lets Fix two serious problems when using links in Docker containers Jun 12, 2015
@phinze
Copy link
Contributor

phinze commented Jun 12, 2015

A bit confused on the scenario necessitating (1) - if you're referencing other containers managed by Terraform, there will be a dependency drawn causing them to be started in order.

(2) LGTM

@jefferai
Copy link
Member Author

@phinze Docker's API is asynchronous, so they're started in the correct order, but too quickly. The first container doesn't have a time to get into the "running" state before the second one is started; when the second one is started and going through its checks, it sees that the first container is not yet running and bails, saying that it can't link to a non-running container.

One possibility, rather than sleeping for a few seconds to avoid this condition altogether, could be to go through a loop with very short delays until the start command succeeds. However, if the start command fails for some other reason than links, it would probably be preferable not to keep issuing the command over and over. I can see potential snowballing situations. That said, it would cause the delay to be shorter and a bit less arbitrary.

@jefferai
Copy link
Member Author

@phinze Any update, or comments on the previous answer?

@mitchellh
Copy link
Contributor

@jefferai Is there any status API we can check for the container to be running? In other providers we use that. Otherwise, a sleep would be easiest I think, though it has the weak points that you brought up. If you feel confident you can make the start robust enough, then retrying that would be fine too.

@jefferai
Copy link
Member Author

@mitchellh There is a status API. I'll refactor to check/sleep on the status; if enough time goes by without the linked-to container coming up (30 seconds, maybe?) then bail, and attempt to clean up. Does that sound good?

@mitchellh
Copy link
Contributor

Yep!

@jefferai
Copy link
Member Author

@mitchellh One note is that this may mean containers bailing if the linked-to container needs to download a large number of Docker layers, which should generally only be a first-time problem. Either the timeout could be configurable, or the user could simply re-run terraform in this instance, and the layers should already be downloaded.

@mitchellh
Copy link
Contributor

I'll defer to you for the best user experience... I wonder if we just artificially bump the timeout if the image didn't exist? Do we know that? I'm not sure. :( Does the status API not tell us if it is downloading images?

@jefferai
Copy link
Member Author

As far as I know the status API doesn't tell us if it is downloading images. The problem is that we may not know what layers the linked-to container needs to fetch -- so we could try to detect whether the linked-to container's stated image is currently downloaded, and keep advancing the timeout until we see it downloaded, at which point we give the container X seconds to start up. More complicated, but probably better UX. I'll head down that path.

@phinze
Copy link
Contributor

phinze commented Jun 24, 2015

@jefferai in other resources, we've made it the responsibility of the Create to block until the resource is truly ready. So this means that the dependency should block on Create until it's running, and the dependent resource can safely assume all its dependencies are ready. This has worked nicely for us elsewhere - do you think it would work here as well?

@jefferai
Copy link
Member Author

@phinze Yes -- see the discussion above with Mitchell. That's the plan. :-)

@phinze
Copy link
Contributor

phinze commented Jun 24, 2015

@jefferai cool - just wanted to make sure that we're polling on the tail end of create, not on the presence of the linked container

@jefferai
Copy link
Member Author

@phinze The plan as I understand it:

  1. Look for the linked container (note that you can specify it like
    links = [
        "${docker_container.db.name}:db"
    ]

This should cause the dependency management of Terraform to kick in, which should mean that creation of the dependent container should not happen until the dependency is created.

Note that the problem here is about startup. I realize looking at the conversation today that I muddled things up a bit. Using the syntax above, before the current container is started up, the other container should have fully pulled its layers, because it needs to do this in order to start. This actually simplifies things.

Therefore, if at this point the linked container does not exist, bail -- something is wrong.

  1. If the container is not running, refresh its state every half a second or so to check whether it has entered the running state.

  2. If it never enters the running state, bail -- we won't be able to start the current container either

There are other possible states that the container can be in, but since a container can be started after it has previously failed, the other states aren't really reliable. The one thing we can check in a semi-reliable fashion is whether the linked container has exited after we began checking its status -- anything before then could simply be the container not having successfully executed its start command (the very problem that this is trying to solve).

@phinze
Copy link
Contributor

phinze commented Jun 24, 2015

@jefferai my point is that the docker_container.db resource should never have completed until it was in a usable state, which allows the resource specifying it in links (lets call it docker_container.app) to simply assume it's in the proper state. The terraform dependency drawn by the link means that docker_container.db needs to be visited before docker_container.app is handled, and if we make it the responsibility for each resource to block until it is ready to be used, no logic needs to be performed for handling links - they can be assumed ready / error if not.

@jefferai
Copy link
Member Author

@phinze I see what you're saying, but there is one problem, which is that containers can start and then fail. So right now Terraform checks whether Docker's StartContainer command returns successfully; if it does, then Terraform also returns successfully. In between then and when a container linking to it starts, the container could enter the running state and then fail. This could be very rapid, or it could take some time, depending on a lot of factors.

So I could put in logic to wait for the container to get into the running state before Terraform returns from creating the container -- but it can still happen that the container runs and then exits, and so linking from the next container fails. Of course, we could just say that in that scenario we're going to punt and allow the Docker error to get through to the user, because the container you're trying to link to is in fact not running.

What do you think?

@phinze
Copy link
Contributor

phinze commented Jun 24, 2015

@jefferai Yep - that makes sense, but I think the general division of responsibilities still applies:

  • Create should make a best-effort to ensure that the resource has started successfully before it returns
  • Subsequent resources should rely on this behavior in Create, and assume that their dependencies are ready to be used

I think adhering to this pattern will keep the implementation simpler overall.

So if we'd like to handle "container starts and fails within N seconds" gracefully (which I could see being a valuable feature, perhaps exposing N as a tunable config value?) that's still something we'd tack at the end of Create. If a container fails in the time between being created and being linked, that's a failure we expose immediately and inform the user that they can increase N or investigate their environment for the source of the failure.

How does that sound to you?

@jefferai
Copy link
Member Author

Sounds good. I have a fix coded now; tomorrow I will test it out and push when I'm satisfied with it.

@mitchellh
Copy link
Contributor

Thanks @jefferai

jefferai added 2 commits June 25, 2015 14:40
the Docker API get those containers running. Otherwise when
you try to start a container linking to them, the start command
will fail, leading to an error.
Links cause there to be more than one name for a container to be
returned. As a result, only looking at the first element of the
container names could cause a container to not be found, leading
Terraform to remove it from state and attempt to recreate it.
@jefferai
Copy link
Member Author

@phinze So I have changes that implement the behavior that you suggested; I do want to point out that if the container comes up and then fails very rapidly it will still have entered "running" state, so downstream containers will think things are fine, until they fail to actually start. But I agree that this is probably correct behavior to code in; making an enhancement on the starting side of the secondary container would be a good future improvement. I'll push this in just a moment.

I did notice that if a container Y links to container X, and container X is restarted (due, for instance, to a configuration change), container Y is likely to then fail and stop. Arguably that container Y should be restarted in this instance, but it's definitely not a slam-dunk line of reasoning, because if something you depend on fails and comes back up, you may want to e.g. examine the database for corruption, before restarting an app connecting to it.

I think this pull request can be merged; future enhancements can come in future pull requests.

favor of attempting to detect if the initial container ever enters
running state, and erroring out if not. It will re-check the container
once every 500ms for 15 seconds total; future work could make that
configurable.
@phinze
Copy link
Contributor

phinze commented Jun 29, 2015

LGTM! Thanks for the time and effort here @jefferai

phinze added a commit that referenced this pull request Jun 29, 2015
Fix two serious problems when using links in Docker containers
@phinze phinze merged commit b26df75 into hashicorp:master Jun 29, 2015
@ghost
Copy link

ghost commented May 1, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators May 1, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants