Fix two serious problems when using links in Docker containers #2327
A bit confused on the scenario necessitating (1) - if you're referencing other containers managed by Terraform, there will be a dependency drawn causing them to be started in order. (2) LGTM |
@phinze Docker's API is asynchronous, so they're started in the correct order, but too quickly. The first container doesn't have time to get into the "running" state before the second one is started; when the second one is started and going through its checks, it sees that the first container is not yet running and bails, saying that it can't link to a non-running container. One possibility, rather than sleeping for a few seconds to avoid this condition altogether, could be to go through a loop with very short delays until the start command succeeds. However, if the start command fails for some reason other than links, it would probably be preferable not to keep issuing the command over and over. I can see potential snowballing situations. That said, it would cause the delay to be shorter and a bit less arbitrary. |
@phinze Any update, or comments on the previous answer? |
@jefferai Is there any status API we can check for the container to be running? In other providers we use that. Otherwise, a sleep would be easiest I think, though it has the weak points that you brought up. If you feel confident you can make the start robust enough, then retrying that would be fine too. |
@mitchellh There is a status API. I'll refactor to check/sleep on the status; if enough time goes by without the linked-to container coming up (30 seconds, maybe?) then bail, and attempt to clean up. Does that sound good? |
Yep! |
@mitchellh One note is that this may mean containers bailing if the linked-to container needs to download a large number of Docker layers, which should generally only be a first-time problem. Either the timeout could be configurable, or the user could simply re-run terraform in this instance, and the layers should already be downloaded. |
I'll defer to you for the best user experience... I wonder if we just artificially bump the timeout if the image didn't exist? Do we know that? I'm not sure. :( Does the status API not tell us if it is downloading images? |
As far as I know the status API doesn't tell us if it is downloading images. The problem is that we may not know what layers the linked-to container needs to fetch -- so we could try to detect whether the linked-to container's stated image is currently downloaded, and keep advancing the timeout until we see it downloaded, at which point we give the container X seconds to start up. More complicated, but probably better UX. I'll head down that path. |
@jefferai in other resources, we've made it the responsibility of the |
@phinze Yes -- see the discussion above with Mitchell. That's the plan. :-) |
@jefferai cool - just wanted to make sure that we're polling on the tail end of create, not on the presence of the linked container |
@phinze The plan as I understand it:
This should cause the dependency management of Terraform to kick in, which should mean that creation of the dependent container should not happen until the dependency is created. Note that the problem here is about startup. I realize looking at the conversation today that I muddled things up a bit. Using the syntax above, before the current container is started up, the other container should have fully pulled its layers, because it needs to do this in order to start. This actually simplifies things. Therefore, if at this point the linked container does not exist, bail -- something is wrong.
There are other possible states that the container can be in, but since a container can be started after it has previously failed, the other states aren't really reliable. The one thing we can check in a semi-reliable fashion is whether the linked container has exited after we began checking its status -- anything before then could simply be the container not having successfully executed its start command (the very problem that this is trying to solve). |
@jefferai my point is that the |
@phinze I see what you're saying, but there is one problem, which is that containers can start and then fail. So right now Terraform checks whether Docker's StartContainer command returns successfully; if it does, then Terraform also returns successfully. In between then and when a container linking to it starts, the container could enter the running state and then fail. This could be very rapid, or it could take some time, depending on a lot of factors. So I could put in logic to wait for the container to get into the running state before Terraform returns from creating the container -- but it can still happen that the container runs and then exits, and so linking from the next container fails. Of course, we could just say that in that scenario we're going to punt and allow the Docker error to get through to the user, because the container you're trying to link to is in fact not running. What do you think? |
@jefferai Yep - that makes sense, but I think the general division of responsibilities still applies:
I think adhering to this pattern will keep the implementation simpler overall. So if we'd like to handle "container starts and fails within N seconds" gracefully (which I could see being a valuable feature, perhaps exposing How does that sound to you? |
Sounds good. I have a fix coded now; tomorrow I will test it out and push when I'm satisfied with it. |
Thanks @jefferai |
@phinze So I have changes that implement the behavior that you suggested; I do want to point out that if the container comes up and then fails very rapidly it will still have entered "running" state, so downstream containers will think things are fine, until they fail to actually start. But I agree that this is probably correct behavior to code in; making an enhancement on the starting side of the secondary container would be a good future improvement. I'll push this in just a moment. I did notice that if a container Y links to container X, and container X is restarted (due, for instance, to a configuration change), container Y is likely to then fail and stop. Arguably that container Y should be restarted in this instance, but it's definitely not a slam-dunk line of reasoning, because if something you depend on fails and comes back up, you may want to e.g. examine the database for corruption, before restarting an app connecting to it. I think this pull request can be merged; future enhancements can come in future pull requests. |
favor of attempting to detect if the initial container ever enters running state, and erroring out if not. It will re-check the container once every 500ms for 15 seconds total; future work could make that configurable.
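For readers following along, the polling described in that commit message could look roughly like the sketch below. This is not the code from the pull request: `stateFunc`, `waitForRunning`, and `errNotRunning` are hypothetical names, and the status check is abstracted into a callback rather than a real Docker client call.

```go
package main

import (
	"errors"
	"time"
)

// stateFunc reports whether the container is currently running.
// Hypothetical stand-in for a call to Docker's container status API.
type stateFunc func() (running bool, err error)

// errNotRunning is returned when the container never reaches the running state in time.
var errNotRunning = errors.New("container did not enter running state before the timeout")

// waitForRunning calls check every interval until it reports running,
// returns an error, or the next poll would fall after the timeout.
func waitForRunning(check stateFunc, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		running, err := check()
		if err != nil {
			// A failed status check is returned immediately rather than retried.
			return err
		}
		if running {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errNotRunning
		}
		time.Sleep(interval)
	}
}
```

With the values mentioned in the commit message, a call would be `waitForRunning(check, 500*time.Millisecond, 15*time.Second)`. As noted earlier in the thread, a container that reaches "running" and then exits quickly would still pass this check.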
LGTM! Thanks for the time and effort here @jefferai |
When linking to other containers, introduce a slight delay; this lets the Docker API get those containers running. Otherwise when you try to start a container linking to them, the start command will fail, leading to an error.
Links cause more than one name to be returned for a container. As a result, looking only at the first element of the container names could cause a container to not be found, leading Terraform to remove it from state and attempt to recreate it.
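The second fix can be illustrated with a small sketch that checks every name reported for a container instead of only the first. The `containerSummary` type and `findByName` function are hypothetical, and the sketch assumes that Docker reports a container's own name with a leading slash (for example `/db`) and any link aliases as longer paths (for example `/app/db`).

```go
package main

// containerSummary is a hypothetical, minimal stand-in for one entry
// in Docker's container list response.
type containerSummary struct {
	ID    string
	Names []string
}

// findByName returns the first container that has any name equal to "/"+name.
// The leading slash is an assumption about how Docker reports names.
func findByName(containers []containerSummary, name string) (containerSummary, bool) {
	want := "/" + name
	for _, c := range containers {
		// Check every name: with links, the container's own name
		// is not guaranteed to be the first element.
		for _, n := range c.Names {
			if n == want {
				return c, true
			}
		}
	}
	return containerSummary{}, false
}
```

A lookup that compared only `c.Names[0]` would miss a container whose list begins with a link alias, which is the situation described above.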