Cannot start container xxxxx: container yyyyy not found, impossible to mount its volumes #1090

Closed
frntn opened this issue Mar 11, 2015 · 12 comments

Comments

@frntn

frntn commented Mar 11, 2015

Context

I have :

  • a local docker client setup to talk to my local swarm manager
  • a local swarm manager setup to talk to 2 remote nodes node1 and node2
  • the remote nodes each run a docker daemon bound to a specific ip:port.

What I do

When using my local docker client I can manage my remote containers via swarm without any problem.
But then I have started using compose...

What I get

Everything is fine on the first docker-compose up -d run, but on almost every re-run I get this kind of error :

Cannot start container 340e6fb486471188187308b56b3122ec674f2dde8fad77c7f8532c096a074abb: container 1c90731f5d12ce0752e3e60a575e17259f46794fa88745eba65917eb2d7e32c2 not found, impossible to mount its volumes

Searching around, it turns out the first ID ("xxxxx") is on node1 while the second ID ("yyyyy") is on node2 (!)
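
For anyone who wants to script the lookup, here is a minimal sketch that pulls the two IDs out of the error message. The function name ids_from_error is made up for illustration, and it assumes the IDs are always full 64-character hex strings as in the message above, and that grep supports -o (GNU and BSD grep do).

```shell
# Print every 64-character hex ID found in the message, one per line.
# For the error above, line 1 is the container that failed to start and
# line 2 is the container whose volumes could not be mounted.
ids_from_error() {
    printf '%s\n' "$1" | grep -oE '[0-9a-f]{64}'
}
```

Each printed ID can then be searched for on the nodes.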

What I expected

Well... I'd like every layer to be created on the same node.

What I think

I'll dig into it later but I think it's an issue with the constraint:key==value environment variable

@aanand

aanand commented Mar 13, 2015

Dependent containers will be automatically co-scheduled once #972 is merged.

@frntn
Author

frntn commented Mar 15, 2015

My container wasn't linked to any others, but the layers were not on the same host.
Once the PR is merged I'll check whether it fixes the issue in some way.

Thanks.

@vieux

vieux commented Apr 8, 2015

@frntn it should be fixed if you use the latest versions of swarm and compose: swarm has the built-in dependency filter (https://github.com/docker/swarm/tree/master/scheduler/filter#dependency-filter) and #972 was merged. Can you retry?

Two containers using volumes-from (I assume it's your case) will end up on the same node.

@frntn
Author

frntn commented Apr 9, 2015

Hello @vieux and thanks for your reply.

I have retried and the problem remains.
I don't use volumes-from or data containers yet.
All my volumes are bind-mounted from the host.

Start without compose

When I use the docker run command, everything is OK :

docker run \
    --name loadbalancer \
    -p 443 \
    -e constraint:env==integ \
    -e constraint:type==dmz \
    -v /data/etc/haproxy/haproxy.cfg:/etc/haproxy/haproxy.cfg \
    ekino/haproxy:base

Start with compose

But when I use the equivalent docker-compose configuration below :

loadbalancer:
  image: ekino/haproxy:base
  ports:
    - 443:443
  environment:
    - constraint:env==integ
    - constraint:type==dmz
  volumes:
    - /data/etc/haproxy/haproxy.cfg:/etc/haproxy/haproxy.cfg

Then I get the following error message :

Cannot start container c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024: container cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4 not found, impossible to mount its volumes

Analyze

In swarm log file I see :

~: grep -B 1 c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024 swarm.log
time="2015-04-09T19:12:53+02:00" level=info msg="HTTP request received" method=POST uri="/v1.14/containers/create" 
time="2015-04-09T19:12:53+02:00" level=info msg="HTTP request received" method=GET uri="/v1.14/containers/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024/json" 
time="2015-04-09T19:12:53+02:00" level=info msg="HTTP request received" method=POST uri="/v1.14/containers/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024/start" 
time="2015-04-09T19:12:53+02:00" level=debug msg="Proxy request" method=POST url=http://x.x.x.x:2375/v1.14/containers/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024/start 
Cannot start container c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024: container cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4 not found, impossible to mount its volumes
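
The "Proxy request" debug line already says which node swarm forwarded the start to. A small sketch to extract it, assuming the log format stays exactly as shown above (proxy_target_of is a hypothetical helper name, not a swarm command):

```shell
# Read swarm log lines on stdin and print the host:port that swarm
# proxied requests to for the given container ID ($1).
proxy_target_of() {
    grep 'msg="Proxy request"' |
        grep "/containers/$1/" |
        sed -n 's|.*url=http://\([^/]*\)/.*|\1|p' |
        sort -u
}
```

Usage would be something like: proxy_target_of c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024 < swarm.log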

And when checking on the nodes I see the two layers are not on the same node :

~: for i in $(seq 1 4); do echo "==> node$i" ; ssh node$i "sudo find /var/lib/docker -name c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024 -or -name cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4"; echo; done

==> node1

==> node2
/var/lib/docker/aufs/diff/cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4
/var/lib/docker/aufs/mnt/cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4
/var/lib/docker/aufs/layers/cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4
/var/lib/docker/execdriver/native/cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4
/var/lib/docker/containers/cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4

==> node3

==> node4
/var/lib/docker/aufs/diff/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024
/var/lib/docker/aufs/mnt/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024
/var/lib/docker/aufs/layers/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024
/var/lib/docker/containers/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024
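
To condense a report like the one above into a single answer, here is a rough sketch. It assumes the report keeps the "==> nodeN" header format printed by the loop, and node_of_container is a name invented for this example.

```shell
# Read the "==> nodeN" report on stdin and print the name of each node
# whose section mentions the given container ID ($1), once per node.
node_of_container() {
    awk -v id="$1" '
        /^==> / { node = $2; next }                      # remember the current node header
        index($0, id) && !seen[node]++ { print node }    # first path containing the ID
    '
}
```

Piping the loop output into node_of_container with each ID shows directly whether the two containers landed on the same node.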

And my swarm state is :

grep -Ern "c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024|cd61f82f02acd777e8fb3b4d348b77d50a5a156692ae77229e34395470a586a4" .swarm/state/
.swarm/state/c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024.json:2:    "ID": "c5d1a59cd4e8b62ed707a7a1b8f313f4f264de0196432d3d417cea9d3914e024",

Side note : Before retrying I removed all the containers with docker rm -fv $(docker ps -aq) and wiped my .swarm/ folder, so the above is supposed to be a clean run.

@frntn
Author

frntn commented Apr 9, 2015

Digging into it right now. Think I have a lead.
I'll keep you updated

@frntn
Author

frntn commented Apr 10, 2015

I have tcpdumped the HTTP requests between :

  1. docker cli -> swarm
  2. docker compose -> swarm

(The details of the run command and the content of the YAML file are available in my comment above)

docker cli -> swarm

The command is

for i in $(seq 1 42); do
docker kill integ_reverseproxy
docker rm -f integ_reverseproxy
docker run integ_reverseproxy...
done

Every loop is successful and the workflow is the following

all the runs

POST   /v1.17/containers/integ_reverseproxy/kill?signal=KILL HTTP/1.1
DELETE /v1.17/containers/integ_reverseproxy?force=1 HTTP/1.1
POST   /v1.17/containers/4b92ff1e02f4476f988241455f41eb5f1019dc1ed087eadb8066c7d4ea0c9e5f/start HTTP/1.1

docker compose -> swarm

The command is

for i in $(seq 1 42); do
docker-compose -p integ up -d
done

The workflows are shown below (where relevant, >> is the data posted along with the request and << is the HTTP status code returned)

first up

POST /v1.14/containers/create?name=integ_reverseproxy_1 HTTP/1.1
    >> {"Tty": false, "NetworkDisabled": false, "Image": "ekino/haproxy:base", "StdinOnce": false, "AttachStdin": false, "Env": ["constraint:env==integ", "constraint:type==dmz"], "Memory": 0, "MemorySwap": 0, "ExposedPorts": {"443/tcp": {}}, "AttachStderr": false, "AttachStdout": false, "OpenStdin": false}
    << HTTP 201
GET  /v1.14/containers/bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04/json HTTP/1.1
POST /v1.14/containers/bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04/start HTTP/1.1

failed reup

It fails because the create is handled by a node other than the original one (even though the constraints point to only 1 of my 4 nodes)

POST /v1.14/containers/bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04/stop?t=10 HTTP/1.1
POST /v1.14/containers/create HTTP/1.1
GET  /v1.14/containers/6d159b9cb4d5241eec87909fa91c3b7ee8f6f4f24959c77b52af14ca63f2c747/json HTTP/1.1
POST /v1.14/containers/6d159b9cb4d5241eec87909fa91c3b7ee8f6f4f24959c77b52af14ca63f2c747/start HTTP/1.1
    >> {"VolumesFrom": ["bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04"]}
    << HTTP 406 : "Cannot start container 6d159b9cb4d5241eec87909fa91c3b7ee8f6f4f24959c77b52af14ca63f2c747: container bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04 not found, impossible to mount its volumes"

success reup

It works because the create is handled by the original node

POST /v1.14/containers/bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04/stop?t=10 HTTP/1.1
POST /v1.14/containers/create HTTP/1.1
GET  /v1.14/containers/728d245b822528bb74662fbbc301d36d9f82ff4598f9ad3d57cd77be88a68947/json HTTP/1.1
POST /v1.14/containers/728d245b822528bb74662fbbc301d36d9f82ff4598f9ad3d57cd77be88a68947/start HTTP/1.1
    >> {"VolumesFrom": ["bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04"]}
    << HTTP 204

POST   /v1.14/containers/728d245b822528bb74662fbbc301d36d9f82ff4598f9ad3d57cd77be88a68947/wait HTTP/1.1
DELETE /v1.14/containers/bf39b64fb8b47d2fe27a0735c990a35dee87d0a3faa839422f15adcec702ef04?force=False&link=False&v=False HTTP/1.1
POST   /v1.14/containers/create?name=integ_reverseproxy_1 HTTP/1.1
    >> {"Tty": false, "NetworkDisabled": false, "Image": "ekino/haproxy:base", "StdinOnce": false, "AttachStdin": false, "Env": ["constraint:env==integ", "constraint:type==dmz"], "Memory": 0, "MemorySwap": 0, "ExposedPorts": {"443/tcp": {}}, "AttachStderr": false, "AttachStdout": false, "OpenStdin": false}
    << HTTP 201
GET    /v1.14/containers/b8cae4c07891e61a4297d75b47a631f496e5413e1c53b5cc756eeb2c84b9fcc1/json HTTP/1.1
POST   /v1.14/containers/b8cae4c07891e61a4297d75b47a631f496e5413e1c53b5cc756eeb2c84b9fcc1/start HTTP/1.1
DELETE /v1.14/containers/728d245b822528bb74662fbbc301d36d9f82ff4598f9ad3d57cd77be88a68947?force=False&link=False&v=False HTTP/1.1

The workflow is internal and specific to docker compose.
I have not yet dug into the compose code to understand this HTTP request workflow : why does it first try to create a dumb container (2nd request... without the constraints)?
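
When reading captures like these, a quick way to tell whether a given create payload carries any swarm constraint is to look for "constraint:..." entries in its body. This is only a sketch: has_swarm_constraint is a hypothetical helper, and it does a plain text match on the JSON rather than parsing it.

```shell
# Read a /containers/create JSON body on stdin.
# Exit status 0 if it contains at least one "constraint:..." string, 1 otherwise.
has_swarm_constraint() {
    grep -q '"constraint:[^"]*"'
}
```

The named create body shown above passes this check. Per the observation above, the intermediate create is expected to fail it.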

@dnephin

dnephin commented Apr 10, 2015

Ah, this sounds like an issue that would be resolved by #874

@frntn
Author

frntn commented Apr 10, 2015

Indeed it seems so.
I now understand that the workflow of creating a dumb container was there to work around the lack of container renaming before docker 1.5

It is a disruptive change and it may not be that trivial...

For now I will go back to my shell scripts, and keep an eye on this issue before further compose integration in my projects. Thanks everyone !

@frntn
Author

frntn commented Apr 11, 2015

Workaround

As I now understand the issue, I have managed to apply this very simple workaround : kill and remove all the containers so that compose creates them instead of recreating them (i.e. no intermediate dumb container, without the swarm constraints, is spun up on another host)

docker-compose -p integ kill -s SIGKILL
docker-compose -p integ rm --force
docker-compose -p integ up -d
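
The three commands above can be wrapped in a small function so they always run in that order and stop at the first failure. This is just a sketch: compose_fresh_up and the COMPOSE_BIN override are names invented here, not part of compose.

```shell
# Kill, remove, then create the containers of a compose project ($1)
# from scratch. COMPOSE_BIN is an optional override of the compose
# binary (hypothetical, handy for a dry run); it defaults to docker-compose.
compose_fresh_up() {
    project="$1"
    compose="${COMPOSE_BIN:-docker-compose}"
    "$compose" -p "$project" kill -s SIGKILL &&
        "$compose" -p "$project" rm --force &&
        "$compose" -p "$project" up -d
}
```

Running COMPOSE_BIN=echo compose_fresh_up integ prints the arguments of each step without touching any container.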

Now everything's working great \o/ Thanks ! :)


I'm keeping this issue open as this is just a workaround.

@frntn
Author

frntn commented May 9, 2015

#1349 is now merged.
I will try it out ASAP

@dnephin

dnephin commented Sep 18, 2015

Any update? Was this fixed?

dnephin added the swarm label Feb 3, 2016
@dnephin

dnephin commented Feb 3, 2016

Closing since there was no response (and I believe it to be fixed).

dnephin closed this as completed Feb 3, 2016