Nomad stop exits before allocations have exited #1225

Closed · ghost opened this issue Jun 1, 2016 · 7 comments
Labels: theme/docs (Documentation issues and enhancements), type/bug

ghost commented Jun 1, 2016

Nomad version

Nomad v0.3.2

Operating system and Environment details

Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4 (2016-02-29) x86_64 GNU/Linux

Issue

I'm experiencing issues where nomad stop exits before the Docker container has exited. This means that when we then run nomad run, the Nomad client experiences a port collision because the old container is still running. This appears to be related to containers that cannot or do not exit on SIGTERM and require a subsequent SIGKILL after Nomad's kill timeout (as shown in the logs below).
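
In the meantime I could imagine working around this by polling the job's allocations until none of them report a running client status before submitting the new job. A minimal sketch (not something we run today; it assumes jq is available, the agent listens on the default address, and it uses the same /v1/job/... allocations endpoint that shows up in the logs below):

NOMAD_ADDR=${NOMAD_ADDR:-http://127.0.0.1:4646}

nomad stop example

# Wait until no allocation of the job reports ClientStatus "running",
# i.e. the old container has actually been killed and its ports are free.
while curl -s "$NOMAD_ADDR/v1/job/example/allocations" \
    | jq -e '.[] | select(.ClientStatus == "running")' > /dev/null; do
  sleep 1
done

nomad run example.nomad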

Nomad Client logs (if appropriate)

Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] http: Request /v1/jobs?prefix=example (2.572682ms)
Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] http: Request /v1/job/example (2.090988ms)
Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] http: Request /v1/job/example (15.035108ms)
Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] http: Request /v1/evaluation/7db80b44-dcc9-79d6-d4ee-1c36832de9c0 (1.28188ms)
Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] http: Request /v1/evaluation/7db80b44-dcc9-79d6-d4ee-1c36832de9c0/allocations (920.807µs)
Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] client: updated allocations at index 10059 (pulled 1) (filtered 19)
Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 19)
Jun  1 14:15:19 nomad[1335]: 2016/06/01 14:15:19 [DEBUG] client: updated allocations at index 10061 (pulled 0) (filtered 20)
Jun  1 14:15:20 nomad[1335]: 2016/06/01 14:15:20 [DEBUG] client: updated allocations at index 10062 (pulled 0) (filtered 20)
Jun  1 14:15:20 nomad[1335]: 2016/06/01 14:15:20 [DEBUG] http: Request /v1/evaluation/7db80b44-dcc9-79d6-d4ee-1c36832de9c0 (1.850422ms)
Jun  1 14:15:20 nomad[1335]: 2016/06/01 14:15:20 [DEBUG] http: Request /v1/evaluation/7db80b44-dcc9-79d6-d4ee-1c36832de9c0/allocations (938.222µs)


// Nomad stop returns here, but container is still running


Jun  1 14:15:24 docker[1264]: time="2016-06-01T14:15:24.739442325Z" level=info msg="Container cfd81a52b31dbf6d8dc9068e6cc8da4f5fda015db5e9995c9c84d1df64fb5023 failed to exit within 5 seconds of signal 15 - using the force"
Jun  1 14:15:24 docker[1264]: time="2016-06-01T14:15:24.799329877Z" level=error msg="restartmanager canceled"
Jun  1 14:15:24 kernel: [ 5528.132996] docker0: port 1(veth75e91bf) entered disabled state
Jun  1 14:15:24 systemd-sysctl[5158]: Overwriting earlier assignment of net/ipv4/ip_forward in file '/etc/sysctl.d/99-salt.conf'.
Jun  1 14:15:24 kernel: [ 5528.145357] docker0: port 1(veth75e91bf) entered disabled state
Jun  1 14:15:24 kernel: [ 5528.153813] device veth75e91bf left promiscuous mode
Jun  1 14:15:24 kernel: [ 5528.159586] docker0: port 1(veth75e91bf) entered disabled state
Jun  1 14:15:24 nomad[1335]: 2016/06/01 14:15:24 [INFO] driver.docker: stopped container cfd81a52b31dbf6d8dc9068e6cc8da4f5fda015db5e9995c9c84d1df64fb5023
Jun  1 14:15:24 nomad[1335]: 2016/06/01 14:15:24 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 20)
Jun  1 14:15:24 consul[662]: 2016/06/01 14:15:24 [INFO] agent: Deregistered service 'nomad-registered-service-b7f6b211-9b7b-926c-0f2c-6c0dfee51a33-example-669b212cd4a8b30f21f6558ee09b91d5546c64cd'
Jun  1 14:15:24 nomad[1335]: 2016/06/01 14:15:24 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 20)
Jun  1 14:15:24 nomad[1335]: 2016/06/01 14:15:24 [DEBUG] plugin: /usr/local/bin/nomad: plugin process exited
Jun  1 14:15:24 docker[1264]: time="2016-06-01T14:15:24.915901330Z" level=error msg="Handler for POST /containers/cfd81a52b31dbf6d8dc9068e6cc8da4f5fda015db5e9995c9c84d1df64fb5023/stop returned error: Container cfd81a52b31dbf6d8dc9068e6cc8da4f5fda015db5e9995c9c84d1df64fb5023 is already stopped"
Jun  1 14:15:25 nomad[1335]: 2016/06/01 14:15:25 [DEBUG] client: updated allocations at index 10063 (pulled 0) (filtered 20)
Jun  1 14:15:25 nomad[1335]: 2016/06/01 14:15:25 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 20)
diptanu (Contributor) commented Jun 3, 2016

@grobinson-blockchain That's how the Docker driver works: first it sends SIGTERM, and after the timeout it sends SIGKILL to the container if it's still running.

nomad stop is not supposed to block until the container exits; it simply changes the desired state of the job, and the client acts on that soon after. The scheduler doesn't consider the resources used by the old allocations freed until the client informs it that the task has been killed successfully.
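
(For reference, the 5-second window in your log is the task's kill_timeout, which defaults to 5s.) You can watch the gap between the two states directly; a rough sketch against the allocations endpoint from your logs (assuming the default agent address, with field names as returned by the allocation list API):

curl -s http://127.0.0.1:4646/v1/job/example/allocations \
  | jq '.[] | {ID, DesiredStatus, ClientStatus}'

DesiredStatus flips to "stop" as soon as the scheduler processes the evaluation, which is when nomad stop returns; ClientStatus stays "running" until the client has actually killed the container.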

ghost (Author) commented Jun 3, 2016

nomad stop is not supposed to block until the container exits; it simply changes the desired state of the job, and the client acts on that soon after.

@diptanu It's just that the documentation says otherwise (https://www.nomadproject.io/docs/commands/stop.html):

Upon successful deregistration, an interactive monitor session will start to display log lines as the job unwinds its allocations and completes shutting down. The monitor will exit once all allocations are stopped and the job has reached a terminal state.

diptanu (Contributor) commented Jun 3, 2016

@grobinson-blockchain Sorry about that, we will update the docs.

What actually happens here is that Nomad creates an evaluation when the stop API is called and waits until the scheduler processes it. Once the evaluation has been processed and the allocations' desired status has been updated, nomad stop returns.
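
You can see that boundary with the evaluation endpoint from your logs; a quick sketch using the eval ID captured there (assuming the default agent address):

curl -s http://127.0.0.1:4646/v1/evaluation/7db80b44-dcc9-79d6-d4ee-1c36832de9c0 \
  | jq '{ID, Status}'

Once Status reads "complete", the CLI returns, regardless of whether the old container has exited yet.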

diptanu added the theme/docs (Documentation issues and enhancements) and type/bug labels Jun 3, 2016
camerondavison (Contributor) commented

As far as "experiences a port collision because the old container is still running" goes, that seems like a bug, assuming the port was allocated as part of the Nomad task. I guess it is possible that it was not. That bug, though, seems related to #1191.

ghost (Author) commented Jun 8, 2016

@a86c6f7964 Yes, the port allocation was from a previous allocation of the same Nomad job. I'm simply updating the Docker image to run a later build.

dadgar (Contributor) commented Nov 4, 2016

@grobinson-blockchain Thank you!

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 18, 2022