Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes nomad may place task with docker driver on host where docker not avaliable #5485

Open
tantra35 opened this issue Mar 28, 2019 · 5 comments
Labels
stage/needs-verification Issue needs verifying it still exists theme/client type/bug

Comments

@tantra35
Copy link
Contributor

tantra35 commented Mar 28, 2019

Nomad version

Nomad v0.8.6 (ab54ebc+CHANGES)

Issue

In our environment we have hosts where docker not installed, an we observed that nomad may place tack with docker driver on nodes where docker driver not present

Here is example for node with ID 8be15688 (there is only exec,raw_exec drivers presnt on it)

ruslan@ip-172-30-0-53:~$ nomad status 8be15688
ID            = 8be15688
Name          = ip-172-30-2-131
Class         = openresty
DC            = aws
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 2178h36m4s
Driver Status = exec,raw_exec

Node Events
Time                       Subsystem  Message
2019-01-30T17:22:22+03:00  Cluster    Node reregistered by heartbeat
2019-01-30T17:22:22+03:00  Cluster    Node heartbeat missed
2018-12-27T22:03:12+03:00  Cluster    Node registered

Allocated Resources
CPU           Memory           Disk            IOPS
800/5900 MHz  1.9 GiB/3.4 GiB  300 MiB/14 GiB  0/0

Allocation Resource Utilization
CPU            Memory
1109/5900 MHz  743 MiB/3.4 GiB

Host Resource Utilization
CPU            Memory           Disk
1507/6000 MHz  1.3 GiB/3.6 GiB  1.6 GiB/15 GiB

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created     Modified
307d08bd  8be15688  dynamotoelastic  22       run      failed   6m46s ago   6m8s ago
0f374e3c  8be15688  nginx            2        run      running  3mo18h ago  2mo27d ago

As you can see allocation 307d08bd have failed state, with follow events for it

Recent Events:
Time                       Type            Description
2019-03-28T16:32:01+03:00  Not Restarting  Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2019-03-28T16:32:01+03:00  Driver Failure  failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-41d7-164c00e4c99e": Failed to pull `027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17`: dial unix /var/run/docker.sock: connect: no such file or directory
2019-03-28T16:32:01+03:00  Driver          Downloading image 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17
2019-03-28T16:31:43+03:00  Restarting      Task restarting in 18.228554529s
2019-03-28T16:31:43+03:00  Driver Failure  failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-41d7-164c00e4c99e": Failed to pull `027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17`: dial unix /var/run/docker.sock: connect: no such file or directory
2019-03-28T16:31:43+03:00  Driver          Downloading image 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17
2019-03-28T16:31:28+03:00  Restarting      Task restarting in 15.223446572s
2019-03-28T16:31:28+03:00  Driver Failure  failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-41d7-164c00e4c99e": Failed to pull `027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17`: dial unix /var/run/docker.sock: connect: no such file or directory
2019-03-28T16:31:28+03:00  Driver          Downloading image 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoelastic:p17
2019-03-28T16:31:28+03:00  Task Setup      Building Task Directory

this allocation require docker driver and it is logical that it could not run on the node without Docker, finally nomad reschedule policy can place task on right node, but i think that is fully wrong that it even not must trying to place task with docker dependency on node without docker driver

@schmichael
Copy link
Member

You're absolutely right that this should not happen! Thanks for the report!

Classifying this as a client bug for now as my suspicion is that it's a bug when reregistering the node after it missed a heartbeat.

Does this only occur on nodes with "heartbeat missed" node events? Do you have client logs from one of the time periods when the heartbeat was missed?

@tantra35
Copy link
Contributor Author

tantra35 commented Mar 28, 2019

@schmichael I doesn't think that this is dues missed heatbeats. On node this happens at 30 January 2019

2019-01-30T17:22:22+03:00  Cluster    Node heartbeat missed

But when we observe this issue was 28 March 2019, so between this event about 2 month of difference

And here logs from client 8be15688(where allocation was placed and where no docker driver). We took time from 2019-03-28 15:00 which i think enough, to capture events, that was thrown when issue happened

ruslan@ip-172-30-2-131:~$ sudo journalctl -u nomad -S"2019-03-28 15:00"
-- Logs begin at Fri 2019-03-08 07:37:08 MSK, end at Fri 2019-03-29 00:11:04 MSK. --
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]:     2019/03/28 16:31:28.338687 [ERR] driver.docker: failed pulling container 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoe
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]:     2019/03/28 16:31:28.339683 [WARN] client: error from prestart: failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:31:28.464+0300 [DEBUG] plugin: starting plugin: path=/opt/nomad/nomad_0.8.4/nomad args="[/opt/nomad/nomad_0.8.4/nomad executor
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:31:28.466+0300 [DEBUG] plugin: waiting for RPC address: path=/opt/nomad/nomad_0.8.4/nomad
Mar 28 16:31:28 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:31:28.565+0300 [DEBUG] plugin.nomad: plugin address: timestamp=2019-03-28T16:31:28.565+0300 address=/tmp/plugin240231881 networ
Mar 28 16:31:43 ip-172-30-2-131 nomad.sh[1290]:     2019/03/28 16:31:43.563837 [ERR] driver.docker: failed pulling container 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoe
Mar 28 16:31:43 ip-172-30-2-131 nomad.sh[1290]:     2019/03/28 16:31:43.563867 [WARN] client: error from prestart: failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]:     2019/03/28 16:32:01.793287 [ERR] driver.docker: failed pulling container 027482006778.dkr.ecr.us-east-1.amazonaws.com/playrix/dynamodbtoe
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]:     2019/03/28 16:32:01.793323 [WARN] client: error from prestart: failed to initialize task "dynamotoelastic" for alloc "307d08bd-12c4-5951-
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:32:01.811+0300 [DEBUG] plugin.nomad: 2019/03/28 16:32:01 [ERR] plugin: plugin server: accept unix /tmp/plugin240231881: use of
Mar 28 16:32:01 ip-172-30-2-131 nomad.sh[1290]: 2019-03-28T16:32:01.812+0300 [DEBUG] plugin: plugin process exited: path=/opt/nomad/nomad_0.8.4/nomad

@schmichael
Copy link
Member

@tantra35 Interesting, thanks for posting that. Can you grep for other log lines containing "docker" -- especially directly before the Docker task was scheduled on the node?

Have these hosts ever had Docker installed?

@tantra35
Copy link
Contributor Author

tantra35 commented Mar 29, 2019

@schmichael we hev actual logs from 08-03-2019 (8 March 2019)

ruslan@ip-172-30-2-131:~$ sudo journalctl -u nomad --no-pager
-- Logs begin at Fri 2019-03-08 07:37:08 MSK, end at Fri 2019-03-29 14:25:12 MSK. --

too old logs already rotated, and there only messages contains "docker" that I mention before, no other.

And on those host docker never was been installed

@notnoop
Copy link
Contributor

notnoop commented Sep 26, 2019

@tantra35 It fear like #6227 might be different. Would you be able to confirm dynamotoelastic taskgroup allocation had multiple tasks requiring exec or raw_exec driver?

#6227 fixes the cases with multiple required drivers for the alloc - I'm unaware of bugs affecting allocs with a single required driver.

@tgross tgross added stage/needs-verification Issue needs verifying it still exists and removed stage/needs-investigation labels Jan 26, 2021
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stage/needs-verification Issue needs verifying it still exists theme/client type/bug
Projects
Status: Needs Roadmapping
Development

No branches or pull requests

4 participants