Unable to stop a job's allocations #9376

Closed

ghost opened this issue Nov 17, 2020 · 7 comments

@ghost commented Nov 17, 2020

Nomad version

1.0.0-beta3

Operating system and Environment details

Amazon Linux 2
Linux ip-172-31-3-147.service.consul 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

Unable to stop a Nomad job's allocations: allocations whose desired status is "stop" remain in the running state.

$ nomad job status document
ID            = document
Name          = document
Submit Date   = 2020-11-15T23:53:41-05:00
Type          = service
Priority      = 50
Datacenters   = airside-stage
Namespace     = default
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
document    0       0         2        0       2         0

Latest Deployment
ID          = d90b6893
Status      = failed
Description = Failed due to progress deadline - rolling back to job version 39

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
document    true         2        1       1        0          2020-11-17T13:19:48Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
19d6d865  97e475a7  document    40       stop     complete  17m57s ago  7m9s ago
b7936b1e  97e475a7  document    38       stop     running   1d33m ago   17m58s ago
8ce211b5  12acf690  document    38       stop     running   1d8h ago    17m58s ago

$ nomad alloc stop b7936b1e
==> Monitoring evaluation "8041a0fd"
    Evaluation triggered by job "document"
    Evaluation within deployment: "d90b6893"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "8041a0fd" finished with status "complete"

$ nomad alloc stop 8ce211b5
==> Monitoring evaluation "81877150"
    Evaluation triggered by job "document"
    Evaluation within deployment: "d90b6893"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "81877150" finished with status "complete"

$ nomad job status document
ID            = document
Name          = document
Submit Date   = 2020-11-15T23:53:41-05:00
Type          = service
Priority      = 50
Datacenters   = airside-stage
Namespace     = default
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
document    0       0         2        0       2         0

Latest Deployment
ID          = d90b6893
Status      = failed
Description = Failed due to progress deadline - rolling back to job version 39

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
document    true         2        1       1        0          2020-11-17T13:19:48Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
19d6d865  97e475a7  document    40       stop     complete  18m30s ago  7m42s ago
b7936b1e  97e475a7  document    38       stop     running   1d33m ago   18m31s ago
8ce211b5  12acf690  document    38       stop     running   1d8h ago    18m31s ago

Let me know if you need anything else to help with this. I also opened #9375.

@notnoop (Contributor) commented Nov 17, 2020

Thanks @oopstrynow for the details; we'll investigate. I've edited the formatting of the bash snippets for clarity.

It'd be useful to include the nomad alloc status output for the allocations that are supposed to stop but are still running.
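
For example, something along these lines should capture it (allocation IDs taken from the job status output above; -verbose only adds full IDs and more event detail):

$ nomad alloc status -verbose b7936b1e
$ nomad alloc status -verbose 8ce211b5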

@ghost (Author) commented Nov 17, 2020

Here is an allocation with a dead parent task and a sidecar task that is still running.

$ nomad alloc status 8cb64f91
ID                   = 8cb64f91-bee5-fae1-af5d-dbc2ef7c0c83
Eval ID              = fed9dee9
Name                 = traefik.traefik[0]
Node ID              = 882c6bed
Node Name            = ip-172-31-1-127.service.consul
Job ID               = traefik
Job Version          = 9
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being updated due to job update
Created              = 1d15h ago
Modified             = 2h52m ago
Replacement Alloc ID = babc19de

Task "filebeat-sidecar" (prestart sidecar) is "running"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  500 MiB  300 MiB  http: 172.31.1.127:5067

Task Events:
Started At     = 2020-11-16T04:00:04Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type              Description
2020-11-17T06:39:25-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2020-11-17T06:39:25-05:00  Leader Task Dead  Leader Task in Group dead
2020-11-17T06:39:25-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2020-11-17T06:39:25-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2020-11-17T06:39:25-05:00  Leader Task Dead  Leader Task in Group dead
2020-11-15T23:00:04-05:00  Started           Task started by client
2020-11-15T23:00:03-05:00  Task Setup        Building Task Directory
2020-11-15T23:00:03-05:00  Received          Task received by client
2020-11-15T17:44:58-05:00  Task Setup        Building Task Directory
2020-11-15T17:44:57-05:00  Received          Task received by client

Task "traefik" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  128 MiB  300 MiB  http: 172.31.1.127:8888
                           api: 172.31.1.127:8081

Task Events:
Started At     = 2020-11-15T22:44:59Z
Finished At    = 2020-11-17T11:39:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2020-11-17T06:39:25-05:00  Killing     Sent interrupt. Waiting 5s before force killing
2020-11-17T06:39:25-05:00  Killing     Sent interrupt. Waiting 5s before force killing
2020-11-17T06:39:25-05:00  Killed      Task successfully killed
2020-11-17T05:41:03-05:00  Killing     Sent interrupt. Waiting 5s before force killing
2020-11-15T23:00:04-05:00  Started     Task started by client
2020-11-15T23:00:03-05:00  Task Setup  Building Task Directory
2020-11-15T23:00:03-05:00  Received    Task received by client
2020-11-15T17:44:59-05:00  Started     Task started by client
2020-11-15T17:44:58-05:00  Task Setup  Building Task Directory
2020-11-15T17:44:57-05:00  Received    Task received by client

@notnoop (Contributor) commented Nov 17, 2020

Thanks for the info. I'm unable to reproduce this case, I'm afraid. We'd like a couple more pieces of information:

If you have a Nomad client still running with one of these stuck jobs, can you try to kill it with SIGQUIT or SIGABRT (e.g. kill -SIGABRT <nomad agent pid>)? Having the logs as well as the stacktrace that gets generated would be helpful for both issues.

Also, I noticed that filebeat-sidecar is marked as a prestart sidecar, but in #34196 the task didn't have a lifecycle stanza. Was the lifecycle stanza dropped during redaction? If not, I suspect we have a parsing problem.
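
For reference, a prestart sidecar is normally declared with a lifecycle block like the minimal sketch below; the task name matches the one above, but everything else is only a placeholder:

task "filebeat-sidecar" {
  driver = "docker"

  # marks this task as a prestart sidecar: it is started before the main task
  # and kept running for the lifetime of the allocation
  lifecycle {
    hook    = "prestart"
    sidecar = true
  }

  # image, config, resources, etc. as in the real job
}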

@ghost (Author) commented Nov 17, 2020

I have a Nomad client still running with one of the filebeat-sidecar Docker jobs.

44e85defb82f        airside-docker.jfrog.io/base-images/nomad-filebeat:7.9.3                   "/usr/local/bin/dock…"   43 hours ago        Up 43 hours         172.31.1.127:5067->5067/tcp, 172.31.1.127:5067->5067/udp                                     filebeat-sidecar-8cb64f91-bee5-fae1-af5d-dbc2ef7c0c83

I tried $ docker kill 44e85defb82f but that just ended up restarting the job. I am not familiar with the SIGQUIT or SIGABRT commands.

[root@ip-172-31-1-127 ~]# tail -1000 /var/log/messages | grep "failed to exit"
Nov 17 17:55:54 ip-172-31-1-127 dockerd: time="2020-11-17T17:55:54.274371610Z" level=info msg="Container e34b0d412318f5613c6c9abb09c7d4ed9bea26f802141a36520028cb7525bd73 failed to exit within 0 seconds of signal 15 - using the force"
Nov 17 17:55:59 ip-172-31-1-127 dockerd: time="2020-11-17T17:55:59.848302660Z" level=info msg="Container 44e85defb82f failed to exit within 10 seconds of kill - trying direct SIGKILL"
Nov 17 17:56:05 ip-172-31-1-127 dockerd: time="2020-11-17T17:56:05.276228633Z" level=info msg="Container e34b0d412318 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Nov 17 17:56:57 ip-172-31-1-127 dockerd: time="2020-11-17T17:56:57.639890290Z" level=info msg="Container e34b0d412318f5613c6c9abb09c7d4ed9bea26f802141a36520028cb7525bd73 failed to exit within 0 seconds of signal 15 - using the force"
Nov 17 17:57:08 ip-172-31-1-127 dockerd: time="2020-11-17T17:57:08.411040034Z" level=info msg="Container e34b0d412318 failed to exit within 10 seconds of kill - trying direct SIGKILL"

@notnoop (Contributor) commented Nov 30, 2020

Thanks @oopstrynow, that's pretty helpful. For the SIGABRT part I mistyped: I meant run kill -SIGABRT <nomad pid>, where <nomad pid> is the pid of the main nomad process. pkill -SIGABRT nomad would do too (but it may leave leaked processes, so you may want to restart the machine afterwards). The nomad process will output a Go stacktrace that will be very helpful for debugging the issue.
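
A minimal sketch of what that could look like, assuming the agent runs as root under a systemd unit named nomad (adjust the pid lookup and log collection to your setup):

# assumption: the Nomad agent runs under a systemd unit named "nomad"
$ sudo kill -SIGABRT "$(pgrep -o -x nomad)"     # or: sudo pkill -SIGABRT nomad
# the Go runtime prints a stack dump of all goroutines to stderr and the agent exits;
# under systemd that output ends up in the journal
$ journalctl -u nomad --since "10 minutes ago" > nomad-stacktrace.log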

@tgross (Member) commented Dec 14, 2020

Closing as per #9375 (comment)

tgross closed this as completed Dec 14, 2020
@github-actions (bot) commented
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2022