Constraint/count is not respected after Nomad cluster restart (previously failed allocs) #5921
Comments
I have verified this (even on a small cluster with a few clients) against Nomad 0.9.3.
@jozef-slezak, I was not able to reproduce this with the attached latest build of Nomad. This build includes the following PR, which may have been the fix for this. If you are willing to test with this, it would be helpful.
Sure, I can test that on Monday.
It definitely does NOT work as expected. I reproduced the same behavior using the binary from https://github.com/hashicorp/nomad/files/3364021/11afd990a7b76a9909c2b0328f17381ad3d27bef.zip. After doing that, I believe I better understand what is happening. My previously failed job allocations were started again after the cluster restart (this seems to be the root cause - please try fixing here). My job count is 22, but Nomad started 86 allocations, which caused an out-of-memory condition (the OOM killer was not configured, so one server is no longer responding). It did not even respect the memory quota (not only the count/constraint).
Steps to reproduce:
Step 1. I submitted a job (first version) that failed after start (I see an application error in the logs).
Step 5. I waited until Nomad stopped on all servers (one Nomad instance, the leader, was killed) and then started it again.
Step 6. Nomad started too many allocations (86, which caused OOM killing...).
Step 3 logs (HCL job count 10 -> 12):
Step 4 logs:
Later that day I checked cluster status:
Even later that day:
In my opinion, Nomad should not start allocs "blindly"; it should check count/constraint/memory first. I am thinking about a workaround (minimizing the number of allocs that need to be garbage collected), sketched below:
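A minimal sketch of the kind of client-side GC tuning this workaround points at, assuming the standard gc_* options in the Nomad client stanza (values are illustrative, not taken from this cluster's config):

```hcl
# Illustrative client agent config: tighten garbage collection so fewer
# terminal (failed/complete) allocations linger on a node and get restored
# after a restart.
client {
  enabled = true

  gc_interval   = "1m" # scan for terminal allocations more frequently
  gc_max_allocs = 20   # trigger GC once this many allocations are tracked on the node
}
```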
Today, I was able to simulate this buggy behavior on a single node (just by using
Similar to #5919, the latest build of Nomad
I am going to retest with
nomad version
Job Count=1
Thanks, @jozef-slezak, I'm looking at this right now.
Today I found a place in the code (alloc/task runner) that causes a problem in combination with sudo reboot / KillMode cgroup. Let me explain my hypothesis (if it makes sense to you, I can make a small pull request):
I was checking your e2e tests and I think I can reuse some parts to isolate this behavior (and maybe to test a fix/PR).
Okay, here's what I've been able to do. I'm running the same version of Nomad:
It's a single-node cluster running both client and server. I've registered three jobs.
I run all three, see one allocation per job. Once they're up and running, I
There are still only three allocations (same IDs), and they've all restarted. Status looks like the following (don't mind the
There are three actual instances of the task running (one per job, as expected):
This is all about what I would expect: the client can't reconnect to the tasks because they were killed, it waits to contact the server (per current expected behavior), and then it restarts the tasks once the previous allocations are verified with the server. Do you see anything I've done differently from your case?
Yes, I can see one difference: before the restart, alloc e65cef3b has desired=run, status=running and alloc dc64d219 has desired=run, status=failed. Then I restart.
After the reboot I can see two allocs running (the problem is that the second alloc dc64d219 is also restored, even though Count=1 in the job definition):
Thanks, I'll file this in our tracker.
@jozef-slezak, is it possible for you to share your job spec for this job? If you'd rather not post it publicly, you can send it to nomad-oss-debug@hashicorp.com, where it will only be viewable by Nomad developers.
Sure, I will do that (probably later today).
I'm concerned that the previously failed allocations still have desired=run.
My procedure was:
Any additional help that you can give for reproducing this is very much appreciated. If this bug is still occurring, I want to make sure that we fix it. Thank you for your help, @jozef-slezak.
Also, can you reproduce this from a clean cluster on 0eae387?
Hello, I am attaching the HCL that you asked for:
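For readers without the attachment, a minimal sketch of the kind of service job discussed in this thread (illustrative only, not the actual attached spec; the names, count, and resources are made up):

```hcl
# Illustrative job: a single long-running instance of a binary. The group
# "count" is the limit that was observed being exceeded after cluster restarts.
job "dummyapp" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 1

    restart {
      attempts = 2
      interval = "5m"
      delay    = "30s"
      mode     = "fail"
    }

    task "dummyapp" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/dummyapp"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```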
Version 0.9.4-dev definitely behaves better. I can confirm that state had been written using v0.9.1 and later v0.9.3 (before I upgraded to 0.9.4-dev). I will definitely try to reproduce this on Monday as you suggested (clean cluster on 0eae387). If there is still a problem, I am willing to share my screen if you like. By the way, what is the plan for the official 0.9.4 release?
@cgbaker, I was able to retest everything in an isolated environment (3-node cluster, AWS, dummyapp). I am attaching everything: job HCL, Nomad HCL, systemd service file, and the simple source code of a dummyapp (you could probably test with redis or any other executable binary), plus the steps to reproduce below. I can confirm that 0.9.4-dev behaves much better.
I was testing the reschedule/failover of jobs/processes as before. The behavior I see is: "for a short period of time there are more jobs/processes than the configured count of jobs in the cluster". I believe the Nomad client restores the jobs (even if the PID does not exist anymore), then it connects with the server, and then suddenly there are no processes anymore. Please check the steps on how to reproduce; every time I repeat these steps I am able to reproduce this behavior.
Reproduction steps
Job file (if appropriate)
I want to mention that I intentionally did not configure KillSignal=SIGINT; systemd effectively killed the dummyapps, which caused the failed allocations.
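For context, a sketch of the systemd side of this (not the actual unit file from this cluster; paths and values are illustrative):

```ini
# nomad.service (illustrative). With systemd's default KillMode=control-group
# and no KillSignal set, stopping the unit (for example during "sudo reboot")
# signals every process in Nomad's cgroup, so exec/raw_exec tasks such as the
# dummyapp are killed together with the agent. The commonly published Nomad
# unit files use KillMode=process and KillSignal=SIGINT instead, so tasks are
# not killed along with the agent.
[Unit]
Description=Nomad agent

[Service]
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
# KillMode=process
# KillSignal=SIGINT
Restart=on-failure

[Install]
WantedBy=multi-user.target
```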
Thanks, @jozef-slezak, will parse this after ☕️
Did the Nomad client restart those jobs? When I tested this, my Nomad client does not restart the jobs... the following is in my logs:
This was one of the recent changes, where Nomad will not restart a task until it has contacted the server. The 0.9.4 RC is scheduled for the end of this week.
@cgbaker Nomad client+server was killed by
@jozef-slezak, I'm not able to reproduce this. As in my previous comment, when I restart the client, it does not restart the processes associated with the previous allocations until it has contacted the servers. Also, this single GitHub issue has spanned multiple different problems and multiple versions of Nomad. If you are amenable, I would like to close this issue and have you test the current concern with the 0.9.4 RC (which was planned for tomorrow, but will happen very soon).
@jozef-slezak Thank you for your patience here. Looking at the notes, I suspect you are hitting #5984, which is addressed in #6207 and #6216. Would you be able to test those PRs in your cluster by any chance?
Sure, I can retest the PRs. Please send me a link to the compiled binary (ideally including the UI).
We are restarting a Nomad cluster (3 Nomad servers and tens of Nomad clients, all physical machines). Same test scenario as #5917 and #5908, but a different bug report.
Please check if this is related to #5669
Nomad version
0.9.3
Operating system and Environment details
Linux CentOS
Issue
After restarting the whole cluster (not every time, but it can be reproduced by repeating restarts) we see a higher number of instances than expected (92 instead of Count = 70). Nomad does not respect counts and constraints.
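A sketch of the stanzas meant by "counts and constraints" (a fragment of a job file; the group name, count, and constraint are illustrative, not the actual spec):

```hcl
# Illustrative fragment (goes inside a job block): the scheduler should place
# at most "count" instances of this group and should honor the constraint,
# but after the full-cluster restart 92 instances were running instead of 70.
group "workers" {
  count = 70

  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }
}
```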
Reproduction steps
Restart all nomad servers and clients (sudo reboot).
Check the Nomad console - it seems that some allocations were "started twice".