Nomad version
Nomad v0.4.0-rc1 ('3c578fccde793a515bd1640c530f8df888a63b45')
Operating System
CentOS 7
Issue
I stood up 3 server nodes and 10 (16-core) client nodes. Once the nodes came up, I submitted 5 jobs, each with a task group count of 100. Two of the jobs moved to the running state and ran to completion (ending up in the dead state). The other 3 jobs switched to the dead state without ever entering the running state. I have included the verbose status for two of the stuck jobs below.
It looks to me as though there is an issue in the max-plan-attempts logic released in 0.4.0-rc1. Have you considered the case of a Nomad cluster running in a cloud provider and set up to auto-scale, starting client nodes as jobs are queued? How should the max-plan-attempts logic work when jobs are submitted to a cluster that has no client nodes?
I can "kick start" the dead jobs that have no completions (stuck in a blocked max-plan-attempts state) by using the HTTP API to force an evaluation (.../v1/job/[job-id]/evaluation). Once I do this, the jobs switch from dead to running and eventually complete.
Nomad Status Output
nomad status -verbose test6
ID = test6
Name = test-06
Type = batch
Priority = 50
Datacenters = dc1
Status = dead
Periodic = false
Evaluations
ID                                    Priority  Triggered By       Status    Placement Failures
7441818d-0391-b0f2-99a9-afaae43fa0b5  50        max-plan-attempts  failed    false
acc43f15-32c3-0817-cd79-f70d67bb7523  50        max-plan-attempts  canceled  false
fa6a22fa-d59c-da86-99c1-0b45a1bfe6c6  50        job-register       failed    true
5347dbb1-650c-6493-673a-8abd8a166093  50        job-register       failed    true
Allocations
No allocations placed
nomad status -verbose test8
ID = test8
Name = test-08
Type = batch
Priority = 50
Datacenters = dc1
Status = dead
Periodic = false
Evaluations
ID                                    Priority  Triggered By       Status    Placement Failures
ad4616eb-5ad0-6d21-af97-2740eebfd710  50        max-plan-attempts  failed    false
d58adc64-557f-8145-a83e-6d1f2aa40d37  50        job-register       failed    true
9b73cda3-084c-08af-50b9-e0a102a05fce  50        job-register       complete  true
Allocations
No allocations placed
Job file
Max-plan-attempts occurs when the schedulers make placements that are rejected by the leader several times in a row. This prevents the schedulers from spinning endlessly when there is too much cluster contention.
We retry those failed evaluations at one-minute intervals so that if contention is reduced those jobs can make progress again.
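To check where one of those evaluations stands in the meantime, a quick sketch using the read-evaluation endpoint (/v1/evaluation/:eval_id); the default port 4646 and the jq filter are assumptions for a local agent:

# Inspect one of the failed evaluations from the status output above
curl -s http://localhost:4646/v1/evaluation/7441818d-0391-b0f2-99a9-afaae43fa0b5 \
  | jq '{Status, TriggeredBy, StatusDescription}'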