While testing out a netsplit scenario for #11892 I ran into buggy behavior around the stop_after_client_disconnect field. I have the following jobspec running on three clients (note that no CSI is involved here, to reduce variables):
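(The exact jobspec isn't reproduced here; the sketch below is a minimal jobspec consistent with the outputs further down: job httpd, group web with count 3, bridge networking with a dynamic port labelled www mapped to 8001, and a Docker task named http with 128 MHz / 128 MiB resources. The image, command, and the stop_after_client_disconnect duration are assumptions, not values taken from the original repro.)

job "httpd" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 3

    # Group-level field under test; the duration used in the original repro is unknown.
    stop_after_client_disconnect = "90s"

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {
      driver = "docker"

      config {
        # Assumed image/command: a small HTTP server listening on the mapped port.
        image   = "busybox:1"
        command = "httpd"
        args    = ["-f", "-p", "8001"]
        ports   = ["www"]
      }

      resources {
        cpu    = 128
        memory = 128
      }
    }
  }
}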
Forcibly disconnect one of the clients by running: sudo iptables -A OUTPUT -p tcp --dport 4647 -j DROP, which results in the expected RPC errors and eventually the node getting marked as down:
2022-01-26T20:33:27.510Z [ERROR] client: error querying node allocations: error="rpc error: failed to get conn: dial tcp 192.168.56.20:4647: i/o timeout"
2022-01-26T20:33:27.510Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.UpdateStatus server=192.168.56.20:4647
2022-01-26T20:33:27.510Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.UpdateStatus server=192.168.56.20:4647
2022-01-26T20:33:27.510Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: rpc error: lead thread didn't get connection" period=16.919215002s
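For reference, the transition to down can be confirmed from one of the still-connected servers, and the DROP rule can later be removed to let the client reconnect; these are standard commands sketched here rather than taken from the original report:

# On a server that is still reachable: the Status column eventually shows the node as down.
$ nomad node status

# On the disconnected client: remove the rule added above to restore connectivity.
$ sudo iptables -D OUTPUT -p tcp --dport 4647 -j DROP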
The allocation on that client is stopped, as we want, with the following log entry on the disconnected client:
2022-01-26T20:31:20.975Z [DEBUG] client: stopping alloc for stop_after_client_disconnect: alloc=63722979-1fc1-d683-e4dc-664ea93260d1
So far so good. But then a replacement allocation is started and almost immediately exits, getting marked complete a few seconds later:
$ nomad job status httpd
ID = httpd
Name = httpd
Submit Date = 2022-01-26T20:30:50Z
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         2        0       2         0
Latest Deployment
ID = 2eb4e3fd
Status = successful
Description = Deployment completed successfully
Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
web         3        3       3        0          2022-01-26T20:41:02Z
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
172ed8d3  bcd3edfe  web         0        run      complete  36s ago    5s ago
63722979  39ec5cf0  web         0        run      complete  2m10s ago  1m35s ago
cfdce531  64551e34  web         0        run      running   2m10s ago  1m59s ago
e5644f89  64551e34  web         0        run      running   2m10s ago  1m59s ago
Alloc status is as follows (the alloc logs are immediately GC'd, which is also suspicious):
$ nomad alloc status 172ed8d3
ID = 172ed8d3-b044-6a21-9575-42a2ae4f0ffa
Eval ID = 8c932c8d
Name = httpd.web[1]
Node ID = bcd3edfe
Node Name = nomad-client1
Job ID = httpd
Job Version = 0
Client Status = complete
Client Description = All tasks have completed
Desired Status = run
Desired Description = <none>
Created = 4m49s ago
Modified = 4m18s ago
Allocation Addresses (mode = "bridge")
Label  Dynamic  Address
*www   yes      10.0.2.15:30147 -> 8001
Task "http" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/128 MHz  48 KiB/128 MiB  300 MiB
Task Events:
Started At = 2022-01-26T20:32:26Z
Finished At = 2022-01-26T20:32:55Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time                  Type        Description
2022-01-26T20:32:55Z  Killed      Task successfully killed
2022-01-26T20:32:55Z  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-01-26T20:32:50Z  Killing     Sent interrupt. Waiting 5s before force killing
2022-01-26T20:32:26Z  Started     Task started by client
2022-01-26T20:32:26Z  Task Setup  Building Task Directory
2022-01-26T20:32:25Z  Received    Task received by client
The client for this new allocation on the new node has the following log entry:
2022-01-26T20:32:50.432Z [DEBUG] client: stopping alloc for stop_after_client_disconnect: alloc=172ed8d3-b044-6a21-9575-42a2ae4f0ffa
Firing off new evals results in the same behavior:
- If I run nomad job run again, I get a new allocation that exits.
- When I reconnected the client, a new eval was fired and a new allocation was launched, but it faced the same fate and was marked complete immediately.
The status is always:
Client Status = complete
Client Description = All tasks have completed
Desired Status = run
Desired Description = <none>
My hypothesis is that the replacement allocations are being "tainted" in some way by the scheduler and that's getting picked up by the client? But I don't have a theory as to the mechanism for that.
tgross changed the title from "allocs rescheduled for stop_after_client_disconnect are also stopped" to "stop_after_client_disconnect causes incorrect alloc stops" on Jan 26, 2022
On further testing it looks like this happens whether or not the allocation is rescheduled. After some time, even on a stable cluster without any disconnects, all of the allocations eventually stop and are marked complete.
Hi @morphine1900, it is not currently solved. Once a PR is raised to fix the issue, it will be linked here, and this issue closed once it has been merged.
Hi @jrasell, thanks for the reply. Is there any plan to fix it?