During a jobspec change to the redis-server task, the cluster did not automatically rebuild #34

jcjones · 2022-08-13T00:29:01Z

I renamed the Redis task from server to redis-server and redeployed the jobspec. Each allocation restarted with the new task name but failed to rejoin the cluster until it was restarted a second time (via the Nomad GUI).

Attache-control logs:

time="2022-08-13T00:20:54Z" level=info msg="starting /usr/local/bin/attache-control"
time="2022-08-13T00:20:54Z" level=info msg="initializing a new redis client"
time="2022-08-13T00:20:54Z" level=info msg="initializing a new consul client"
time="2022-08-13T00:20:54Z" level=info msg="fetching scaling options from consul path 'service/redis-cluster/scaling'"
time="2022-08-13T00:20:57Z" level=info msg="this node is already part of an existing cluster"
time="2022-08-13T00:20:57Z" level=info msg="running until killed..."

Redis, however, is not part of a cluster:

10.0.32.81:20001> cluster nodes
0bd16fb965741d36e64304458b4f0264c248d25e 10.0.32.81:20001@30001 myself,master - 0 0 0 connected
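
The mismatch above (attache-control reporting cluster membership while `cluster nodes` lists only the node itself) could be detected mechanically. The following is a minimal sketch, not code from attache-control: `knownPeers` is a hypothetical helper that counts the entries in `CLUSTER NODES` output whose flags field does not include `myself`. A count of zero means the node knows no other cluster member.

```go
package main

import (
	"fmt"
	"strings"
)

// knownPeers is a hypothetical helper (not part of attache-control).
// It counts entries in CLUSTER NODES output that are not flagged "myself".
// Each entry has the form: <id> <ip:port@cport> <flags> <master> ...
func knownPeers(clusterNodes string) int {
	peers := 0
	for _, line := range strings.Split(clusterNodes, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // skip blank or malformed lines
		}
		isSelf := false
		for _, flag := range strings.Split(fields[2], ",") {
			if flag == "myself" {
				isSelf = true
			}
		}
		if !isSelf {
			peers++
		}
	}
	return peers
}

func main() {
	// The output reported in this issue: a single entry, flagged "myself".
	out := "0bd16fb965741d36e64304458b4f0264c248d25e 10.0.32.81:20001@30001 myself,master - 0 0 0 connected\n"
	if knownPeers(out) == 0 {
		fmt.Println("node knows no peers: not joined to a cluster")
	}
}
```

A check like this would only flag the symptom; it says nothing about why the join was never attempted.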

The Redis log is nearly empty, with no mention of being told to join:

1:C 13 Aug 2022 00:20:46.789 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 13 Aug 2022 00:20:46.789 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 13 Aug 2022 00:20:46.789 # Configuration loaded
1:M 13 Aug 2022 00:20:46.796 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
1:M 13 Aug 2022 00:20:46.796 # Server initialized
1:M 13 Aug 2022 00:20:46.796 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 13 Aug 2022 00:20:46.847 # IP address for this node updated to 10.0.32.81

This is trivially fixable with operator intervention by just restarting the alloc again.

jcjones · 2022-08-13T00:33:25Z

Restarting just attache-control is not sufficient.

jcjones · 2022-08-15T17:31:23Z

This appears to have been caused by the supplied Consul server being unreachable because of a firewall, and it took many minutes for the connection failure to be logged.

I think there is still room for improvement in that error case, but I don't know exactly what yet.

jcjones · 2022-08-15T18:34:42Z

One significant thing to do here is to make it clearer whether Consul is timing out. I think that might be it for this issue, and we might do other things to improve the deployment model.
