Nodes and their services keep appearing and disappearing from the catalog #5518

ShimmerGlass · 2019-03-20T10:40:27Z

Overview of the Issue

We are seeing nodes and their services disappear from and reappear in the catalog every few minutes without any change on our side (no API calls on agents or servers, no agent reload). Affected nodes keep appearing and disappearing until we take action.

When this happens, the health checks registered on the node do not change status: they stay passing until they are deregistered, and are passing when they are registered again. There are no unusual logs on either the affected nodes' agents or the servers.

While debugging this issue we found a scenario that can explain it:

1. A node is provisioned with name foo and a stable node-id that will not change.
2. It joins its peers' memberlists and is registered in the catalog.
3. The node is renamed to bar and keeps the same node-id.
4. The old memberlist entry with name foo is kept and a new one is created with name bar.
5. The old node is deleted from the catalog in this block https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L416 and the agent syncs its services again.
6. The old memberlist entry is still checked and reported to the leader.
7. No node with this name exists in the catalog (https://github.com/hashicorp/consul/blob/master/agent/consul/leader.go#L1283) so a registration request is dispatched.
8. Again, no node with this name, so we register it (https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L310).
9. We realize that a node with this node-id but a different name (bar) already exists so we delete it (https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L416).
10. We register the node with the old name (foo).
    Note that here we do not register any services or checks since this only comes from a serf event.
11. The node performs its anti-entropy sync.
12. foo is deleted and bar is registered again along with its services and checks.
13. GOTO step 6
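
The loop above can be reproduced with a toy model. The following is only a sketch under stated assumptions: `Catalog`, `Register`, `ReconcileMember` and `AntiEntropySync` are hypothetical names invented for illustration and are not Consul's state store or leader code. The only behavior modeled is the one described in the steps: registering a node deletes any other node that shares its node-id but has a different name.

```go
package main

import "fmt"

// Node is a simplified catalog entry (hypothetical, not Consul's struct).
type Node struct {
	Name     string
	ID       string
	Services []string
}

// Catalog is a toy stand-in for the server catalog, keyed by node name.
type Catalog struct {
	nodes map[string]Node
}

func NewCatalog() *Catalog { return &Catalog{nodes: map[string]Node{}} }

// Has reports whether a node with this name is in the catalog.
func (c *Catalog) Has(name string) bool {
	_, ok := c.nodes[name]
	return ok
}

// Services returns the services registered for the named node.
func (c *Catalog) Services(name string) []string { return c.nodes[name].Services }

// Register models the rename handling from the steps above: any node with
// the same ID but a different name is deleted before the new entry is stored.
func (c *Catalog) Register(n Node) {
	for name, existing := range c.nodes {
		if existing.ID == n.ID && name != n.Name {
			delete(c.nodes, name)
		}
	}
	c.nodes[n.Name] = n
}

// ReconcileMember models the leader reacting to a memberlist entry: if no
// node with that name is in the catalog, it registers a bare node without
// services or checks.
func (c *Catalog) ReconcileMember(name, id string) {
	if !c.Has(name) {
		c.Register(Node{Name: name, ID: id})
	}
}

// AntiEntropySync models the agent re-registering itself with its services.
func (c *Catalog) AntiEntropySync(name, id string, services []string) {
	c.Register(Node{Name: name, ID: id, Services: services})
}

func main() {
	const id = "stable-node-id" // assumed value, for illustration only
	c := NewCatalog()
	c.AntiEntropySync("bar", id, []string{"web"}) // renamed node is registered

	for i := 0; i < 3; i++ {
		c.ReconcileMember("foo", id) // stale memberlist entry reaches the leader
		fmt.Printf("after reconcile:    foo=%v bar=%v\n", c.Has("foo"), c.Has("bar"))
		c.AntiEntropySync("bar", id, []string{"web"})
		fmt.Printf("after anti-entropy: foo=%v bar=%v\n", c.Has("foo"), c.Has("bar"))
	}
}
```

In this model foo and bar alternate in the catalog on every iteration, and bar's services vanish each time the stale entry is reconciled, which matches the flapping we observe.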

Restarting the leader fixes the issue. We then tested our theory by force-leaving the node's old name, without any restart on the agent or server side. This fixed the issue as well, confirming our theory.

Consul info for both Client and Server

Consul version 1.3.1 with patches on CentOS 7 on both servers and clients

When receiving a serf failed message for a node which is not in the catalog, do not perform a register request to set its serf health to critical, as it could overwrite the node information and services if it was renamed. Fixes: hashicorp#5518
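
The guard described in this commit message might look roughly like the sketch below. This is an illustration only, not the code from the PR: `Member`, `MemberStatus`, `catalog` and `handleMember` are hypothetical names, and the catalog is reduced to a map from node name to serf health status.

```go
package main

import "fmt"

// MemberStatus is a simplified serf member state (hypothetical type).
type MemberStatus int

const (
	StatusAlive MemberStatus = iota
	StatusFailed
)

// Member is a simplified memberlist entry (hypothetical type).
type Member struct {
	Name   string
	Status MemberStatus
}

// catalog maps a node name to its serf health status (toy model).
type catalog map[string]string

// handleMember applies a memberlist event to the catalog and reports
// whether a register request would be dispatched.
func handleMember(cat catalog, m Member) bool {
	_, known := cat[m.Name]
	if m.Status == StatusFailed {
		if !known {
			// Guard: a failed member that is not in the catalog may be the
			// stale name of a renamed node. Registering it could overwrite
			// the renamed node's information and services, so do nothing.
			return false
		}
		cat[m.Name] = "critical"
		return true
	}
	cat[m.Name] = "passing"
	return true
}

func main() {
	cat := catalog{"bar": "passing"}
	// Stale entry for the old name: ignored, bar is left untouched.
	fmt.Println(handleMember(cat, Member{Name: "foo", Status: StatusFailed}), cat)
	// A failure of a node that is in the catalog is still recorded.
	fmt.Println(handleMember(cat, Member{Name: "bar", Status: StatusFailed}), cat)
}
```

With this guard the stale foo entry no longer triggers a registration, so the renamed node bar keeps its services and checks.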

ShimmerGlass · 2019-03-20T17:40:50Z

This is what it looks like in consul-templaterb timelines:

ShimmerGlass · 2019-03-25T16:41:44Z

Just realized the PR was not properly linked to this issue: #5520

ShimmerGlass mentioned this issue Mar 20, 2019

Fix: fail messages after a node rename replace the new node definition #5520

Merged

ChipV223 closed this as completed Mar 20, 2019

ChipV223 reopened this Mar 25, 2019

pearkes added the type/bug (Feature does not function as expected) and waiting-pr-merge labels Apr 3, 2019

banks closed this as completed in #5520 Apr 26, 2019
