Nodes and their services keep appearing and disappearing from the catalog #5518

Closed
ShimmerGlass opened this issue Mar 20, 2019 · 2 comments · Fixed by #5520
Labels
type/bug Feature does not function as expected

Comments

@ShimmerGlass
Contributor

Overview of the Issue

We are seeing nodes and their services disappear from and reappear in the catalog every few minutes without any change on our side (no API calls on agents or servers, no agent reload). Affected nodes keep flapping this way until we intervene.

When this happens, the health checks registered on the node do not change status: they stay passing until they are deregistered, and they are passing again once they are re-registered. There are no unusual logs on either the affected nodes' agents or the servers.

While debugging this issue we found a scenario that can explain it:

  1. A node is provisioned with name foo and a stable node-id that will not change.
  2. It joins its peers' memberlists and is registered in the catalog.
  3. The node is renamed to bar and keeps the same node-id.
  4. The old memberlist entry with name foo is kept and a new one is created with name bar.
  5. The old node is deleted from the catalog in this block https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L416 and the node syncs its services again.
  6. The old memberlist entry (foo) is still checked and reported to the leader.
  7. No node with this name exists in the catalog (https://github.com/hashicorp/consul/blob/master/agent/consul/leader.go#L1283), so a registration request is dispatched.
  8. Again, no node with this name exists, so we register it (https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L310).
  9. We realize that a node with this node-id but a different name (bar) already exists, so we delete it (https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L416).
  10. We register the node with the old name (foo).
  11. Note that here we do not register any services or checks, since this registration only comes from a serf event.
  12. The node performs its anti-entropy sync.
  13. foo is deleted and bar is registered again along with its services and checks.
  14. GOTO step 6
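
To make the loop in the steps above easier to follow, here is a minimal Go sketch of it. This is not Consul code: `Catalog`, `Register`, `ReconcileFailedMember`, `AntiEntropySync` and `Simulate` are hypothetical names for a toy model that only encodes the behaviour described in this issue (a registration deletes any node with the same node-id but a different name, and a serf-driven registration carries no services).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Node is a toy catalog entry (hypothetical, not the Consul struct).
type Node struct {
	ID       string
	Services []string
}

// Catalog is a toy stand-in for the Consul catalog, keyed by node name.
type Catalog struct {
	nodes map[string]Node
}

func NewCatalog() *Catalog { return &Catalog{nodes: map[string]Node{}} }

// Has reports whether a node with this name is in the catalog.
func (c *Catalog) Has(name string) bool {
	_, ok := c.nodes[name]
	return ok
}

// Register models the rename handling from steps 5 and 9: any node with the
// same node-id but a different name is deleted before the entry is written.
func (c *Catalog) Register(name string, n Node) {
	for other, existing := range c.nodes {
		if existing.ID == n.ID && other != name {
			delete(c.nodes, other)
		}
	}
	c.nodes[name] = n
}

// ReconcileFailedMember models steps 6 to 11: the leader sees a failed serf
// member that is missing from the catalog and registers it without services.
func (c *Catalog) ReconcileFailedMember(name, id string) {
	if !c.Has(name) {
		c.Register(name, Node{ID: id})
	}
}

// AntiEntropySync models steps 12 and 13: the agent re-registers itself
// under its current name with all of its services.
func (c *Catalog) AntiEntropySync(name string, n Node) {
	c.Register(name, n)
}

// Snapshot renders the catalog as sorted "name:serviceCount" pairs.
func (c *Catalog) Snapshot() string {
	entries := make([]string, 0, len(c.nodes))
	for name, n := range c.nodes {
		entries = append(entries, fmt.Sprintf("%s:%d", name, len(n.Services)))
	}
	sort.Strings(entries)
	return strings.Join(entries, ",")
}

// Simulate replays the loop and records the catalog after each half-step.
func Simulate(rounds int) []string {
	c := NewCatalog()
	bar := Node{ID: "node-id-1", Services: []string{"web"}}
	c.Register("bar", bar) // state after the rename (step 5)

	var trace []string
	for i := 0; i < rounds; i++ {
		c.ReconcileFailedMember("foo", bar.ID) // stale memberlist entry
		trace = append(trace, c.Snapshot())
		c.AntiEntropySync("bar", bar) // agent restores its own entry
		trace = append(trace, c.Snapshot())
	}
	return trace
}

func main() {
	for _, state := range Simulate(2) {
		fmt.Println(state)
	}
}
```

In this model, `Simulate(2)` yields `foo:0`, `bar:1`, `foo:0`, `bar:1`: the catalog alternates between a service-less foo and a fully registered bar, which matches the flapping we observe.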

Restarting the leader fixes the issue. We then tested our theory by force-leaving the node's old name, without restarting any agent or server. This fixed the issue as well, confirming our theory.

Consul info for both Client and Server

Consul version 1.3.1 with patches, on CentOS 7, on both servers and clients

ShimmerGlass pushed a commit to ShimmerGlass/consul that referenced this issue Mar 20, 2019
When receiving a serf failed message for a node which is not in the
catalog, do not perform a register request to set its serf health to
critical, as it could overwrite the node information and services if it
was renamed.

Fixes : hashicorp#5518
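
The guard described in this commit message can be sketched roughly as follows. This is an illustration only and not the code from the pull request: `Member`, `MemberStatus` and `shouldRegister` are hypothetical names, and the catalog lookup is passed in as a plain function.

```go
package main

import "fmt"

// MemberStatus is a toy version of a serf member status (hypothetical).
type MemberStatus int

const (
	StatusAlive MemberStatus = iota
	StatusFailed
)

// Member is a toy memberlist entry (hypothetical).
type Member struct {
	Name   string
	Status MemberStatus
}

// shouldRegister reports whether the leader should dispatch a register
// request for this member. A failed member that is absent from the catalog
// is skipped, so a stale entry left over from a rename cannot overwrite the
// node that now owns the same node-id.
func shouldRegister(m Member, inCatalog func(name string) bool) bool {
	if m.Status == StatusFailed && !inCatalog(m.Name) {
		return false
	}
	return true
}

func main() {
	// Only bar is in the catalog; foo is the stale memberlist entry.
	inCatalog := func(name string) bool { return name == "bar" }

	fmt.Println(shouldRegister(Member{Name: "foo", Status: StatusFailed}, inCatalog))
	fmt.Println(shouldRegister(Member{Name: "bar", Status: StatusFailed}, inCatalog))
}
```

With this check, the stale foo entry is ignored, while a failed member that is still in the catalog is registered as before so that its serf health can be set to critical.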
@ShimmerGlass
Contributor Author

This is what it looks like in consul-templaterb timelines:

[image: consul-templaterb timeline screenshot]

ShimmerGlass pushed a commit to criteo-forks/consul that referenced this issue Mar 21, 2019
@ChipV223 ChipV223 reopened this Mar 25, 2019
@ShimmerGlass
Contributor Author

Just realized the PR was not properly linked to this issue: #5520

@pearkes pearkes added type/bug Feature does not function as expected waiting-pr-merge labels Apr 3, 2019
ShimmerGlass pushed a commit to criteo-forks/consul that referenced this issue Apr 4, 2019
banks pushed a commit that referenced this issue Apr 26, 2019
#5520)

When receiving a serf failed message for a node which is not in the
catalog, do not perform a register request to set its serf health to
critical, as it could overwrite the node information and services if it
was renamed.

Fixes : #5518