Description
A customer reports a test cluster getting wedged (no queries going through) while simulating a multi-node outage. The scenario is a 15-node cluster configured with 15-way replication (it was confirmed that every zone config specifies 15-way replication). During the test, a set of "ping" queries is run against each node in the cluster; these are simple `SELECT * FROM <table> LIMIT 1` queries, one for each of the customer's tables (~10). Five of the nodes are taken down and left down for 5 minutes so that the cluster reports them as dead, then brought back up. Fairly quickly, 2 of the restarted nodes start showing up in the logs of every node with the error in the title, e.g. `n7 has been removed from the cluster`. After several minutes of waiting, the "ping" queries suddenly stop returning. The wedge appears to affect one particular table.
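For concreteness, a minimal sketch of such a ping loop, in Go with the standard `database/sql` package and the `lib/pq` pgwire driver (the node addresses and table names below are hypothetical; the real test targets all 15 nodes and ~10 tables):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Hypothetical addresses and table names, standing in for the
	// customer's 15 nodes and ~10 tables.
	nodes := []string{"node1:26257", "node2:26257"}
	tables := []string{"t1", "t2"}
	for _, addr := range nodes {
		db, err := sql.Open("postgres",
			fmt.Sprintf("postgresql://root@%s/defaultdb?sslmode=disable", addr))
		if err != nil {
			log.Fatal(err)
		}
		for _, tbl := range tables {
			// The "ping": fetch (and discard) at most one row per table.
			rows, err := db.Query(fmt.Sprintf("SELECT * FROM %s LIMIT 1", tbl))
			if err != nil {
				log.Printf("ping failed on %s (%s): %v", addr, tbl, err)
				continue
			}
			rows.Next()
			if err := rows.Err(); err != nil {
				log.Printf("ping failed on %s (%s): %v", addr, tbl, err)
			}
			rows.Close()
		}
		db.Close()
	}
}
```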
It is unclear whether the `has been removed from the cluster` errors are related to the wedge, but they are disconcerting, as they absolutely shouldn't happen. The code which generates this error is `Gossip.getNodeDescriptorLocked`. The error is returned only if a node descriptor is present in the info store but contains a zero node ID or an empty address. It is extremely curious that the node descriptor is not in the cached `Gossip.nodeDescs` map.
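A paraphrased sketch of the check described above (this is not the actual CockroachDB source; the type and function names are simplified assumptions):

```go
package main

import "fmt"

// NodeDescriptor is a simplified stand-in for the gossiped descriptor.
type NodeDescriptor struct {
	NodeID  int32
	Address string
}

// getNodeDescriptor mirrors the described logic: the "removed from the
// cluster" error fires only when a descriptor *is* present in the info
// store but carries a zero node ID or an empty address.
func getNodeDescriptor(infoStore map[int32]*NodeDescriptor, nodeID int32) (*NodeDescriptor, error) {
	desc, ok := infoStore[nodeID]
	if !ok {
		return nil, fmt.Errorf("unable to look up descriptor for n%d", nodeID)
	}
	if desc.NodeID == 0 || desc.Address == "" {
		return nil, fmt.Errorf("n%d has been removed from the cluster", nodeID)
	}
	return desc, nil
}

func main() {
	// An empty-but-present descriptor reproduces the error condition.
	infoStore := map[int32]*NodeDescriptor{7: {}}
	if _, err := getNodeDescriptor(infoStore, 7); err != nil {
		fmt.Println(err) // n7 has been removed from the cluster
	}
}
```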
This problem was seen on 2.1.3 (to be precise, it was seen on a custom 2.1.x binary built from the same SHA that became 2.1.3). The cluster has a relatively modest amount of data (a little over 100 ranges).