Description
A customer reports a test cluster getting wedged (no queries going through) while simulating a multi-node outage. The scenario is a 15-node cluster configured with 15-way replication (it was confirmed that every zone config specifies 15-way replication). During the test, a set of "ping" queries is run against each node in the cluster; these are simple `SELECT * FROM <table> LIMIT 1` queries, one for each of the customer's tables (~10). Five of the nodes are taken down and left down for 5 minutes so that the cluster reports them as dead, then brought back up. Fairly quickly, 2 of the restarted nodes start showing up in the logs of every node with the error in the title, e.g. `n7 has been removed from the cluster`. After several minutes of waiting, the "ping" queries suddenly stop returning. The wedge appears to affect one particular table.
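For concreteness, a minimal sketch of such a ping loop, in Go with the standard `database/sql` package and the `lib/pq` pgwire driver (the node addresses and table names below are hypothetical; the real test targets all 15 nodes and ~10 tables):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Hypothetical addresses and table names, standing in for the
	// customer's 15 nodes and ~10 tables.
	nodes := []string{"node1:26257", "node2:26257"}
	tables := []string{"t1", "t2"}
	for _, addr := range nodes {
		db, err := sql.Open("postgres",
			fmt.Sprintf("postgresql://root@%s/defaultdb?sslmode=disable", addr))
		if err != nil {
			log.Fatal(err)
		}
		for _, tbl := range tables {
			// The "ping": fetch (and discard) at most one row per table.
			rows, err := db.Query(fmt.Sprintf("SELECT * FROM %s LIMIT 1", tbl))
			if err != nil {
				log.Printf("ping failed on %s (%s): %v", addr, tbl, err)
				continue
			}
			rows.Next()
			if err := rows.Err(); err != nil {
				log.Printf("ping failed on %s (%s): %v", addr, tbl, err)
			}
			rows.Close()
		}
		db.Close()
	}
}
```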
It is unclear whether the `has been removed from the cluster` errors are related to the wedge, but they are disconcerting, as they absolutely shouldn't happen. The code which generates this error is `Gossip.getNodeDescriptorLocked`. The error is returned only if a node descriptor is present in the info store but contains a zero node ID or an empty address. It is extremely curious that the node descriptor is not in the cached `Gossip.nodeDescs` map.
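A paraphrased sketch of the check described above (this is not the actual CockroachDB source; the type and function names are simplified assumptions):

```go
package main

import "fmt"

// NodeDescriptor is a simplified stand-in for the gossiped descriptor.
type NodeDescriptor struct {
	NodeID  int32
	Address string
}

// getNodeDescriptor mirrors the described logic: the "removed from the
// cluster" error fires only when a descriptor *is* present in the info
// store but carries a zero node ID or an empty address.
func getNodeDescriptor(infoStore map[int32]*NodeDescriptor, nodeID int32) (*NodeDescriptor, error) {
	desc, ok := infoStore[nodeID]
	if !ok {
		return nil, fmt.Errorf("unable to look up descriptor for n%d", nodeID)
	}
	if desc.NodeID == 0 || desc.Address == "" {
		return nil, fmt.Errorf("n%d has been removed from the cluster", nodeID)
	}
	return desc, nil
}

func main() {
	// An empty-but-present descriptor reproduces the error condition.
	infoStore := map[int32]*NodeDescriptor{7: {}}
	if _, err := getNodeDescriptor(infoStore, 7); err != nil {
		fmt.Println(err) // n7 has been removed from the cluster
	}
}
```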
This problem was seen on 2.1.3 (to be precise, it was seen on a custom 2.1.x binary built from the same SHA that became 2.1.3). The cluster has a relatively modest amount of data (a little over 100 ranges).