
stability: Restarting nodes with 100k replicas causes unavailability #35063

Closed
bdarnell opened this issue Feb 19, 2019 · 2 comments
Labels
C-investigation Further steps needed to qualify. C-label will change.

Comments

@bdarnell
Contributor

A forum user reported problems after restarting nodes in their cluster. The restart was done to upgrade from 2.1.3 to 2.1.4, but downgrading to 2.1.3 did not fix the problem, so it appears to have been caused by the restart rather than by the version upgrade. Nodes were restarted one at a time with 1-2 minutes between them. Hardware configuration: 5x 12-vCPU nodes on GCP, each with three 1 TB SSDs attached (configured as separate stores). The cluster was idle at the time of the restart.

Symptoms during the outage:

  • The following metrics were much higher than the baseline and increased over the duration of the outage: CPU usage, disk write ops, raft append and election messages (MsgApp, MsgVote, MsgPreVote, and their responses), and network bytes received. Disk write ops increased at a steeper slope than CPU usage.
  • These metrics were much higher than the baseline but relatively flat: KV transaction latency, node heartbeat latency, raft heartbeat messages, raft time, and pending heartbeats. KV transaction latency and node heartbeat latency p99s were high (over a second) but not increasing.
  • Nodes were missing their heartbeats and getting marked as non-live, which led to confusing range counts in the UI (when nodes pop in and out of "live" status, there is ambiguity about which node is responsible for counting a given range, and ranges can be double-counted).

After about 9 hours, the cluster suddenly stabilized on its own. Most metrics returned to normal (and the cluster appeared to be functioning normally), but pending heartbeats, raft time, and network bytes received stayed high for several more hours. During this phase the unquiesced replica count was high but gradually declining. When unquiesced replicas got down to about 33% of the total (rough eyeballing), raft time and pending heartbeats started to drop steeply. All of these metrics hit zero at about the same time, at which point the cluster appeared to be completely back to normal.

Questions for investigation:

  • Why is restarting nodes so disruptive? Do we need to throttle range wakeups to prevent them from overloading the cluster? (A rough sketch of what such throttling could look like follows this list.)
  • Why was disk I/O steadily increasing until it suddenly stopped?
  • After the recovery, I'd expect raft time and pending heartbeats to be proportional to unquiesced replicas. Why did those graphs have a different shape?
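
To make the throttling question concrete, here is a minimal, standalone sketch of pacing range wakeups with a fixed-size semaphore. The names (`wakeRange`, `maxConcurrentWakeups`) are hypothetical and not CockroachDB APIs; this only illustrates the general idea of bounding how many ranges a node wakes concurrently after a restart.

```go
// Hypothetical sketch: throttle range wakeups with a fixed-size semaphore.
// None of these names are CockroachDB internals.
package main

import (
	"fmt"
	"sync"
	"time"
)

// wakeRange stands in for whatever a range wakeup does
// (raft election, lease acquisition, etc.).
func wakeRange(rangeID int) {
	time.Sleep(10 * time.Millisecond) // simulate work
	fmt.Printf("range %d woken\n", rangeID)
}

func main() {
	const maxConcurrentWakeups = 8 // tunable; bounds per-node churn

	sem := make(chan struct{}, maxConcurrentWakeups)
	var wg sync.WaitGroup

	// With 100k replicas per node, waking them all at once hammers CPU,
	// disk, and the network; the semaphore paces the wakeups instead.
	for rangeID := 1; rangeID <= 100; rangeID++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrentWakeups are in flight
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }()
			wakeRange(id)
		}(rangeID)
	}
	wg.Wait()
}
```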

Since this was a planned restart, we could investigate the shutdown/draining process to make sure it's transferring leases away appropriately. This has a fixed 5s timeout (raftLeadershipTransferWait), which is likely too short for such a large replica count. This should probably be increased or made proportional to the number of replicas. However, not all restarts are planned, so it's more important to ensure that even an unclean restart doesn't result in cluster-breaking amounts of traffic.
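
As a rough illustration of the "proportional to the number of replicas" idea, here is a hypothetical sketch. The per-replica budget and the cap are made-up numbers, and the function name is illustrative; the real raftLeadershipTransferWait scaling would need to be tuned against observed lease/leadership transfer rates.

```go
// Hypothetical sketch: scale the drain-time leadership-transfer wait with
// the node's replica count instead of using a fixed 5s. Constants are
// illustrative only.
package main

import (
	"fmt"
	"time"
)

const (
	baseLeadershipTransferWait = 5 * time.Second        // today's fixed timeout
	perReplicaTransferBudget   = 500 * time.Microsecond // assumed per-replica budget
	maxLeadershipTransferWait  = 5 * time.Minute        // cap so drains still terminate
)

// leadershipTransferWait returns how long a draining node would wait for
// lease/leadership transfers to complete, given its replica count.
func leadershipTransferWait(replicaCount int) time.Duration {
	wait := baseLeadershipTransferWait +
		time.Duration(replicaCount)*perReplicaTransferBudget
	if wait > maxLeadershipTransferWait {
		wait = maxLeadershipTransferWait
	}
	return wait
}

func main() {
	for _, n := range []int{1000, 20000, 100000} {
		fmt.Printf("%6d replicas -> wait %s\n", n, leadershipTransferWait(n))
	}
}
```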

@bdarnell
Contributor Author

I think this may be related to what I was talking about in this comment. The replica scanner can wake ranges without attempting to acquire a lease, which causes them to hold raft elections, but if they don't get a lease they can't quiesce. (Eventually the consistency queue will force them to acquire a lease, but that queue runs on a 24-hour cycle.)
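
A simplified illustration of the invariant described above: a replica only quiesces once it is the raft leader and holds a valid lease, so a range woken by the scanner without a lease keeps ticking (and electing) until something forces a lease acquisition. All names here are hypothetical, not CockroachDB internals.

```go
// Hypothetical sketch of the quiescence condition: no lease, no quiescence.
package main

import "fmt"

type replicaState struct {
	isRaftLeader     bool
	hasValidLease    bool
	pendingProposals int
}

// canQuiesce mirrors the invariant described in the comment above.
func canQuiesce(r replicaState) bool {
	return r.isRaftLeader && r.hasValidLease && r.pendingProposals == 0
}

func main() {
	wokenByScanner := replicaState{isRaftLeader: true, hasValidLease: false}
	fmt.Println("woken by scanner, no lease -> quiesce?", canQuiesce(wokenByScanner))

	afterLeaseAcquired := replicaState{isRaftLeader: true, hasValidLease: true}
	fmt.Println("after lease acquired       -> quiesce?", canQuiesce(afterLeaseAcquired))
}
```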

The biggest thing I can't explain at this point is why the raft election traffic increased steadily during the 9 hours of the outage before dropping off suddenly (I think this was the root cause of the corresponding increases in raft append traffic and CPU/disk utilization). My expectation is that the replica scanner issue would cause elections to peak about 10 minutes after the start of the incident and then gradually decline.

@nvanbenschoten
Member

Rediscovered in #56851. Fixed by #56860.
