
stability: Restarting nodes with 100k replicas causes unavailability #35063

Closed
bdarnell opened this issue Feb 19, 2019 · 2 comments
Labels
C-investigation Further steps needed to qualify. C-label will change.

Comments

@bdarnell
Contributor

A forum user reported problems after restarting nodes in their cluster. The restart was done to upgrade from 2.1.3 to 2.1.4, but downgrading to 2.1.3 did not fix the problem, so it appears to have been caused by the restart rather than by the version upgrade. Nodes were restarted one at a time with 1-2 minutes between them. Hardware configuration: 5x 12-vCPU nodes on GCP, each with three 1 TB SSDs attached (configured as separate stores). The cluster was idle at the time of the restart.

Symptoms during the outage:

  • The following metrics were much higher than the baseline and increased over the duration of the outage: CPU usage, disk write ops, raft append and election messages (MsgApp, MsgVote, MsgPreVote, and their responses), and network bytes received. Disk write ops increased at a steeper slope than CPU usage.
  • These metrics were much higher than the baseline but relatively flat: KV transaction latency, node heartbeat latency, raft heartbeat messages, raft time, and pending heartbeats. KV transaction latency and node heartbeat latency p99s were high (over a second) but not increasing.
  • Nodes were missing their heartbeats and getting marked as non-live, which led to confusing range counts in the UI (when nodes pop in and out of "live" status, there is ambiguity about which node is responsible for counting a given range, and ranges can be double-counted).

After about 9 hours, the cluster suddenly stabilized on its own. Most metrics returned to normal (and the cluster appeared to be functioning normally), but pending heartbeats, raft time, and network bytes received stayed high for several more hours. During this phase the unquiesced replica count was high but gradually declining. When unquiesced replicas got down to about 33% of the total (rough eyeballing), raft time and pending heartbeats started to drop steeply. All of these metrics hit zero at about the same time, at which point the cluster appeared to be completely back to normal.

Questions for investigation:

  • Why is restarting nodes so disruptive? Do we need to throttle range wakeups to prevent them from overloading the cluster? (A rough sketch of what such throttling could look like follows this list.)
  • Why was disk I/O steadily increasing until it suddenly stopped?
  • After the recovery, I'd expect raft time and pending heartbeats to be proportional to unquiesced replicas. Why did those graphs have a different shape?
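
To make the throttling question concrete, here is a minimal, standalone sketch of pacing range wakeups with a fixed-size semaphore. The names (`wakeRange`, `maxConcurrentWakeups`) are hypothetical and not CockroachDB APIs; this only illustrates the general idea of bounding how many ranges a node wakes concurrently after a restart.

```go
// Hypothetical sketch: throttle range wakeups with a fixed-size semaphore.
// None of these names are CockroachDB internals.
package main

import (
	"fmt"
	"sync"
	"time"
)

// wakeRange stands in for whatever a range wakeup does
// (raft election, lease acquisition, etc.).
func wakeRange(rangeID int) {
	time.Sleep(10 * time.Millisecond) // simulate work
	fmt.Printf("range %d woken\n", rangeID)
}

func main() {
	const maxConcurrentWakeups = 8 // tunable; bounds per-node churn

	sem := make(chan struct{}, maxConcurrentWakeups)
	var wg sync.WaitGroup

	// With 100k replicas per node, waking them all at once hammers CPU,
	// disk, and the network; the semaphore paces the wakeups instead.
	for rangeID := 1; rangeID <= 100; rangeID++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrentWakeups are in flight
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }()
			wakeRange(id)
		}(rangeID)
	}
	wg.Wait()
}
```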

Since this was a planned restart, we could investigate the shutdown/draining process to make sure it's transferring leases away appropriately. This has a fixed 5s timeout (raftLeadershipTransferWait), which is likely too short for such a large replica count. This should probably be increased or made proportional to the number of replicas. However, not all restarts are planned, so it's more important to ensure that even an unclean restart doesn't result in cluster-breaking amounts of traffic.
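
As a rough illustration of the "proportional to the number of replicas" idea, here is a hypothetical sketch. The per-replica budget and the cap are made-up numbers, and the function name is illustrative; the real raftLeadershipTransferWait scaling would need to be tuned against observed lease/leadership transfer rates.

```go
// Hypothetical sketch: scale the drain-time leadership-transfer wait with
// the node's replica count instead of using a fixed 5s. Constants are
// illustrative only.
package main

import (
	"fmt"
	"time"
)

const (
	baseLeadershipTransferWait = 5 * time.Second        // today's fixed timeout
	perReplicaTransferBudget   = 500 * time.Microsecond // assumed per-replica budget
	maxLeadershipTransferWait  = 5 * time.Minute        // cap so drains still terminate
)

// leadershipTransferWait returns how long a draining node would wait for
// lease/leadership transfers to complete, given its replica count.
func leadershipTransferWait(replicaCount int) time.Duration {
	wait := baseLeadershipTransferWait +
		time.Duration(replicaCount)*perReplicaTransferBudget
	if wait > maxLeadershipTransferWait {
		wait = maxLeadershipTransferWait
	}
	return wait
}

func main() {
	for _, n := range []int{1000, 20000, 100000} {
		fmt.Printf("%6d replicas -> wait %s\n", n, leadershipTransferWait(n))
	}
}
```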

@bdarnell
Contributor Author

I think this may be related to what I was talking about in this comment. The replica scanner can wake ranges without attempting to acquire a lease, which causes them to hold raft elections, but if they don't get a lease they can't quiesce. (Eventually the consistency queue will force them to acquire a lease, but that queue runs on a 24-hour cycle.)
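
A simplified illustration of the invariant described above: a replica only quiesces once it is the raft leader and holds a valid lease, so a range woken by the scanner without a lease keeps ticking (and electing) until something forces a lease acquisition. All names here are hypothetical, not CockroachDB internals.

```go
// Hypothetical sketch of the quiescence condition: no lease, no quiescence.
package main

import "fmt"

type replicaState struct {
	isRaftLeader     bool
	hasValidLease    bool
	pendingProposals int
}

// canQuiesce mirrors the invariant described in the comment above.
func canQuiesce(r replicaState) bool {
	return r.isRaftLeader && r.hasValidLease && r.pendingProposals == 0
}

func main() {
	wokenByScanner := replicaState{isRaftLeader: true, hasValidLease: false}
	fmt.Println("woken by scanner, no lease -> quiesce?", canQuiesce(wokenByScanner))

	afterLeaseAcquired := replicaState{isRaftLeader: true, hasValidLease: true}
	fmt.Println("after lease acquired       -> quiesce?", canQuiesce(afterLeaseAcquired))
}
```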

The biggest thing I can't explain at this point is why the raft election traffic increased steadily during the 9 hours of the outage before dropping off suddenly (I think this was the root cause of the corresponding increases in raft append traffic and CPU/disk utilization). My expectation is that the replica scanner issue would cause elections to peak about 10 minutes after the start of the incident and then gradually decline.

@nvanbenschoten
Member

Rediscovered in #56851. Fixed by #56860.
