kvserver: fix delaying of splits with uninitialized followers #64060

Merged · 2 commits · Apr 24, 2021

Commits on Apr 24, 2021

  1. kvserver: fix delaying of splits with uninitialized followers

    Bursts of splits (i.e. a sequence of splits in which each split splits
    the right-hand side of a prior split) can cause issues. This is because
    splitting a range in which a replica needs a snapshot results in two
    ranges in which a replica needs a snapshot, and additionally imposes an
    ordering between the two snapshots (one cannot apply a snapshot for the
    post-split replica until the pre-split replica has moved out of the
    way). The result of a long burst of splits, such as occurs in RESTORE
    and IMPORT operations, is then an overloaded snapshot queue with lots
    of wasted work, unavailable followers with operations hanging on them,
    and general mayhem in the logs. Since bulk operations also write a lot
    of data to the raft logs, log truncations then create an additional
    snapshot burden; in short, everything will be unhappy for a few hours
    and the cluster may effectively be unavailable.
    
    This isn't news to us and in fact was a big problem "back in 2018".
    When we first started to understand the issue, we introduced a mechanism
    that would delay splits (cockroachdb#32594) with the desired effect of ensuring
    that all followers had caught up to ~all of the previous splits.
    This helped, but didn't explain why we were seeing snapshots in the
    first place.
    
    Investigating more, we realized that snapshots were sometimes spuriously
    requested by an uninitialized replica on the right-hand side which was
    contacted by another member of the right-hand side that had already been
    initialized by the split executing on the left-hand side; this snapshot
    was almost always unnecessary, since the local left-hand side would
    usually initialize the right-hand side moments later. To address this,
    in cockroachdb#32594 we started unconditionally dropping roughly the
    first few seconds' worth of requests to an uninitialized range; the
    mechanism was improved in cockroachdb#32782 and now only does this if a
    local neighboring replica is expected to perform the split soon (a
    sketch of this drop decision follows below).
    
    With all this in place, you would've expected us to have all bases
    covered, but it turns out that we were still running into issues prior
    to this PR.
    
    Concretely, whenever the aforementioned mechanism drops a message from
    the leader (a MsgApp), the leader will only contact the replica once per
    second until it responds. It responds once it has been initialized via
    its left neighbor's split and the leader reaches out again, i.e. on
    average ~500ms after being initialized. However, by that time it is
    itself already at the bottom of a long chain of splits, and the 500ms
    delay adds to how long it takes for the rest of the chain to get
    initialized. Since the delay compounds on each link of the chain, the
    depth of the chain effectively determines the total delay experienced
    at the end (for example, a chain twenty splits deep accumulates ~10s of
    delay at its tail). This would eventually exceed the patience of the
    mechanism suppressing the snapshots, snapshots would be requested, and
    we would descend into madness similar to that experienced in the
    absence of the mechanism in the first place.
    
    The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
    did not, as it routinely missed the fact that followers were not
    initialized yet. This is because during a split burst, the replica
    orchestrating the split was typically only created an instant before,
    and its raft group hadn't properly transitioned to leader status yet.
    This meant that in effect it wasn't delaying the splits at all.
    
    This commit adjusts the split-delay logic to avoid this problem. While
    the raft group is still campaigning for leadership, the delay is
    upheld. Once it has collapsed into a definite state, the existing logic
    pretty much does the right thing, as it waits for the right-hand side
    to be initialized (see the sketch below).
    
    Release note (bug fix): Fixed a scenario in which a rapid sequence
    of splits could trigger a storm of Raft snapshots. This would be
    accompanied by log messages of the form "would have dropped incoming
    MsgApp, but allowing due to ..." and tended to occur as part of
    RESTORE/IMPORT operations.
    tbg committed Apr 24, 2021 (commit f21720a)
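
    The sketch below illustrates the drop decision described above. All
    names here (`replicaState`, `maybeDropMsg`, `localNeighborWillSplit`,
    `maxDelay`) are hypothetical stand-ins, not the actual
    cockroachdb/cockroach code: an uninitialized replica drops incoming
    MsgApps only while a local left-hand neighbor is expected to initialize
    it via a split, and only for a bounded amount of time before falling
    back to the snapshot path.

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical stand-ins for illustration only; not the kvserver API.
    type incomingMsg struct {
        isMsgApp    bool
        firstSeenAt time.Time // when this uninitialized replica first saw leader traffic
    }

    type replicaState struct {
        initialized            bool
        localNeighborWillSplit bool // a local LHS replica is expected to run the split soon
    }

    // maybeDropMsg reports whether the message should be dropped instead of
    // letting the uninitialized replica ask the leader for a (likely
    // unnecessary) snapshot.
    func maybeDropMsg(r replicaState, m incomingMsg, now time.Time, maxDelay time.Duration) bool {
        if r.initialized || !m.isMsgApp {
            return false // only MsgApps to uninitialized replicas are candidates
        }
        if !r.localNeighborWillSplit {
            return false // no local split expected; let the snapshot machinery run
        }
        // Drop only for a bounded time; past that, fall back to requesting a snapshot.
        return now.Sub(m.firstSeenAt) < maxDelay
    }

    func main() {
        r := replicaState{initialized: false, localNeighborWillSplit: true}
        m := incomingMsg{isMsgApp: true, firstSeenAt: time.Now()}
        fmt.Println(maybeDropMsg(r, m, time.Now(), 5*time.Second)) // true: wait for the local split
    }
    ```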
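
    And a minimal sketch of the adjusted split-delay logic, again with
    hypothetical names (`fakeReplica`, `delaySplit`, `budget`) rather than
    the real kvserver types: the delay is upheld while leadership is still
    undetermined, and afterwards until the followers created by the
    previous split are initialized, up to an overall budget.

    ```go
    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical stand-ins for illustration only; not the kvserver code.
    type fakeReplica struct {
        start            time.Time
        leaderKnownAfter time.Duration // when the raft group settles on a leader
        initializedAfter time.Duration // when the RHS followers become initialized
    }

    func (r *fakeReplica) leadershipDetermined() bool { return time.Since(r.start) > r.leaderKnownAfter }
    func (r *fakeReplica) followersInitialized() bool { return time.Since(r.start) > r.initializedAfter }

    // delaySplit keeps delaying while leadership is still being contested, and
    // afterwards until the followers are initialized, up to an overall budget.
    // The bug described above: "still campaigning" used to look like "nothing
    // to wait for", so split bursts were not delayed at all.
    func delaySplit(r *fakeReplica, budget, tick time.Duration) {
        deadline := time.Now().Add(budget)
        for time.Now().Before(deadline) {
            if r.leadershipDetermined() && r.followersInitialized() {
                return // safe to carry out the next split
            }
            time.Sleep(tick)
        }
        // Budget exhausted: proceed anyway rather than blocking the split forever.
    }

    func main() {
        r := &fakeReplica{start: time.Now(), leaderKnownAfter: 20 * time.Millisecond, initializedAfter: 60 * time.Millisecond}
        delaySplit(r, time.Second, 5*time.Millisecond)
        fmt.Println("proceeding with split after", time.Since(r.start).Round(time.Millisecond))
    }
    ```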
  2. kvserver: enqueue update check after Campaign()

    I noticed that TestSplitBurstWithSlowFollower would average only
    <10 splits per second even when no lagging replica is introduced (i.e.
    with the `time.Sleep` commented out). Investigating the raft chatter
    suggested that after campaigning, the leaseholder of the right-hand
    side would not be processed by its Store for a `Ready` handling cycle
    until ~50ms (the coalesced heartbeat timeout in the test) had passed,
    and a similar delay was observed when it was sending out its votes.
    Adding a call to `enqueueRaftUpdateCheck` fixes both (sketched below);
    we end up at ~100 splits per second.
    
    Release note: None
    tbg committed Apr 24, 2021 (commit 9282a53)
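
    A minimal sketch of the idea, using hypothetical `raftScheduler` and
    `replica` types rather than the actual kvserver ones: after
    campaigning, the replica explicitly enqueues itself for a `Ready`
    handling cycle so that its vote requests go out immediately instead of
    waiting for the next scheduled event (~50ms away in the test).

    ```go
    package main

    import "fmt"

    // Hypothetical stand-ins for illustration only; not the kvserver code.
    type raftScheduler struct{ pending chan int64 }

    // enqueueRaftUpdateCheck asks the Ready-processing loop to look at this
    // range soon instead of waiting for the next tick or coalesced heartbeat.
    func (s *raftScheduler) enqueueRaftUpdateCheck(rangeID int64) {
        select {
        case s.pending <- rangeID:
        default: // already queued; the processing loop will get to it
        }
    }

    type replica struct {
        rangeID int64
        sched   *raftScheduler
    }

    func (r *replica) campaign() {
        // ... the real code would call the raft group's Campaign() here ...
        // Without the explicit enqueue, the votes produced by campaigning sit
        // in the Ready until some later event happens to process this replica.
        r.sched.enqueueRaftUpdateCheck(r.rangeID)
    }

    func main() {
        s := &raftScheduler{pending: make(chan int64, 1)}
        r := &replica{rangeID: 42, sched: s}
        r.campaign()
        fmt.Println("update check enqueued for range", <-s.pending)
    }
    ```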