Skip to content

Commit

Permalink
kvserver: address migration concern with node liveness
Browse files Browse the repository at this point in the history
In #53842 we introduced a change to always persist a liveness record on
start up. As part of that change, we refactored how the liveness
heartbeat codepath dealt with missing liveness records: it knew to fetch
it from KV given we were now maintaining the invariant that it would
always be present. Except that wasn't necessarily true, as demonstrated
by the following scenario:

```
// - v20.1 node gets added to v20.1 cluster, and is quickly removed
//   before being able to persist its liveness record.
// - The cluster is upgraded to v20.2.
// - The node from earlier is rolled into v20.2, and re-added to the
//   cluster.
// - It's never able to successfully heartbeat (it didn't join
//   through the join rpc, bootstrap, or gossip). Welp.
```
Though admittedly unlikely, we should handle it all the same instead of
simply erroring out. We'll just fall back to creating the liveness
record in-place as we did in v20.1 code. We can remove this fallback in
21.1 code.

Release note: None
  • Loading branch information
irfansharif committed Sep 11, 2020
1 parent db627d7 commit b394338
Showing 1 changed file with 46 additions and 16 deletions.
62 changes: 46 additions & 16 deletions pkg/kv/kvserver/node_liveness.go
Original file line number Diff line number Diff line change
Expand Up @@ -793,7 +793,12 @@ func (nl *NodeLiveness) heartbeatInternal(
}
}

if oldLiveness == (kvserverpb.Liveness{}) {
// Let's compute what our new liveness record should be.
var newLiveness kvserverpb.Liveness
if oldLiveness != (kvserverpb.Liveness{}) {
// Start off with our existing view of liveness.
newLiveness = oldLiveness
} else {
// We don't yet know about our own liveness record (which does exist, we
// maintain the invariant that there's always a liveness record for
// every given node). Let's retrieve it from KV before proceeding.
Expand All @@ -804,25 +809,50 @@ func (nl *NodeLiveness) heartbeatInternal(
if err != nil {
return errors.Wrap(err, "unable to get liveness")
}
if kv.Value == nil {
return ErrNoLivenessRecord
}
if err := kv.Value.GetProto(&oldLiveness); err != nil {
return errors.Wrap(err, "invalid liveness record")
}

oldLivenessRec := LivenessRecord{
Liveness: oldLiveness,
raw: kv.Value.TagAndDataBytes(),
}
if kv.Value != nil {
// This is the happy path. Let's unpack the liveness record we found
// within KV, and use that to inform what our new liveness should
// be.
if err := kv.Value.GetProto(&oldLiveness); err != nil {
return errors.Wrap(err, "invalid liveness record")
}

// Offer it to make sure that when we actually try to update the
// liveness, the previous view is correct.
nl.maybeUpdate(oldLivenessRec)
oldLivenessRec := LivenessRecord{
Liveness: oldLiveness,
raw: kv.Value.TagAndDataBytes(),
}

// Update our cache with the liveness record we just found.
nl.maybeUpdate(oldLivenessRec)

newLiveness = oldLiveness
} else {
// This is a "should basically never happen" scenario given our
// invariant around always persisting liveness records on node
// startup. But that was a change we added in 20.2. Though unlikely,
// it's possible to get into the following scenario:
//
// - v20.1 node gets added to v20.1 cluster, and is quickly removed
// before being able to persist its liveness record.
// - The cluster is upgraded to v20.2.
// - The node from earlier is rolled into v20.2, and re-added to the
// cluster.
// - It's never able to successfully heartbeat (it didn't join
// through the join rpc, bootstrap, or gossip). Welp.
//
// Given this possibility, we'll just fall back to creating the
// liveness record here as we did in v20.1 code.
//
// TODO(irfansharif): Remove this once v20.2 is cut.
log.Warningf(ctx, "missing liveness record for n%d; falling back to creating it in-place", nodeID)
newLiveness = kvserverpb.Liveness{
NodeID: nodeID,
Epoch: 0, // incremented to epoch=1 below as needed
}
}
}

// Let's compute what our new liveness record should be.
newLiveness := oldLiveness
if incrementEpoch {
newLiveness.Epoch++
newLiveness.Draining = false // clear draining field
Expand Down

0 comments on commit b394338

Please sign in to comment.