release-20.2: server: always create a liveness record before starting up #54212

irfansharif · 2020-09-10T18:56:40Z

Backport:

1/1 commits from "server: always create a liveness record before starting up" (server: always create a liveness record before starting up #53842)
1/1 commits from "localtestcluster: re-order setting of gossip descriptor" (localtestcluster: re-order setting of gossip descriptor #54224)
1/1 commits from "kvserver: address migration concern with node liveness" (kvserver: address migration concern with node liveness #54216)

Please see individual PRs for details.

/cc @cockroachdb/release

irfansharif · 2020-09-10T21:03:39Z

I'll wait for a bit to let #53842 bake on master. I'll also wait for #54216 to land, and I'll group that here as well.

Previously, it was possible for a node to be up and running with no corresponding liveness record. This was a very transient situation, as a liveness record is created for a node as soon as it sends out its first heartbeat. Still, given that this could take a few seconds, it led to a lot of complexity in our handling of node liveness, where we always had to anticipate the possibility of there being no liveness record for a given node (and thus create it if necessary).

Having a liveness record always present for each node is a crucial building block for long running migrations (cockroachdb#48843). There the intention is to have the orchestrator process look to the list of liveness records for an authoritative view of cluster membership. Previously, when it was possible for an active member of the cluster to not have a corresponding liveness record (no matter how unlikely or short-lived in practice), we could not generate such a view.

---

This is an alternative implementation for cockroachdb#53805. Here we choose to manually write the liveness record for the bootstrapping node when writing initial cluster data. For all other nodes, we do it on the server-side of the join RPC. We're also careful to do it in the legacy codepath when joining a cluster through gossip.

Release note: None
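To make the invariant concrete, here is a minimal sketch of the idea, not the actual CockroachDB code. It assumes a hypothetical in-memory `LivenessStore` standing in for the KV-backed liveness table; `NodeID`, `LivenessRecord`, `Bootstrap`, `Join`, and `Members` are illustrative names only. The point it shows is that the record is written at the same moment a node becomes a member, so listing the records yields the full membership.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeID and LivenessRecord are simplified stand-ins, not the real types.
type NodeID int

type LivenessRecord struct {
	NodeID NodeID
}

// LivenessStore is a hypothetical in-memory substitute for the KV-backed
// liveness records.
type LivenessStore struct {
	records map[NodeID]LivenessRecord
	nextID  NodeID
}

func NewLivenessStore() *LivenessStore {
	return &LivenessStore{records: map[NodeID]LivenessRecord{}, nextID: 1}
}

// allocate hands out a node ID and writes its liveness record in the same
// step, so there is no window where the node exists without a record.
func (s *LivenessStore) allocate() NodeID {
	id := s.nextID
	s.nextID++
	if _, ok := s.records[id]; !ok {
		s.records[id] = LivenessRecord{NodeID: id}
	}
	return id
}

// Bootstrap models writing initial cluster data: the bootstrapping node's
// record is written along with it.
func (s *LivenessStore) Bootstrap() NodeID { return s.allocate() }

// Join models the server side of a join RPC: the record is written before
// the new node ID is returned to the joining node.
func (s *LivenessStore) Join() NodeID { return s.allocate() }

// Members derives cluster membership purely from the liveness records.
func (s *LivenessStore) Members() []NodeID {
	ids := make([]NodeID, 0, len(s.records))
	for id := range s.records {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	s := NewLivenessStore()
	s.Bootstrap()
	s.Join()
	s.Join()
	// No node has heartbeated yet, but every member is already listed.
	fmt.Println(s.Members())
}
```

In this sketch no heartbeat is needed before a node shows up in `Members`, which is the property an orchestrator would rely on.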

The heartbeat loop depends on gossip to retrieve the node ID. When stressing a few tests that make use of LocalTestCluster, I was seeing empty liveness records for empty node IDs being heartbeated. Re-ordering things so that the gossip descriptor is set first brings it closer to the Server initialization ordering.

Release note: None

In cockroachdb#53842 we introduced a change to always persist a liveness record on start up. As part of that change, we refactored how the liveness heartbeat codepath dealt with missing liveness records: it knew to fetch it from KV given we were now maintaining the invariant that it would always be present. Except that wasn't necessarily true, as demonstrated by the following scenario:

```
// - v20.1 node gets added to v20.1 cluster, and is quickly removed
//   before being able to persist its liveness record.
// - The cluster is upgraded to v20.2.
// - The node from earlier is rolled into v20.2, and re-added to the
//   cluster.
// - It's never able to successfully heartbeat (it didn't join
//   through the join rpc, bootstrap, or gossip). Welp.
```

Though admittedly unlikely, we should handle it all the same instead of simply erroring out. We'll just fall back to creating the liveness record in-place as we did in v20.1 code. We can remove this fallback in 21.1 code.

Release note: None
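The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in `node_liveness.go`: `kvStore`, `heartbeat`, `errNotFound`, and the `ttl` constant are hypothetical names, and a plain map stands in for KV.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeID and Liveness are simplified stand-ins, not the real types.
type NodeID int

type Liveness struct {
	NodeID     NodeID
	Expiration int64
}

// ttl is an arbitrary illustrative liveness duration.
const ttl = 9

var errNotFound = errors.New("liveness record not found")

// kvStore is a hypothetical in-memory substitute for KV.
type kvStore map[NodeID]Liveness

func (kv kvStore) get(id NodeID) (Liveness, error) {
	l, ok := kv[id]
	if !ok {
		return Liveness{}, errNotFound
	}
	return l, nil
}

// heartbeat extends the liveness record for id. If the record is not in the
// local cache it is fetched from KV; if KV has no record either (the
// upgrade scenario above), it is created in place instead of returning an
// error.
func heartbeat(kv kvStore, cache map[NodeID]Liveness, id NodeID, now int64) (Liveness, error) {
	l, ok := cache[id]
	if !ok {
		var err error
		l, err = kv.get(id)
		if errors.Is(err, errNotFound) {
			// Fallback: behave like the older code and create the record.
			l = Liveness{NodeID: id}
		} else if err != nil {
			return Liveness{}, err
		}
	}
	l.Expiration = now + ttl
	kv[id] = l
	cache[id] = l
	return l, nil
}

func main() {
	kv := kvStore{}
	cache := map[NodeID]Liveness{}
	// Node 7 never had a record persisted, yet its heartbeat succeeds.
	l, err := heartbeat(kv, cache, 7, 100)
	fmt.Println(l, err)
}
```

Only the missing-record case takes the fallback; any other lookup error is still returned to the caller.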

irfansharif · 2020-09-11T15:44:54Z

> I'll wait for a bit to let #53842 bake on master. I'll also wait for #54216 to land, and I'll group that here as well.

Done. @tbg is on vacation through this next week, and I'm fine waiting for this backport to land, but we do want this to be included in 20.2.

Release justification: low risk, high benefit changes to existing functionality

nvanbenschoten

These changes were cleaner than I was expecting and get us to a much better place with respect to the invariants around node liveness records. Nice job.

Reviewed 10 of 10 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)

pkg/kv/kvserver/node_liveness.go, line 804 at r3 (raw file):

		// We don't yet know about our own liveness record (which does exist, we
		// maintain the invariant that there's always a liveness record for
		// every given node). Let's retrieve it from KV before proceeding.

This sentence is a little off, given the workaround you had to add in commit 3. It's probably not worth fixing though.

irfansharif · 2020-09-15T19:44:00Z

I'll send a patch for master, thanks for looking!

irfansharif requested a review from tbg September 10, 2020 18:56

irfansharif added the backport-20.2.x label Sep 10, 2020

irfansharif mentioned this pull request Sep 10, 2020

20.2 Release Backports Request List #53662

irfansharif added 3 commits September 11, 2020 11:40

irfansharif force-pushed the backport20.2-53842 branch from 1160c28 to 683f713 Compare September 11, 2020 15:41

irfansharif requested a review from nvanbenschoten September 15, 2020 16:37

nvanbenschoten approved these changes Sep 15, 2020

irfansharif merged commit 4f0d427 into cockroachdb:release-20.2 Sep 15, 2020

irfansharif deleted the backport20.2-53842 branch September 15, 2020 19:44

rafiss added this to the 20.2 milestone Apr 22, 2021
