Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.2: server: always create a liveness record before starting up #54212

Merged
merged 3 commits into from
Sep 15, 2020

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Sep 10, 2020

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif
Copy link
Contributor Author

I'll wait for a bit to let #53842 bake on master. I'll also wait for #54216 to land, and I'll group that here as well.

Previously it used to be the case that it was possible for a node to be
up and running, and for there to be no corresponding liveness record for
it. This was a very transient situation as liveness records are created
for a given node as soon as it out its first heartbeat. Still, given
that this could take a few seconds, it lent to a lot of complexity in
our handling of node liveness where we had to always anticipate the
possibility of there being no corresponding liveness record for a given
node (and thus creating it if necessary).

Having a liveness record for each node always present is a crucial
building block for long running migrations (cockroachdb#48843). There the intention
is to have the orchestrator process look towards the list of liveness
records for an authoritative view of cluster membership. Previously when
it was possible for an active member of the cluster to not have a
corresponding liveness record (no matter how unlikely or short-lived in
practice), we could not generate such a view.

---

This is an alternative implementation for cockroachdb#53805. Here we choose to
manually write the liveness record for the bootstrapping node when
writing initial cluster data. For all other nodes, we do it on the
server-side of the join RPC. We're also careful to do it in the legacy
codepath when joining a cluster through gossip.

Release note: None
The heartbeat loop depends on gossip to retrieve the node ID. When
stressing a few tests that make use of LocalTestCluster, I was seeing
empty liveness records for empty node IDs being heartbeated. By
re-ordering things as such we bring it closer to the Server
initialization ordering.

Release note: None
In cockroachdb#53842 we introduced a change to always persist a liveness record on
start up. As part of that change, we refactored how the liveness
heartbeat codepath dealt with missing liveness records: it knew to fetch
it from KV given we were now maintaining the invariant that it would
always be present. Except that wasn't necessarily true, as demonstrated
by the following scenario:

```
// - v20.1 node gets added to v20.1 cluster, and is quickly removed
//   before being able to persist its liveness record.
// - The cluster is upgraded to v20.2.
// - The node from earlier is rolled into v20.2, and re-added to the
//   cluster.
// - It's never able to successfully heartbeat (it didn't join
//   through the join rpc, bootstrap, or gossip). Welp.
```
Though admittedly unlikely, we should handle it all the same instead of
simply erroring out. We'll just fall back to creating the liveness
record in-place as we did in v20.1 code. We can remove this fallback in
21.1 code.

Release note: None
@irfansharif
Copy link
Contributor Author

I'll wait for a bit to let #53842 bake on master. I'll also wait for #54216 to land, and I'll group that here as well.

Done. @tbg is on vacation through this next week, and I'm fine waiting for this backport to land, but we do want this to be included in 20.2.

Release justification: low risk, high benefit changes to existing functionality

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: these changes were cleaner than I was expecting and get us to a much better place with respect to the invariants around node liveness records. Nice job.

Reviewed 10 of 10 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/kv/kvserver/node_liveness.go, line 804 at r3 (raw file):

		// We don't yet know about our own liveness record (which does exist, we
		// maintain the invariant that there's always a liveness record for
		// every given node). Let's retrieve it from KV before proceeding.

This sentence is a little off, given the workaround you had to add in commit 3. It's probably not worth fixing though.

@irfansharif
Copy link
Contributor Author

I'll send a patch for master, thanks for looking!

@irfansharif irfansharif merged commit 4f0d427 into cockroachdb:release-20.2 Sep 15, 2020
@irfansharif irfansharif deleted the backport20.2-53842 branch September 15, 2020 19:44
@rafiss rafiss added this to the 20.2 milestone Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants