settings,migration: disconnect cluster version from gossip #56480
Conversation
Force-pushed from 6b7ee4c to a099d7e.
Still a little bit of work left, but excitingly close, good job!
Reviewed 2 of 3 files at r1, 1 of 5 files at r2, 2 of 3 files at r3, 1 of 3 files at r4, 9 of 9 files at r5, 15 of 15 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/clusterversion/clusterversion.go, line 125 at r6 (raw file):
// If a version vX is active on a given server, upon restart, the version
// that is immediately active must be >= vX.
maybe mention that in practice it's equal to vX? As written one might wonder if it can spontaneously bump itself, which of course it cannot.
pkg/clusterversion/clusterversion.go, line 192 at r6 (raw file):
// called on cluster versions received from other nodes (where `SET CLUSTER
// SETTING version` was originally called). The stricter form of validation
// happens there. SetActiveVersion simply the cluster version bump that
SetActiveVersion is
pkg/migration/manager.go, line 104 at r6 (raw file):
}
// EveryNode invokes the given closure (named by the informational parameter
// `op`) across every node in the cluster.
I think we need to address the todo about the load/run/load thing or we regress, compared to gossip, if a node slips in at the wrong time.
There are still issues here. For the version bump, the join RPC gives us the condition we need: at some point, nodes joining the cluster have the right version already, so we only have to see the node list stabilize once, and then know that any future members will pick up the version via Join. For cleanup operations, it's not so clear cut. For example, if we're clearing out some data that can be created when the active version is below X, we can't clear it out in the migration for X (since that migration will run before X is guaranteed active anywhere). But I think that is fine; in such cases we "just" need to split the migration in two: one empty migration that gates using the "new way" to store the data (but the old one can still be around), and then the migration that cleans up.
Something to keep an eye on as we onboard the first migrations. Perhaps as a general rule, versions that attach a migration cannot mix with versions that gate code behavior? Let's revisit later.
For now, EveryNode seems like a great place to house a big comment on how this "every node" thing is achieved in the presence of randomly joining nodes.
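[Editor's note] To make the shape concrete, here's a minimal sketch of the load/run/load loop being discussed, with simplified types; everyNode, listNodes, and the int node IDs are illustrative stand-ins, not the PR's actual signatures:

package migration

import "context"

// everyNode runs the closure against the current node list, then
// re-loads the list and repeats until it stabilizes. Once it does, any
// later joiner picks up the cluster version through the join RPC
// instead. listNodes is a stand-in for the liveness-backed node listing.
func everyNode(
	ctx context.Context,
	listNodes func(context.Context) ([]int, error),
	fn func(context.Context, int) error,
) error {
	nodes, err := listNodes(ctx)
	if err != nil {
		return err
	}
	for {
		for _, id := range nodes {
			if err := fn(ctx, id); err != nil {
				return err
			}
		}
		// Did a node slip in while we were executing? If so, go again.
		curr, err := listNodes(ctx)
		if err != nil {
			return err
		}
		if identical(nodes, curr) {
			return nil
		}
		nodes = curr
	}
}

// identical reports whether two node ID lists are identical as sets.
func identical(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[int]int, len(a))
	for _, id := range a {
		counts[id]++
	}
	for _, id := range b {
		counts[id]--
	}
	for _, c := range counts {
		if c != 0 {
			return false
		}
	}
	return true
}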
pkg/migration/manager.go, line 175 at r6 (raw file):
// any, we'll want to assemble the list of remaining migrations to step
// through to get to targetV.
var vs = []roachpb.Version{targetV}
Once we have migrations, it's very important to heed these noop-bumps, so we will probably want to write code that groups adjacent noop bumps together. For example, if we have
X-1 noop
X-2 noop
X-3 migration
X-4 noop
X-5 noop
X-6 migration
We would see target versions X-2, X-3, X-5, X-6 (and recall that we need to ensure that migrations are never directly adjacent).
I just mention this because now that targetV is in this slice, there's room for us to mess this up by not revisiting once we actually do invoke migrations. Perhaps just augment the wording that we HAVE to fix this or things will be incorrect.
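[Editor's note] One possible shape for that grouping, as a sketch; version and collapseNoops are illustrative names, not from the PR:

package migration

// version is a simplified stand-in for roachpb.Version plus a flag for
// whether a migration is attached to the bump.
type version struct {
	name         string
	hasMigration bool
}

// collapseNoops keeps, out of each run of adjacent no-op bumps, only the
// last one (the one immediately preceding a migration, or the final
// target). For the X-1..X-6 example above this yields X-2, X-3, X-5, X-6.
func collapseNoops(vs []version) []version {
	var out []version
	for i, v := range vs {
		last := i == len(vs)-1
		if v.hasMigration || last || vs[i+1].hasMigration {
			out = append(out, v)
		}
	}
	return out
}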
pkg/migration/manager.go, line 185 at r6 (raw file):
// that every node in the cluster has the corresponding version
// activated.
{
Nice to see this come together.
pkg/server/init.go, line 303 at r6 (raw file):
log.Infof(ctx, "joined cluster %s through join rpc", state.clusterID)
log.Infof(ctx, "received node ID %d", state.nodeID)
log.Infof(ctx, "active cluster version: %s", state.clusterVersion)
is this now active though? Who makes it active?
pkg/server/server.go, line 1393 at r6 (raw file):
}
if state.clusterVersion != initialDiskClusterVersion {
Can we do this unconditionally, remove the initialization as a reaction to BootstrapReq, and also do the logging here? That way there are only two bumps: 1. initialize as the min supported version, 2. this code. Which seems nice.
pkg/server/version_cluster_test.go, line 433 at r6 (raw file):
}
// The other nodes are just as careful.
🙏 always love seeing a "wait for node to die" go away. @danhhz would be proud
pkg/settings/updater.go, line 92 at r6 (raw file):
// functionality for the version setting update process (not doing it
// through updater, as here, but instead using a dedicated RPC for it).
// Still, this is a bit unfortunate, and maybe merits reverting the type
yeah, let's revert that.
pkg/settings/updater.go, line 137 at r6 (raw file):
// setting, changes to which are propagated through direct RPCs to each
// node in the cluster instead of gossip. This is done using the
// EveryNode(op=AckClusterVersion) RPC.
it's the AckClusterVersion RPC now. (Or something like that).
pkg/settings/version.go, line 30 at r6 (raw file):
// TODO(irfansharif): If the cluster version is no longer backed by gossip,
// maybe we should stop pretending it's a regular gossip-backed cluster setting.
// We could introduce new syntax here to motivate this shift.
File an issue and assign it to the right SQL team (I never quite know which team takes these issues).
pkg/sql/show_cluster_setting.go, line 42 at r6 (raw file):
// For the version setting we show the value from the KV store and
// additionally wait for the local setting instance to have observed the
// value as well (gets updated through the `EveryNode(op=AckClusterVersion)`
RPC is now named differently
Force-pushed from a099d7e to 2d3712b.
PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @knz, and @tbg)
pkg/clusterversion/clusterversion.go, line 125 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
maybe mention that in practice it's equal to vX? As written one might wonder if it can spontaneously bump itself, which of course it cannot.
Done.
pkg/clusterversion/clusterversion.go, line 192 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
SetActiveVersion is
Done.
pkg/migration/manager.go, line 104 at r6 (raw file):
I think we need to address the todo about the load/run/load thing or we regress, compared to gossip, if a node slips in at the wrong time.
Good catch, done.
And yup, the migration split across multiple versions is what we had in mind as far back as #39182. If it's a common enough pattern we can introduce some "MultiStepMigration" abstraction here, and lean on these fence versions we'll need anyway.
For now, EveryNode seems like a great place to house a big comment on how this "every node" thing is achieved in the presence of randomly joining nodes.
Done.
pkg/migration/manager.go, line 175 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Once we have migrations, it's very important to heed these noop-bumps, so we will probably want to write code that groups adjacent noop bumps together. For example, if we have
X-1 noop
X-2 noop
X-3 migration
X-4 noop
X-5 noop
X-6 migration
We would see target versions X-2, X-3, X-5, X-6 (and recall that we need to ensure that migrations are never directly adjacent).
I just mention this because now that targetV is in this slice, there's room for us to mess this up by not revisiting once we actually do invoke migrations. Perhaps just augment the wording that we HAVE to fix this or things will be incorrect.
Yup, definitely need to do a bit of legwork to introduce these fence versions. Was just being lazy and deferring it to when we'll write our first real migration. Added a TODO.
pkg/server/init.go, line 303 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
is this now active though? Who makes it active?
I suppose it doesn't happen right here. It happens just after, at the caller. Amended the log message.
state, initialStart, err := initServer.ServeAndWait(ctx, s.stopper, &s.cfg.Settings.SV)
// ...
if state.clusterVersion != initialDiskClusterVersion {
// We just learned about a cluster version different from the one we
// found on/synthesized from disk. This indicates that we're either the
// bootstrapping node (and are using the binary version as the cluster
// version), or we're joining an existing cluster that just informed us
// to activate the given cluster version.
//
// Either way, we'll do so by first persisting the cluster version
// itself, and then informing the version setting about it (an invariant
// we must uphold whenever setting a new active version).
if err := kvserver.WriteClusterVersionToEngines(
ctx, s.engines, state.clusterVersion,
); err != nil {
return err
}
if err := s.ClusterSettings().Version.SetActiveVersion(ctx, state.clusterVersion); err != nil {
return err
}
}
pkg/server/server.go, line 1393 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can we do this unconditionally, remove the initialization as a reaction to BootstrapReq, and also do the logging here? That way there are only two bumps: 1. initialize as the min supported version, 2. this code. Which seems nice.
I like it. Added a TODO for now, will get to it after I'm back.
pkg/settings/updater.go, line 137 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
it's the AckClusterVersion RPC now. (Or something like that).
Done.
pkg/settings/version.go, line 30 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
File an issue and assign it to the right SQL team (I never quite know which team takes these issues).
Done. #56542.
pkg/sql/show_cluster_setting.go, line 42 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
RPC is now named differently
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @knz, and @tbg)
pkg/settings/updater.go, line 92 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
yeah, let's revert that.
Reviewed 13 of 13 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
pkg/clusterversion/clusterversion.go, line 126 at r7 (raw file):
// If a version vX is active on a given server, upon restart, the version
// that is immediately active must be >= vX (in practice it'll almost
// always be vX).
almost? Not always?
pkg/migration/manager.go, line 115 at r7 (raw file):
// By the time EveryNode returns, we'll have thus invoked the closure against
// every node in the cluster.
... though right after that moment, a new node may join. This means that some migrations may have to be split up into two version bumps: one that phases out the old version (i.e. stops creation of stale data or behavior) and a clean-up version, which removes any vestiges of the stale data/behavior, and which, when active, ensures that the old data has vanished from the system. This is similar in spirit to how schema changes are split up into multiple smaller steps that are carried out sequentially.
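[Editor's note] A sketch of what that two-bump pattern could look like as data; all names here are illustrative, not from the PR:

package migration

// deleteStaleData is a stand-in for the real cleanup work.
func deleteStaleData() error { return nil }

// step pairs a version gate with optional migration work that runs once
// the gate is active on every node in the cluster.
type step struct {
	name    string
	migrate func() error // nil for a pure gate with no attached work
}

var steps = []step{
	// Bump one: once active everywhere, nodes stop creating the stale
	// data/behavior. No migration attached.
	{name: "phase-out-stale-data"},
	// Bump two: runs only after bump one is active cluster-wide, so the
	// cleanup cannot race with new stale data being created.
	{name: "cleanup-stale-data", migrate: deleteStaleData},
}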
pkg/migration/util.go, line 18 at r7 (raw file):
)
// identical returns whether or not two lists of node IDs are identical.
nit: "are identical as sets".
Force-pushed from 2d3712b to cea6cad.
Thanks!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/clusterversion/clusterversion.go, line 126 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
almost? Not always?
I was just being particular about the fact that we're not persisting and setting the active version atomically; we do one before the other. There's a split second in between where we could crash, and we would restart with vX+1.
pkg/migration/manager.go, line 115 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
... though right after that moment, a new node may join. This means that some migrations may have to be split up into two version bumps: one that phases out the old version (i.e. stops creation of stale data or behavior) and a clean-up version, which removes any vestiges of the stale data/behavior, and which, when active, ensures that the old data has vanished from the system. This is similar in spirit to how schema changes are split up into multiple smaller steps that are carried out sequentially.
Done.
Build succeeded:
56546: settings: patch backwards incompatible short type identifier change r=irfansharif a=irfansharif
We introduced a regression in #55994. In mixed-version clusters, when the 21.1 node attempts to refresh settings, it expects to find a type "v" for the version setting, but finds "m" already present in 20.2. We revert the bits of #55994 that introduced this regression.
Release note (sql, cli change): In an earlier commit (3edd70b, not part of any release, yet) we introduced a regression by representing the shortened form of the cluster version setting's type as "v", from an earlier "m". It's now back to what it was. Specifically:
- The `setting_type` column for `version` in `SHOW CLUSTER SETTINGS` will now show an "m" instead of a "v"
- The `valueType` column for `version` in `system.settings` will now show an "m" instead of a "v"
---
First commit is from #56480.
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
...in favor of direct RPCs to all nodes in the cluster. It uses the
building blocks we've added thus far to replace the use of gossip to
disseminate the cluster version. It does so by sending out individual
RPCs to each node in the cluster, informing them of a version bump, all
the while retaining the same guarantees provided by our (now previously)
gossip-backed mechanism. This is another in the series of PRs to
introduce long running migrations (#48843), pulled out of our original
prototype in #56107.
This diff has the following "pieces":
- We disconnect the version setting updates through gossip (by
  disconnecting the setting type within the updater process).
- We use the `Migration` service to send out RPCs to each node in the
  cluster, containing the payload that each node would otherwise receive
  through gossip. We do this by first introducing two primitives in
  pkg/migrations:
  - `RequiredNodes` retrieves a list of all nodes that are part of the
    cluster. It's powered by `pkg/../liveness`.
  - `EveryNode` is a shorthand that allows us to send out node-level
    migration RPCs to every node in the cluster.
- We combine these primitives with the RPCs introduced in #56476
  (`ValidateTargetClusterVersion`, `BumpClusterVersion`) to actually
  carry out the version bumps.
- We expand the `clusterversion.Handle` interface to allow setting the
  active version directly through it. We then make use of it in the
  implementation for `BumpClusterVersion`.
- Within `BumpClusterVersion`, we persist the cluster version received
  from the client node first, within `keys.StoreClusterVersionKey`,
  before bumping the version gate. This is a required invariant in the
  system in order for us to not regress our cluster version on restart.
  It was previously achieved by attaching callbacks on the version
  handle (`SetBeforeChange`).
- We no longer need the callbacks attached to gossip to persist cluster
  versions to disk. We're doing it as part of the `BumpClusterVersion`
  RPC. We remove them entirely.
- We use the active version provided by the join RPC to set the
  version setting directly (after having persisted it first). This too
  was previously achieved through gossip + the callback.
Release note: None
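[Editor's note] To tie the pieces together, a hedged sketch of how a single version step might compose EveryNode with the two RPCs from #56476; ValidateTargetClusterVersion and BumpClusterVersion are the PR's RPC names, while migrationClient, bumpTo, and the dial plumbing are assumptions for illustration:

package migration

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// migrationClient is a hypothetical stand-in for a client of the
// Migration service introduced in #56476.
type migrationClient interface {
	ValidateTargetClusterVersion(context.Context, clusterversion.ClusterVersion) error
	BumpClusterVersion(context.Context, clusterversion.ClusterVersion) error
}

// bumpTo sketches a single version step: first confirm every node runs a
// binary that can accept cv, then persist+activate cv on every node.
// everyNode and dial are assumed plumbing, not the PR's signatures.
func bumpTo(
	ctx context.Context,
	cv clusterversion.ClusterVersion,
	everyNode func(context.Context, string, func(context.Context, roachpb.NodeID) error) error,
	dial func(roachpb.NodeID) migrationClient,
) error {
	if err := everyNode(ctx, "validate-cluster-version", func(ctx context.Context, id roachpb.NodeID) error {
		return dial(id).ValidateTargetClusterVersion(ctx, cv)
	}); err != nil {
		return err
	}
	return everyNode(ctx, "bump-cluster-version", func(ctx context.Context, id roachpb.NodeID) error {
		return dial(id).BumpClusterVersion(ctx, cv)
	})
}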
Only the last commit is of interest. All prior commits should be reviewed
across #56476, #56474 and #56368.