Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migration: add scaffolding for the migrations manager #56368

Merged
merged 3 commits into from
Nov 11, 2020

Conversation

irfansharif
Copy link
Contributor

The commits in this PR were pulled out of our original prototype of the
migrations manager (#56107). These are the "boring bits", and simply introduces
the scaffolding for the manager without materially hooking it up to anything.
It will be fleshed out in future PRs, following the direction set by our
prototype above. For more details, see the RFC (#48843).

    sql: add scaffolding for version upgrade hook

    This callback will be called after validating a `SET CLUSTER SETTING
    version` but before executing it. It will be used in future PRs to
    execute arbitrary migrations to allow us to eventually remove code to
    support legacy behavior.

    This diff was pulled out of the long-running migrations prototype
    (#56107). For more details, see the RFC (#48843).
    migration: introduce pkg/migrations

    Package migration captures the facilities needed to define and execute
    migrations for a crdb cluster. These migrations can be arbitrarily long
    running, are free to send out arbitrary requests cluster wide, change
    internal DB state, and much more. They're typically reserved for crdb
    internal operations and state. Each migration is idempotent in nature,
    is associated with a specific cluster version, and executed when the
    cluster version is made activate on every node in the cluster.

    Examples of migrations that apply would be migrations to move all raft
    state from one storage engine to another, or purging all usage of the
    replicated truncated state in KV. A "sister" package of interest is
    pkg/sqlmigrations.

    ---

    This commit only introduces the basic scaffolding and wiring from
    existing functionality. We'll flesh in the missing bits in future
    commits.
    migration: plumb in a dialer, executor, kv.DB, and liveness

    The migration manager will need all of the above in order to execute
    migrations. It'll need:
    - A `Dialer`, to send out migration RPCs to individual nodes in the
      cluster.
    - A handle on `Liveness`, to figure out which nodes are part of the
      cluster
    - An `Executor`, for migrations to inspect/set internal SQL state, and
      to log progress into a dedicated system table
    - A `kv.DB`, for migrations to inspect/set internal KV state, and to
      send out Migrate requests to ranges for execute below-Raft migrations

    For more details, see the RFC (#48843). The fully "fleshed" out version
    of this manager was originally prototyped in #56107. This PR is simply
    pulling out the boring bits from there to move things along.

This callback will be called after validating a `SET CLUSTER SETTING
version` but before executing it. It will be used in future PRs to
execute arbitrary migrations to allow us to eventually remove code to
support legacy behavior.

This diff was pulled out of the long-running migrations prototype
(cockroachdb#56107). For more details, see the RFC (cockroachdb#48843).

Release note: None
@irfansharif irfansharif requested a review from a team as a code owner November 6, 2020 16:39
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Package migration captures the facilities needed to define and execute
migrations for a crdb cluster. These migrations can be arbitrarily long
running, are free to send out arbitrary requests cluster wide, change
internal DB state, and much more. They're typically reserved for crdb
internal operations and state. Each migration is idempotent in nature,
is associated with a specific cluster version, and executed when the
cluster version is made activate on every node in the cluster.

Examples of migrations that apply would be migrations to move all raft
state from one storage engine to another, or purging all usage of the
replicated truncated state in KV. A "sister" package of interest is
pkg/sqlmigrations.

---

This commit only introduces the basic scaffolding and wiring from
existing functionality. We'll flesh in the missing bits in future
commits.

Release note: None
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, 5 of 5 files at r2, 3 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)


pkg/migration/manager.go, line 46 at r2 (raw file):

// Registry. Be sure to key it in with the new cluster version we just added.
// During cluster upgrades, once the operator is able to set a cluster version
// setting that's past the version that was introduced (typically the major

at or past, right? If I attach a migration to 21.1.0-185, it will be run at the step going from 21.1.0-184 to 21.1.0-185. If I introduce VersionRemoveLegacyTombstone, I want to have removed legacy tombstones when that version is active.


pkg/migration/manager.go, line 79 at r2 (raw file):

	// TODO(irfansharif): Should we inject every ctx here with specific labels
	// for each migration, so they log distinctly? Do we need an AmbientContext?
	_ = logtags.AddTag(ctx, "migration-mgr", nil)

I like the idea of annotating the context with the current work item. I think no ambient context is needed, the caller can pass a populated context already.


pkg/migration/manager.go, line 111 at r2 (raw file):

		// TODO(irfansharif): We'll also want a testing override here to be able
		// to stub out migrations as needed.
		_ = Registry[version]

I think the most canonical way to get this is to return &Manager{registry: Registry} in NewManager and in tests, you simply put another registry in. No need for knobs at all.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

I really like the idea of populating intermediate commits with API Stubs and "TODO" comments. This makes the review much easier. I will reuse your idea.

Reviewed 3 of 3 files at r1, 4 of 5 files at r2, 3 of 3 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

The migration manager will need all of the above in order to execute
migrations. It'll need:
- A `Dialer`, to send out migration RPCs to individual nodes in the
  cluster.
- A handle on `Liveness`, to figure out which nodes are part of the
  cluster
- An `Executor`, for migrations to inspect/set internal SQL state, and
  to log progress into a dedicated system table
- A `kv.DB`, for migrations to inspect/set internal KV state, and to
  send out Migrate requests to ranges for execute below-Raft migrations

For more details, see the RFC (cockroachdb#48843). The fully "fleshed" out version
of this manager was originally prototyped in cockroachdb#56107. This PR is simply
pulling out the boring bits from there to move things along.

Release note: None
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will reuse your idea.

Heh, I remember getting this idea from you during a 1:1. Thanks for the reviews!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @tbg)


pkg/migration/manager.go, line 46 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

at or past, right? If I attach a migration to 21.1.0-185, it will be run at the step going from 21.1.0-184 to 21.1.0-185. If I introduce VersionRemoveLegacyTombstone, I want to have removed legacy tombstones when that version is active.

yep, here the thing I was getting at was that the user would only be able to set major versions as the active cluster version (so 21.1.0 for e.g.), and not the intermediate versions directly. I still need to do some work to properly introduce these fence versions, and the semantics of wanting to have removed legacy tombstones once VersionRemoveLegacyTombstone is active is also what I'm aiming to go for. TODO.

@craig
Copy link
Contributor

craig bot commented Nov 11, 2020

Build succeeded:

@craig craig bot merged commit 85a5db5 into cockroachdb:master Nov 11, 2020
@irfansharif irfansharif deleted the 201106.migration-scaffolding branch November 11, 2020 02:00
craig bot pushed a commit that referenced this pull request Nov 11, 2020
56476: server: introduce the `Migration` service r=irfansharif a=irfansharif

The upcoming migration manager (prototyped in #56107) will want to
execute a few known RPCs on every node in the cluster. Part of being the
"migration infrastructure", we also want authors of individual
migrations to be able to define arbitrary node-level operations to
execute on each node in the system.

To this end we introduce a `Migration` service, and populate it with the
two known RPCs the migration manager will want to depend on:
- ValidateTargetClusterVersion: used to verify that the target node is
  running a binary that's able to support the given cluster version.
- BumpClusterVersion: used to inform the target node about a (validated)
  cluster version bump.

Both these RPCs are not currently wired up to anything, and
BumpClusterVersion will be fleshed out just a tiny bit further in a
future PR, but they'll both be used to propagate cluster version bumps
across the crdb cluster through direct RPCs, supplanting our existing
gossip based distribution mechanism. This will let the migration manager
bump version gates in a more controlled fashion. See #56107 for what
that will end up looking like, and see the long-running migrations RFC
(#48843) for the motivation.

Like we mentioned earlier, we expect this service to pick up more RPCs
over time to service specific migrations.

Release note: None

---

Ignore the first four commits. They're from #56368 and #56474  

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
craig bot pushed a commit that referenced this pull request Nov 11, 2020
56480: settings,migration: disconnect cluster version from gossip r=irfansharif a=irfansharif

...in favor of direct RPCs to all nodes in the cluster. It uses the
building blocks we've added thus far to replace the use of gossip to
disseminate the cluster version. It does so by sending out individual
RPCs to each node in the cluster, informing them of a version bump, all
the while retaining the same guarantees provided by our (now previously)
gossip-backed mechanism. This is another in the series of PRs to
introduce long running migrations (#48843), pulled out of our original
prototype in #56107.

This diff has the following "pieces":
- We disconnect the version setting updates through gossip (by
  disconnecting the setting type within the updater process)
- We use the `Migration` service to send out RPCs to each node in the
  cluster, containing the payload that each node would otherwise receive
  through gossip. We do this by first introducing two primitives in
  pkg/migrations:
  - `RequiredNodes` retrieves a list of all nodes that are part of the
    cluster. It's powered by `pkg/../liveness`.
  - `EveryNode` is a shorthand that allows us to send out node-level
    migration RPCs to every node in the cluster.
  
  We combine these primitives with the RPCs introduced in #56476
  (`ValidateTargetClusterVersion`, `BumpClusterVersion`) to actually
  carry out the version bumps.
- We expand the `clusterversion.Handle` interface to allow setting the
  active version directly through it. We then make use of it in the
  implementation for `BumpClusterVersion`.
- Within `BumpClusterVersion`, we persists the cluster version received
  from the client node first, within `keys.StoreClusterVersionKey`, before
  bumping the version gate. This is a required invariant in the system
  in order for us to not regress our cluster version on restart. It was
  previously achieved by attaching callbacks on the version handle
  (`SetBeforeChange`).
- We no longer need the callbacks attached to gossip to persist cluster
  versions to disk. We're doing it as part of the `BumpClusterVersion`
  RPC. We remove them entirely.
- We use the active version provided by the join RPC to set the
  version setting directly (after having persisted it first). This too
  was previously achieved through gossip + the callback.

Release note: None

---

Only the last commit is of interest. All prior commits should be reviewed
across  #56476, #56474 and #56368.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants