
[WIP] migrations: introduce new system which supports long-running migrations #44249

Closed
danhhz wants to merge 2 commits

Conversation

danhhz (Contributor) commented on Jan 23, 2020

This subsumes the cluster version cluster setting. There is a large
overlap with pkg/sqlmigrations, which should be cleaned up in the
future.

Closes #39182

Release note (<category, see below>):

danhhz requested a review from tbg on January 23, 2020 at 00:39
cockroach-teamcity (Member) commented: This change is Reviewable

danhhz (Contributor, Author) commented on Jan 23, 2020

@tbg this is a very early draft and mostly unfinished, but I wanted to get it out in time to get your initial feedback tomorrow

tbg (Member) left a comment


Reviewed 14 of 14 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/migration/migration.go, line 52 at r1 (raw file):

}

func init() {

We've amassed a larger list of migrations that fit this bill, right? Personally, I think the most useful thing would be to see them all stubbed out here, to have a good sample of the kinds of things we're expecting the migrations to need to do. The separate raft log engine migration, for example, or some migration that uses #42822, and whatever other ones we've mentioned.
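
For a concrete picture, here is a rough Go sketch of what such stubbed-out registrations could look like. Every name in it (Helper, register, migrationFn, the migration strings) is hypothetical and not the API in this draft; it only illustrates the shape of "a named list of long-running closures registered in init()".

package migration

import "context"

// Helper stands in for whatever the framework hands each migration (KV
// access, node liveness, job hooks, ...); purely illustrative here.
type Helper struct{}

type migrationFn func(ctx context.Context, h *Helper) error

var registry = map[string]migrationFn{}

func register(name string, fn migrationFn) { registry[name] = fn }

func init() {
	// Stubs for the kinds of long-running migrations discussed above.
	register("separate raft log engine", func(ctx context.Context, h *Helper) error {
		// TODO: move the raft log into its own engine on every store.
		return nil
	})
	register("cleanup built on #42822", func(ctx context.Context, h *Helper) error {
		// TODO: a migration that uses the mechanism from #42822.
		return nil
	})
}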


pkg/migration/connect/connect.go, line 30 at r1 (raw file):

	// (filtering out self, etc). On one hand it'd be nice to keep this from
	// depending on gossip (so the response could be _used_ by gossip), but it
	// would be unfortunate to duplicate this initial work.

Yeah, the joining process is quite complex (think about resolvers, etc.); that was also my main worry over in #32574.
It's worth figuring out whether we're allowed to simplify the semantics. Such as: a node that starts up must be able to reach another initialized (or soon-to-be-initialized) node via its join flags. For example, it would be illegal to start a new node with only bogus join flags and to rely on it being mentioned in other nodes' join flags (which could then reach out to it).
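
As a sketch of those simplified semantics (hypothetical code, not part of this PR, and joinCluster/dial are made-up names): the new node keeps retrying its own --join targets until one of them is initialized, and gives up rather than waiting to be contacted.

package connect

import (
	"context"
	"errors"
	"time"
)

// joinCluster retries the node's own --join targets until one of them is
// initialized; it never relies on being mentioned in other nodes' --join
// flags.
func joinCluster(
	ctx context.Context,
	joinAddrs []string,
	dial func(addr string) (initialized bool, err error),
) error {
	for {
		for _, addr := range joinAddrs {
			if ok, err := dial(addr); err == nil && ok {
				return nil // reached an initialized peer; proceed with Connect
			}
		}
		select {
		case <-ctx.Done():
			return errors.New("could not reach an initialized node via --join")
		case <-time.After(time.Second):
			// Retry: a peer listed in --join may become initialized soon.
		}
	}
}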


pkg/migration/everynode/everynode.go, line 72 at r1 (raw file):

// RunHookOnEveryNode executes the requested hook closure on each node in the
// cluster and each node added to the cluster in the future. If no error is
// returned, the caller is guaranteed that no node will ever again serve SQL or

Make this stronger: "connect to the cluster".

Lots of stuff happens before "serving SQL or KV traffic". We want to preempt all of the moving parts here.
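
A hedged sketch of that stronger guarantee, under an assumed shape (runHookOnEveryNode, persistRequirement, and runOnNode are made-up names, not this PR's API): the requirement is persisted first so that future Connect calls can be gated on it, and only then does the fan-out to current nodes happen.

package everynode

import "context"

// runHookOnEveryNode: once it returns nil, no node will ever again connect
// to the cluster without having run the hook, which fences off everything
// that happens before a node serves SQL or KV traffic.
func runHookOnEveryNode(
	ctx context.Context,
	persistRequirement func(context.Context) error, // durably mark the hook as required
	nodes []string, // addresses of all currently known nodes
	runOnNode func(context.Context, string) error, // ask one node to run the hook
) error {
	// 1. Record that the hook is required, so future Connect calls reject
	//    any node that has not acknowledged running it.
	if err := persistRequirement(ctx); err != nil {
		return err
	}
	// 2. Fan out to every current node and wait for each to report success.
	for _, n := range nodes {
		if err := runOnNode(ctx, n); err != nil {
			return err
		}
	}
	return nil
}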


pkg/migration/everynode/everynode.go, line 79 at r1 (raw file):

// RPC and returned a successful response.
//
// WIP how do we communicate to the user if this is blocking while a node is

I was hoping the job could have a good description of where it's at, but maybe that's not enough


pkg/migration/everynode/everynode.go, line 82 at r1 (raw file):

// down?
//
// WIP what about nodes that are decommissioned and later re-added? Is this even

It's "legal" today but won't be legal in the future. If we wanted to prevent this, we could add a NodeID graveyard against which incoming Connect calls are checked (init'ed nodes will put their NodeID into the RPC). As is, we'd be relying on the operator to just not bring decommissioned nodes back, which I agree we'll ultimately have to fix.
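
A minimal sketch of that graveyard check, assuming a hypothetical request shape (connectRequest and checkConnect are not this PR's names, and a real implementation would persist the graveyard rather than keep it in memory):

package connect

import "fmt"

// connectRequest: an already-initialized node puts its NodeID into the
// Connect RPC; brand-new nodes leave it zero.
type connectRequest struct {
	NodeID int32
}

// decommissioned is the "graveyard" of NodeIDs that must never rejoin.
var decommissioned = map[int32]bool{}

func checkConnect(req connectRequest) error {
	if req.NodeID != 0 && decommissioned[req.NodeID] {
		return fmt.Errorf("n%d was decommissioned and may not rejoin the cluster", req.NodeID)
	}
	return nil
}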


pkg/migration/fflag/fflag.go, line 39 at r1 (raw file):

//
// WIP how does this work for cluster init? separate method in this package?
func GetHandle(bootVersion ClusterVersion) *Handle {

NewHandle?


pkg/migration/fflag/fflag.go, line 47 at r1 (raw file):

However, a feature that is active once is guaranteed to be active for eternity.

Still true, right? Regarding the "feature flag" framing, are feature flags generally thought of as reversible? Because here they aren't. In general, as we release new functionality, as long as it's a true feature flag, this means we'd keep it out of.. these feature flags (but use a cluster setting or such instead)? I wonder if that could be confusing - the only feature flags that we have in the migrations are the non-reversible ones, but I'm not so sure I'd think of them as "feature flags" any more... it's more of a multi-frontier migration, where you can migrate into features individually.

I read https://martinfowler.com/articles/feature-toggles.html to gut-check my intuition and there it seems that a "feature flag" should be a simple switch that goes both ways. So maybe we should reconsider the naming here and stick to "migrations" naming while perhaps in the future adding a separate mechanism for true feature flags once we feel that is something we need (not today).


pkg/migration/fflag/fflag.go, line 97 at r1 (raw file):

		Version: roachpb.Version{Major: 19, Minor: 2, Unstable: 3},
	},
	{

Since we're here: it has always bothered me that this list grows in the wrong direction. It'd be easier if we reversed it; then the VersionKey and the new version would be next to each other in the code.


pkg/server/server.go, line 213 at r1 (raw file):

func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
	// WIP do we know way up here whether this node is bootstrapping the cluster
	// or not?

No, that happens only way down, when we .start the Node:

cockroach/pkg/server/server.go

Lines 1341 to 1368 in 46a38c9

	bootstrappedEngines, _, _, err := inspectEngines(
		ctx, s.engines,
		cluster.Version.BinaryMinSupportedVersion(s.cfg.Settings),
		cluster.Version.BinaryVersion(s.cfg.Settings),
		&s.rpcContext.ClusterID)
	if err != nil {
		return errors.Wrap(err, "inspecting engines")
	}

	// Filter the gossip bootstrap resolvers based on the listen and
	// advertise addresses.
	listenAddrU := util.NewUnresolvedAddr("tcp", s.cfg.Addr)
	advAddrU := util.NewUnresolvedAddr("tcp", s.cfg.AdvertiseAddr)
	advSQLAddrU := util.NewUnresolvedAddr("tcp", s.cfg.SQLAdvertiseAddr)
	filtered := s.cfg.FilterGossipBootstrapResolvers(ctx, listenAddrU, advAddrU)
	s.gossip.Start(advAddrU, filtered)
	log.Event(ctx, "started gossip")

	if s.cfg.DelayedBootstrapFn != nil {
		defer time.AfterFunc(30*time.Second, s.cfg.DelayedBootstrapFn).Stop()
	}

	var hlcUpperBoundExists bool
	// doBootstrap is set if we're the ones who bootstrapped the cluster.
	var doBootstrap bool
	if len(bootstrappedEngines) > 0 {
		// The cluster was already initialized.
		doBootstrap = false

There's no reason this couldn't be lifted, though. Note also that the logic will simplify after #44112: if we have no initialized engines, we go into init mode (i.e. wait for a bootstrap cmd or for one of our peers to become initialized). If there are init'ed engines, we're already set up. So no need to look at the join flags, though it raises the question of what it means to start a node without a join flag.

The natural sequencing here is to let Raphael's PR merge (quite a bit of work involved in that one, too). Then, pull the determination of whether to bootstrap out into something we can query before the call to Connect. If we're bootstrapped, send the NodeID in Connect (otherwise don't, and let Connect allocate a NodeID and StoreIDs for you). This actually boils down cleanly to "do #32574"? Adding the migration stuff you need should be straightforward after that.

By the way, I would be quite excited to have that happen. That work item has always gotten in the way of seriously thinking about cleaning up our server startup code.
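
Roughly, that pre-Connect determination could look like the following sketch (all names are hypothetical; inspect stands in for something like inspectEngines, and Engine is a placeholder for engine.Engine):

package server

import "context"

// Engine is a stand-in for engine.Engine; illustrative only.
type Engine interface{}

// decideBootstrap determines from the local engines, before calling Connect,
// whether this node is already part of an initialized cluster.
func decideBootstrap(
	ctx context.Context,
	engines []Engine,
	inspect func(ctx context.Context, engines []Engine) (initializedEngines int, nodeID int32, err error),
) (alreadyInitialized bool, nodeID int32, err error) {
	n, nodeID, err := inspect(ctx, engines)
	if err != nil {
		return false, 0, err
	}
	if n > 0 {
		// Initialized engines found: we're already set up, so send our
		// NodeID in Connect.
		return true, nodeID, nil
	}
	// No initialized engines: go into init mode (wait for a bootstrap
	// command or for a peer to become initialized) and let Connect allocate
	// a NodeID and StoreIDs for us.
	return false, 0, nil
}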


pkg/server/server.go, line 556 at r1 (raw file):

		txnMetrics, nil /* execCfg */, &s.rpcContext.ClusterID)

	// WIP is this too late in the boot process?

Yeah definitely, but also: what can be in a hook? You want to run this early (to avoid setting up all of those moving parts), so all you'll have is the slice of engine.Engines; you can't do KV, and you don't have your *Node or even your *Server set up.

danhhz (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/migration/fflag/fflag.go, line 47 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

However, a feature that is active once is guaranteed to be active for eternity.

Still true, right? Regarding the "feature flag" framing, are feature flags generally thought of as reversible? Because here they aren't. In general, as we release new functionality, as long as it's a true feature flag, this means we'd keep it out of.. these feature flags (but use a cluster setting or such instead)? I wonder if that could be confusing - the only feature flags that we have in the migrations are the non-reversible ones, but I'm not so sure I'd think of them as "feature flags" any more... it's more of a multi-frontier migration, where you can migrate into features individually.

I read https://martinfowler.com/articles/feature-toggles.html to gut-check my intuition and there it seems that a "feature flag" should be a simple switch that goes both ways. So maybe we should reconsider the naming here and stick to "migrations" naming while perhaps in the future adding a separate mechanism for true feature flags once we feel that is something we need (not today).

Makes sense; I just needed a new package name for dep reasons. Does something like "feature gate" feel less flip-floppy, or should I stay away from "feature" in the name entirely?


pkg/server/server.go, line 556 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yeah definitely, but also: what can be in a hook? You want to run this early (to avoid setting up all of those moving parts), so all you'll have is the slice of engine.Engines; you can't do KV, and you don't have your *Node or even your *Server set up.

Yeah, I think this should just be a slice of engine.Engines.
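
For illustration, a hedged sketch of what such an engines-only hook could look like (EngineHook, runEarlyHooks, and the Engine placeholder are all made up for this sketch): the hooks run right after the engines are opened, before the node connects to the cluster, so they get nothing but the engines.

package everynode

import "context"

// Engine is a stand-in for engine.Engine; illustrative only.
type Engine interface{}

// EngineHook runs early in startup: there is no KV, no *Node, and no *Server
// to lean on at that point, only the opened engines.
type EngineHook func(ctx context.Context, engines []Engine) error

// runEarlyHooks would be invoked after the engines are opened and before the
// node connects to the cluster.
func runEarlyHooks(ctx context.Context, engines []Engine, hooks []EngineHook) error {
	for _, h := range hooks {
		if err := h(ctx, engines); err != nil {
			return err
		}
	}
	return nil
}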

danhhz (Contributor, Author) commented on Apr 3, 2020

Don't think this is useful anymore. Closing

danhhz closed this on Apr 3, 2020
danhhz deleted the migration branch on April 3, 2020 at 16:48