
[WIP] migrations: introduce new system which supports long-running migrations #44249

Closed
danhhz wants to merge 2 commits

Conversation

danhhz (Contributor) commented on Jan 23, 2020

This subsumes the cluster version cluster setting. There is a large
overlap with pkg/sqlmigrations, which should be cleaned up in the
future.

Closes #39182

Release note (<category, see below>):

danhhz requested a review from tbg on January 23, 2020 at 00:39
cockroach-teamcity (Member) commented: This change is Reviewable

danhhz (Contributor, Author) commented on Jan 23, 2020

@tbg this is a very early draft and mostly unfinished, but I wanted to get it out in time to get your initial feedback tomorrow

tbg (Member) left a comment


Reviewed 14 of 14 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/migration/migration.go, line 52 at r1 (raw file):

}

func init() {

We've amassed a larger list of migrations that fit this bill, right? Personally, I think the most useful thing would be to see them all stubbed out here, to have a good sample of the kinds of things we're expecting the migrations to need to do. The separate raft log engine migration, for example, or some migration that uses #42822, and whatever other ones we've mentioned.
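
For a concrete picture, here is a rough Go sketch of what such stubbed-out registrations could look like. Every name in it (Helper, register, migrationFn, the migration strings) is hypothetical and not the API in this draft; it only illustrates the shape of "a named list of long-running closures registered in init()".

package migration

import "context"

// Helper stands in for whatever the framework hands each migration (KV
// access, node liveness, job hooks, ...); purely illustrative here.
type Helper struct{}

type migrationFn func(ctx context.Context, h *Helper) error

var registry = map[string]migrationFn{}

func register(name string, fn migrationFn) { registry[name] = fn }

func init() {
	// Stubs for the kinds of long-running migrations discussed above.
	register("separate raft log engine", func(ctx context.Context, h *Helper) error {
		// TODO: move the raft log into its own engine on every store.
		return nil
	})
	register("cleanup built on #42822", func(ctx context.Context, h *Helper) error {
		// TODO: a migration that uses the mechanism from #42822.
		return nil
	})
}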


pkg/migration/connect/connect.go, line 30 at r1 (raw file):

	// (filtering out self, etc). On one hand it'd be nice to keep this from
	// depending on gossip (so the response could be _used_ by gossip), but it
	// would be unfortunate to duplicate this initial work.

Yeah, the joining process is quite complex (think about resolvers, etc.); that was also my main worry over in #32574.
It's worth figuring out whether we're allowed to simplify the semantics. Such as: a node that starts up must be able to reach another initialized (or soon-to-be-initialized) node via its join flags. For example, it would be illegal to start a new node with only bogus join flags and to rely on it being mentioned in other nodes' join flags (which could then reach out to it).
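
As a sketch of those simplified semantics (hypothetical code, not part of this PR, and joinCluster/dial are made-up names): the new node keeps retrying its own --join targets until one of them is initialized, and gives up rather than waiting to be contacted.

package connect

import (
	"context"
	"errors"
	"time"
)

// joinCluster retries the node's own --join targets until one of them is
// initialized; it never relies on being mentioned in other nodes' --join
// flags.
func joinCluster(
	ctx context.Context,
	joinAddrs []string,
	dial func(addr string) (initialized bool, err error),
) error {
	for {
		for _, addr := range joinAddrs {
			if ok, err := dial(addr); err == nil && ok {
				return nil // reached an initialized peer; proceed with Connect
			}
		}
		select {
		case <-ctx.Done():
			return errors.New("could not reach an initialized node via --join")
		case <-time.After(time.Second):
			// Retry: a peer listed in --join may become initialized soon.
		}
	}
}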


pkg/migration/everynode/everynode.go, line 72 at r1 (raw file):

// RunHookOnEveryNode executes the requested hook closure on each node in the
// cluster and each node added to the cluster in the future. If no error is
// returned, the caller is guaranteed that no node will ever again serve SQL or

Make this stronger: "connect to the cluster".

Lots of stuff happens before "serving SQL or KV traffic". We want to preempt all of the moving parts here.
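
A hedged sketch of that stronger guarantee, under an assumed shape (runHookOnEveryNode, persistRequirement, and runOnNode are made-up names, not this PR's API): the requirement is persisted first so that future Connect calls can be gated on it, and only then does the fan-out to current nodes happen.

package everynode

import "context"

// runHookOnEveryNode: once it returns nil, no node will ever again connect
// to the cluster without having run the hook, which fences off everything
// that happens before a node serves SQL or KV traffic.
func runHookOnEveryNode(
	ctx context.Context,
	persistRequirement func(context.Context) error, // durably mark the hook as required
	nodes []string, // addresses of all currently known nodes
	runOnNode func(context.Context, string) error, // ask one node to run the hook
) error {
	// 1. Record that the hook is required, so future Connect calls reject
	//    any node that has not acknowledged running it.
	if err := persistRequirement(ctx); err != nil {
		return err
	}
	// 2. Fan out to every current node and wait for each to report success.
	for _, n := range nodes {
		if err := runOnNode(ctx, n); err != nil {
			return err
		}
	}
	return nil
}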


pkg/migration/everynode/everynode.go, line 79 at r1 (raw file):

// RPC and returned a successful response.
//
// WIP how do we communicate to the user if this is blocking while a node is

I was hoping the job could have a good description of where it's at, but maybe that's not enough


pkg/migration/everynode/everynode.go, line 82 at r1 (raw file):

// down?
//
// WIP what about nodes that are decommissioned and later re-added? Is this even

It's "legal" today but won't be legal in the future. If we wanted to prevent this, we could add a NodeID graveyard against which incoming Connect calls are checked (init'ed nodes will put their NodeID into the RPC). As is, we'd be relying on the operator to just not bring decommissioned nodes back, which I agree we'll ultimately have to fix.
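
A minimal sketch of that graveyard check, assuming a hypothetical request shape (connectRequest and checkConnect are not this PR's names, and a real implementation would persist the graveyard rather than keep it in memory):

package connect

import "fmt"

// connectRequest: an already-initialized node puts its NodeID into the
// Connect RPC; brand-new nodes leave it zero.
type connectRequest struct {
	NodeID int32
}

// decommissioned is the "graveyard" of NodeIDs that must never rejoin.
var decommissioned = map[int32]bool{}

func checkConnect(req connectRequest) error {
	if req.NodeID != 0 && decommissioned[req.NodeID] {
		return fmt.Errorf("n%d was decommissioned and may not rejoin the cluster", req.NodeID)
	}
	return nil
}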


pkg/migration/fflag/fflag.go, line 39 at r1 (raw file):

//
// WIP how does this work for cluster init? separate method in this package?
func GetHandle(bootVersion ClusterVersion) *Handle {

NewHandle?


pkg/migration/fflag/fflag.go, line 47 at r1 (raw file):

However, a feature that is active once is guaranteed to be active for eternity.

Still true, right? Regarding the "feature flag" framing, are feature flags generally thought of as reversible? Because here they aren't. In general, as we release new functionality, as long as it's a true feature flag, this means we'd keep it out of.. these feature flags (but use a cluster setting or such instead)? I wonder if that could be confusing - the only feature flags that we have in the migrations are the non-reversible ones, but I'm not so sure I'd think of them as "feature flags" any more... it's more of a multi-frontier migration, where you can migrate into features individually.

I read https://martinfowler.com/articles/feature-toggles.html to gut-check my intuition and there it seems that a "feature flag" should be a simple switch that goes both ways. So maybe we should reconsider the naming here and stick to "migrations" naming while perhaps in the future adding a separate mechanism for true feature flags once we feel that is something we need (not today).


pkg/migration/fflag/fflag.go, line 97 at r1 (raw file):

		Version: roachpb.Version{Major: 19, Minor: 2, Unstable: 3},
	},
	{

Since we're here: it has always bothered me that this list grows in the wrong direction. It'd be easier if we reversed it; then the VersionKey and the new version would be next to each other in the code.


pkg/server/server.go, line 213 at r1 (raw file):

func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
	// WIP do we know way up here whether this node is bootstrapping the cluster
	// or not?

No, that happens only way down, when we .start the Node:

cockroach/pkg/server/server.go

Lines 1341 to 1368 in 46a38c9

	bootstrappedEngines, _, _, err := inspectEngines(
		ctx, s.engines,
		cluster.Version.BinaryMinSupportedVersion(s.cfg.Settings),
		cluster.Version.BinaryVersion(s.cfg.Settings),
		&s.rpcContext.ClusterID)
	if err != nil {
		return errors.Wrap(err, "inspecting engines")
	}

	// Filter the gossip bootstrap resolvers based on the listen and
	// advertise addresses.
	listenAddrU := util.NewUnresolvedAddr("tcp", s.cfg.Addr)
	advAddrU := util.NewUnresolvedAddr("tcp", s.cfg.AdvertiseAddr)
	advSQLAddrU := util.NewUnresolvedAddr("tcp", s.cfg.SQLAdvertiseAddr)
	filtered := s.cfg.FilterGossipBootstrapResolvers(ctx, listenAddrU, advAddrU)
	s.gossip.Start(advAddrU, filtered)
	log.Event(ctx, "started gossip")

	if s.cfg.DelayedBootstrapFn != nil {
		defer time.AfterFunc(30*time.Second, s.cfg.DelayedBootstrapFn).Stop()
	}

	var hlcUpperBoundExists bool
	// doBootstrap is set if we're the ones who bootstrapped the cluster.
	var doBootstrap bool
	if len(bootstrappedEngines) > 0 {
		// The cluster was already initialized.
		doBootstrap = false

There's no reason this couldn't be lifted, though. Note also that the logic will simplify after #44112: if we have no initialized engines, we go into init mode (i.e. wait for a bootstrap cmd or for one of our peers to become initialized). If there are init'ed engines, we're already set up. So no need to look at the join flags, though it raises the question of what it means to start a node without a join flag.

The natural sequencing here is to let Raphael's PR merge (quite a bit of work involved in that one, too). Then, pull the determination of whether to bootstrap out into something we can query before the call to Connect. If we're bootstrapped, send the NodeID in Connect (otherwise don't, and let Connect allocate a NodeID and StoreIDs for you). This actually boils down cleanly to "do #32574"? Adding the migration stuff you need should be straightforward after that.

By the way, I would be quite excited to have that happen. That work item has always gotten in the way of seriously thinking about cleaning up our server startup code.
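
Roughly, that pre-Connect determination could look like the following sketch (all names are hypothetical; inspect stands in for something like inspectEngines, and Engine is a placeholder for engine.Engine):

package server

import "context"

// Engine is a stand-in for engine.Engine; illustrative only.
type Engine interface{}

// decideBootstrap determines from the local engines, before calling Connect,
// whether this node is already part of an initialized cluster.
func decideBootstrap(
	ctx context.Context,
	engines []Engine,
	inspect func(ctx context.Context, engines []Engine) (initializedEngines int, nodeID int32, err error),
) (alreadyInitialized bool, nodeID int32, err error) {
	n, nodeID, err := inspect(ctx, engines)
	if err != nil {
		return false, 0, err
	}
	if n > 0 {
		// Initialized engines found: we're already set up, so send our
		// NodeID in Connect.
		return true, nodeID, nil
	}
	// No initialized engines: go into init mode (wait for a bootstrap
	// command or for a peer to become initialized) and let Connect allocate
	// a NodeID and StoreIDs for us.
	return false, 0, nil
}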


pkg/server/server.go, line 556 at r1 (raw file):

		txnMetrics, nil /* execCfg */, &s.rpcContext.ClusterID)

	// WIP is this too late in the boot process?

Yeah definitely, but also: what can be in a hook? You want to run this early (to avoid setting up all of those moving parts), so all you'll have is the slice of engine.Engines; you can't do KV, and you don't have your *Node or even your *Server set up.

danhhz (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/migration/fflag/fflag.go, line 47 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

However, a feature that is active once is guaranteed to be active for eternity.

Still true, right? Regarding the "feature flag" framing, are feature flags generally thought of as reversible? Because here they aren't. In general, as we release new functionality, as long as it's a true feature flag, this means we'd keep it out of.. these feature flags (but use a cluster setting or such instead)? I wonder if that could be confusing - the only feature flags that we have in the migrations are the non-reversible ones, but I'm not so sure I'd think of them as "feature flags" any more... it's more of a multi-frontier migration, where you can migrate into features individually.

I read https://martinfowler.com/articles/feature-toggles.html to gut-check my intuition and there it seems that a "feature flag" should be a simple switch that goes both ways. So maybe we should reconsider the naming here and stick to "migrations" naming while perhaps in the future adding a separate mechanism for true feature flags once we feel that is something we need (not today).

Makes sense; I just needed a new package name for dep reasons. Does something like "feature gate" feel less flip-floppy, or should I stay away from "feature" in the name entirely?


pkg/server/server.go, line 556 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yeah definitely, but also: what can be in a hook? You want to run this early (to avoid setting up all of those moving parts), so all you'll have is the slice of engine.Engines; you can't do KV, and you don't have your *Node or even your *Server set up.

Yeah, I think this should just be a slice of engine.Engines.
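
For illustration, a hedged sketch of what such an engines-only hook could look like (EngineHook, runEarlyHooks, and the Engine placeholder are all made up for this sketch): the hooks run right after the engines are opened, before the node connects to the cluster, so they get nothing but the engines.

package everynode

import "context"

// Engine is a stand-in for engine.Engine; illustrative only.
type Engine interface{}

// EngineHook runs early in startup: there is no KV, no *Node, and no *Server
// to lean on at that point, only the opened engines.
type EngineHook func(ctx context.Context, engines []Engine) error

// runEarlyHooks would be invoked after the engines are opened and before the
// node connects to the cluster.
func runEarlyHooks(ctx context.Context, engines []Engine, hooks []EngineHook) error {
	for _, h := range hooks {
		if err := h(ctx, engines); err != nil {
			return err
		}
	}
	return nil
}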

danhhz (Contributor, Author) commented on Apr 3, 2020

Don't think this is useful anymore. Closing

danhhz closed this on Apr 3, 2020
danhhz deleted the migration branch on April 3, 2020 at 16:48