storage: support adding Stores to existing Nodes #39415
I'm able to reproduce this on master. I can see that we're correctly detecting the bootstrapped and empty engine here: cockroach/pkg/server/server.go Lines 1375 to 1379 in 311f7e4
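To make the check being referenced a bit more concrete, here is a minimal Go sketch of classifying storage engines at startup into "already bootstrapped" and "empty, needs a store ID". The `Engine` interface, `ReadClusterIdent`, and `ErrNotBootstrapped` are assumptions for illustration, not CockroachDB's actual API:

```go
package storeinit

import "errors"

// ErrNotBootstrapped is a hypothetical sentinel returned by an engine that
// has never been initialized with a store identity.
var ErrNotBootstrapped = errors.New("engine not bootstrapped")

// Engine is a stand-in for a storage engine handle.
type Engine interface {
	// ReadClusterIdent returns the persisted store identity, or
	// ErrNotBootstrapped if the engine is still empty.
	ReadClusterIdent() (storeID int32, err error)
}

// partitionEngines splits engines into those that already carry a store ID
// (bootstrapped) and those that are empty and still need one allocated.
func partitionEngines(engines []Engine) (bootstrapped, empty []Engine, err error) {
	for _, eng := range engines {
		if _, err := eng.ReadClusterIdent(); err != nil {
			if errors.Is(err, ErrNotBootstrapped) {
				empty = append(empty, eng)
				continue
			}
			return nil, nil, err
		}
		bootstrapped = append(bootstrapped, eng)
	}
	return bootstrapped, empty, nil
}
```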
The startup thread appears to be stuck here:
The range lookup that it is trying to perform is stuck two meta lookups up in
This is bringing up memories of #37906.
This write is stuck waiting for replication here:
Time to logspy and figure out what's going on on that Range.
The replica on the restarted node keeps trying to call a Raft election but it's never getting a response. We see this in:
Meanwhile, the other two nodes seem to be having a hard time sending Raft RPCs to node 3:
This looks to be due to the following restriction on the RPCs that can be served during bootstrapping: cockroach/pkg/server/grpc_server.go Lines 65 to 70 in 2b23499
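For readers unfamiliar with that restriction, here is a minimal Go sketch of the general pattern: a gRPC interceptor that rejects everything outside a small allow-list until the node is declared operational. The method names in the map and the `rpcGateway` type are illustrative assumptions, not the actual `grpc_server.go` implementation:

```go
package rpcgate

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// rpcsAllowedWhileBootstrapping lists the few RPCs a node will serve before
// it is fully initialized (illustrative subset only).
var rpcsAllowedWhileBootstrapping = map[string]bool{
	"/cockroach.rpc.Heartbeat/Ping":             true,
	"/cockroach.server.serverpb.Init/Bootstrap": true,
}

type rpcGateway struct {
	operational int32 // set to 1 once node initialization completes
}

func (g *rpcGateway) setOperational() { atomic.StoreInt32(&g.operational, 1) }

// intercept rejects any RPC not on the allow-list until the node is
// operational; everything else gets codes.Unavailable.
func (g *rpcGateway) intercept(
	ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (interface{}, error) {
	if atomic.LoadInt32(&g.operational) == 0 && !rpcsAllowedWhileBootstrapping[info.FullMethod] {
		return nil, status.Error(codes.Unavailable, "node waiting for init; this RPC is not yet available")
	}
	return handler(ctx, req)
}
```

With a gate like this in place, a restarted node's own Raft traffic (and any KV request that lands on a local replica) is refused until initialization finishes, which is exactly why the store-ID allocation described below can deadlock.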
By the way, I tested this on
Adding
Fixes cockroachdb#39415.

This commit fixes the issue we saw in cockroachdb#39415, where a node would attempt to allocate a store ID for a new store on an existing node and would get stuck. The reason it would get stuck is that the KV request to allocate the store ID could be routed to the node's own replica for Range 1 (when doing a range descriptor lookup) over the node's internalClientAdapter. This KV request could then get stuck attempting to acquire a lease because we would still be blocking all Raft RPC traffic (see rpcsAllowedWhileBootstrapping). This would cause the KV request to stall indefinitely.

The fix is to tie the operationality of the internalClientAdapter to that of the rpcServer. If we don't want to be serving external KV traffic during this node initialization period, then we shouldn't be serving internal KV traffic either.

DNM: this still needs a higher-level test that verifies that the bug we saw when adding a store to a node with existing stores during bootstrap is no longer possible. I have only verified this manually so far.

Release note: None
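A rough Go sketch of the idea in that patch: the in-process KV client path should obey the same "operational" gate as the external gRPC server, so a request can't sneak onto a local replica while Raft traffic is still blocked. The types and fields here are assumptions for illustration only:

```go
package kvclient

import (
	"context"
	"errors"
)

type batchRequest struct{ /* elided */ }
type batchResponse struct{ /* elided */ }

// nodeServer is the local KV endpoint the adapter would normally call
// directly, bypassing gRPC.
type nodeServer interface {
	Batch(context.Context, *batchRequest) (*batchResponse, error)
}

// internalClientAdapter routes KV requests to the local node without going
// through gRPC. operational reports whether the RPC gateway has opened up.
type internalClientAdapter struct {
	server      nodeServer
	operational func() bool
}

func (a *internalClientAdapter) Batch(ctx context.Context, ba *batchRequest) (*batchResponse, error) {
	// Mirror the external gateway: if external KV RPCs would be refused,
	// refuse internal ones too instead of letting them stall indefinitely.
	if !a.operational() {
		return nil, errors.New("node not yet operational; retry against another node")
	}
	return a.server.Batch(ctx, ba)
}
```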
Here's a script to reproduce this issue using roachprod and a local cluster:
Will this be fixed in v20?
Hi @israellot. This has not yet been fixed in v20.1 but is still on our near-term roadmap. Do you mind describing the sequence of events that led you to this issue? Is it currently blocking you?
Hi,
I think we should pull … The point is that adding the additional stores should have completed by the time … The other option is to just make it async (in which case it can live anywhere), but I would try to avoid that as well.
Not exactly a starter project, but tagging it as such because it'd make for a good intern project this fall. Seems like we've been wanting to do this for a while.
Hm, I tried the following setup off of
Maybe all that's remaining here is to add a test. I'll try finding out when exactly this started working again (a strange bisect to do, ha).
Can't take any credit here. I can repro it in 19.2, but can't in 20.1 and beyond.
I don't know why this readily repros in 19.2 but not in 20.1+, but I'm pretty sure that the current code still has the same problems. We are bootstrapping additional stores (i.e. using KV) in cockroach/pkg/server/server.go Line 1565 in 50ffec8
It makes sense to me that the problem wouldn't occur reliably, though. If the bootstrap requests don't hit a local replica, all is well: the rest of the cluster is operational (in this example), so it will do the KV operation, and things smooth out. Be that as it may, there is still a general problem with allocating these extra NodeIDs in
Basically we have to allocate these extra nodeIDs asynchronously, so that's what we should do.
(or we do it after switching to modeOperational - however this will mean sequential restarts won't work, hence I opt for fully throwing it into a goroutine. BTW this is how it worked in the early days of CRDB until I got us into this mess with a well-meaning drive-by change)
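A short Go sketch of the "throw it into a goroutine" idea: allocate store IDs and bootstrap the extra engines in the background, after the node has started serving, instead of blocking node start on a KV round trip that may need this very node. `allocateStoreIDs` and `bootstrapStore` are hypothetical callbacks, not CockroachDB's actual functions:

```go
package storebootstrap

import (
	"context"
	"log"
)

type engine struct{ /* elided */ }

// bootstrapAdditionalStores kicks off store bootstrapping in the background.
// Running after startup means the KV layer (and the rest of the cluster) is
// reachable, so the ID allocation can't deadlock on this node's own gated RPCs.
func bootstrapAdditionalStores(
	ctx context.Context,
	emptyEngines []*engine,
	allocateStoreIDs func(ctx context.Context, n int) (firstID int32, err error),
	bootstrapStore func(ctx context.Context, eng *engine, id int32) error,
) {
	go func() {
		first, err := allocateStoreIDs(ctx, len(emptyEngines))
		if err != nil {
			log.Printf("unable to allocate store IDs: %v", err)
			return
		}
		for i, eng := range emptyEngines {
			if err := bootstrapStore(ctx, eng, first+int32(i)); err != nil {
				log.Printf("unable to bootstrap store: %v", err)
				return
			}
		}
	}()
}
```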
Asynchronously allocating store IDs/bootstrapping new stores makes sense to me. My reproduction steps above were just noise; I had assumed the problem was deterministic. I was able to repro it in 20.1+, though a bit less reliably.
We need to bootstrap additional stores asynchronously. Consider the range that houses the store ID allocator. When restarting the set of nodes that holds a quorum of these replicas with additional stores, those additional stores will require store IDs to get fully bootstrapped. But if we're gating node start (specifically opening up the RPC floodgates) on having all stores fully bootstrapped, we'll simply hang when trying to allocate store IDs. See TestAddNewStoresToExistingNodes and cockroachdb#39415 for more details.

Instead we opt to bootstrap additional stores asynchronously, and rely on the blocking function to signal to the caller that all stores have been fully bootstrapped.

Release note: None

Co-authored-by: Sam Huang <samh@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
55350: server: support adding stores to existing nodes r=irfansharif a=TheSamHuang

We need to bootstrap additional stores asynchronously. Consider the range that houses the store ID allocator. When restarting the set of nodes that holds a quorum of these replicas with additional stores, those additional stores will require store IDs to get fully bootstrapped. But if we're gating node start (specifically opening up the RPC floodgates) on having all stores fully bootstrapped, we'll simply hang when trying to allocate store IDs. See TestAddNewStoresToExistingNodes and #39415 for more details.

Instead we opt to bootstrap additional stores asynchronously, and rely on the blocking function to signal to the caller that all stores have been fully bootstrapped.

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
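The "blocking function" pattern described above can be sketched in a few lines of Go: start the asynchronous bootstrap and hand callers (and tests such as TestAddNewStoresToExistingNodes) a function they can block on until every additional store has been bootstrapped. Names here are illustrative, not the actual server code:

```go
package server

import "context"

// startAdditionalStoreBootstrap returns immediately after launching the
// background bootstrap; the returned wait function blocks until that work
// finishes or waitCtx is done.
func startAdditionalStoreBootstrap(
	ctx context.Context, bootstrapAll func(context.Context) error,
) (wait func(context.Context) error) {
	done := make(chan error, 1)
	go func() { done <- bootstrapAll(ctx) }()
	return func(waitCtx context.Context) error {
		select {
		case err := <-done:
			return err
		case <-waitCtx.Done():
			return waitCtx.Err()
		}
	}
}
```

Node start can proceed without waiting, while anything that genuinely needs all stores online (a test, or a later startup phase) calls the returned function.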
Describe the problem
Please describe the issue you observed, and any steps we can take to reproduce it:
In an existing cluster, if I take one node down and then add a new store to the node, the node comes back up but it is not able to connect to the cluster.
To Reproduce
What did you do? Describe in your own words.
If possible, provide steps to reproduce the behavior:
Expected behavior
The node should come back online and connect to the cluster with the new store.
Additional data / screenshots
If the problem is SQL-related, include a copy of the SQL query and the schema
of the supporting tables.
If a node in your cluster encountered a fatal error, supply the contents of the
log directories (at minimum of the affected node(s), but preferably all nodes).
Note that log files can contain confidential information. Please continue
creating this issue, but contact support@cockroachlabs.com to submit the log
files in private.
If applicable, add screenshots to help explain your problem.
Logs are here: cockroach.log
Environment:
Additional context
What was the impact?
Add any other context about the problem here.