
Slow heartbeat makes the cluster crash, and it cannot recover by restart #45929

Closed
yangxuanjia opened this issue Mar 10, 2020 · 29 comments
Labels
O-community Originated from the community

Comments


yangxuanjia commented Mar 10, 2020

Describe the problem
A slow heartbeat makes the cluster crash, and it cannot recover after a restart.

W200307 22:15:13.201020 2474 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 1.3s
I200307 22:15:17.715083 2467 server/status/runtime.go:500  [n1] runtime stats: 9.8 GiB RSS, 710 goroutines, 1.2 GiB/606 MiB/2.1 GiB GO alloc/idle/total, 6.7 GiB/8.5 GiB CGO alloc/total, 110953.5 CGO/sec, 82.7/17.5 %(u/s)time, 0.0 %gc (1x), 3.1 MiB/3.5 MiB (r/w)net
I200307 22:15:20.947038 2474 storage/node_liveness.go:775  [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded)
W200307 22:15:20.947068 2474 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200307 22:15:20.947086 2474 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
W200307 22:15:21.038798 180 storage/closedts/provider/provider.go:156  [ct-closer] unable to move closed timestamp forward: not live
W200307 22:15:24.600922 2469 server/node.go:880  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:liveness.heartbeatfailures Value:1 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
W200307 22:15:24.821016 2474 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 3.9s
W200307 22:15:24.821081 23074906 storage/node_liveness.go:523  [n1,s1,r55038/2:/Table/57/1/1014808{07/…-24/…}] slow heartbeat took 3.4s
E200307 22:15:25.351386 23126084 storage/replica_range_lease.go:336  [n1,s1,r202167/2:/Table/57/2/1696990{48/…-90/…}] mismatch incrementing epoch for {NodeID:5 Epoch:2 Expiration:1583619324.226160621,0 Draining:false Decommissioning:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}; actual is {NodeID:5 Epoch:2 Expiration:1583619333.226119278,0 Draining:false Decommissioning:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
W200307 22:15:27.182849 2474 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 1.7s
W200307 22:15:27.182908 23126115 storage/node_liveness.go:523  [n1,s1,r55038/2:/Table/57/1/1014808{07/…-24/…}] slow heartbeat took 1.1s
I200307 22:15:27.756652 2467 server/status/runtime.go:500  [n1] runtime stats: 9.9 GiB RSS, 825 goroutines, 890 MiB/864 MiB/2.1 GiB GO alloc/idle/total, 6.7 GiB/8.5 GiB CGO alloc/total, 52833.9 CGO/sec, 298.9/56.5 %(u/s)time, 0.0 %gc (2x), 12 MiB/15 MiB (r/w)net
W200307 22:15:34.589185 2469 server/node.go:880  [n1,summaries] health alerts detected: {Alerts:[{StoreID:1 Category:METRICS Description:ranges.unavailable Value:7524 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} {StoreID:1 Category:METRICS Description:ranges.underreplicated Value:36292 XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}

To Reproduce

  1. Batch-write a large amount of data.
  2. Run some SELECT queries, mostly of the form SELECT ... WHERE name IN (...), with roughly 60K values in the IN list (a sketch of this workload follows the list).
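
A minimal sketch of the kind of workload described above, assuming the JDBC client listed under Environment, the PostgreSQL JDBC driver on the classpath, and a hypothetical table t(name STRING, val INT); the real schema, row counts, and connection details differ:

```java
import java.sql.*;

public class ReproWorkload {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection string for an insecure CockroachDB node; adjust host/port/db.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://127.0.0.1:26257/defaultdb?sslmode=disable", "root", "")) {
            conn.setAutoCommit(false);

            // 1. Batch-write a large amount of data.
            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO t (name, val) VALUES (?, ?)")) {
                for (int i = 0; i < 100_000; i++) {
                    insert.setString(1, "name-" + i);
                    insert.setInt(2, i);
                    insert.addBatch();
                    if (i % 1_000 == 0) {
                        insert.executeBatch();
                        conn.commit();
                    }
                }
                insert.executeBatch();
                conn.commit();
            }

            // 2. SELECT with a very large IN list (~60K values).
            StringBuilder inList = new StringBuilder();
            for (int i = 0; i < 60_000; i++) {
                if (i > 0) inList.append(',');
                inList.append('\'').append("name-").append(i).append('\'');
            }
            try (Statement select = conn.createStatement();
                 ResultSet rs = select.executeQuery(
                         "SELECT * FROM t WHERE name IN (" + inList + ")")) {
                while (rs.next()) {
                    // Consume rows; the point is the size of the IN list, not the results.
                }
            }
        }
    }
}
```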

Expected behavior
The cluster stays available and continues to serve reads and writes.

Additional data / screenshots

W200307 22:16:30.750660 23125919 storage/replica_range_lease.go:982  [n1,consistencyChecker,s1,r83392/2:/Table/57/3/"xxxdata…] have been waiting 1m0s attempting to acquire lease
I200307 22:16:30.750719 23125919 storage/replica_range_lease.go:987  [n1,consistencyChecker,s1,r83392/2:/Table/57/3/"xxxdata…] slow lease acquisition finished after 1m0.000119336s with error [NotLeaseHolderError] r83392: replica (n1,s1):2 not lease holder; lease holder unknown after 1 attempts
W200307 22:16:30.752304 23125920 storage/replica_write.go:199  [n1,s1,r83392/2:/Table/57/3/"xxxdata…] have been waiting 60.00s for proposing command RequestLease [/Table/57/3/"xxxdata"/2017-12-19T13:17:18Z/110339035/1066883614/0,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

	http://yourhost:80/#/reports/range/83392

and the following Raft status: {"id":"2","term":10,"vote":"1","commit":591,"lead":"1","raftState":"StateFollower","applied":591,"progress":{},"leadtransferee":"0"}
W200307 22:16:31.608114 23126307 storage/replica_range_lease.go:982  [n1,merge,s1,r83392/2:/Table/57/3/"xxxdata…] have been waiting 1m0s attempting to acquire lease
E200307 22:17:31.608382 23127557 storage/queue.go:829  [n1,merge,s1,r115946/2:/Table/57/4/2018-09-14T16:1…] aborted during DistSender.Send: context deadline exceeded
E200307 22:18:37.609275 23130852 storage/queue.go:829  [n1,merge,s1,r210205/2:/Table/57/3/"xxxdata2"…] aborted during DistSender.Send: context deadline exceeded
W200307 22:23:57.641914 23139566 storage/replica_range_lease.go:982  [n1,merge,s1,r214697/2:/Table/57/3/"xxxdata3…] have been waiting 1m0s attempting to acquire lease
I200307 22:23:57.641964 23139566 storage/replica_range_lease.go:987  [n1,merge,s1,r214697/2:/Table/57/3/"xxxdata3…] slow lease acquisition finished after 1m0.000108221s with error [NotLeaseHolderError] r214697: replica (n1,s1):2 not lease holder; lease holder unknown after 1 attempts
W200307 22:35:24.446228 23162942 storage/replica_range_lease.go:982  [n1,consistencyChecker,s1,r158950/2:/Table/57/3/"kuda2014"/201…] have been waiting 1m0s attempting to acquire lease
I200307 22:35:24.446286 23162942 storage/replica_range_lease.go:987  [n1,consistencyChecker,s1,r158950/2:/Table/57/3/"kuda2014"/201…] slow lease acquisition finished after 1m0.000106958s with error [NotLeaseHolderError] r158950: replica (n1,s1):2 not lease holder; lease holder unknown after 1 attempts
W200307 22:35:24.446633 23162303 storage/replica_write.go:199  [n1,s1,r158950/2:/Table/57/3/"kuda2014"/201…] have been waiting 60.00s for proposing command RequestLease [/Table/57/3/"jkuda2014"/2019-06-16T11:16:42Z/150869193/1075758614/0,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

	http://yourhost:80/#/reports/range/158950

and the following Raft status: {"id":"2","term":8,"vote":"1","commit":2602,"lead":"1","raftState":"StateFollower","applied":2602,"progress":{},"leadtransferee":"0"}
E200307 22:50:46.662305 23191951 storage/queue.go:829  [n1,merge,s1,r211445/3:/Table/57/1/173828{667…-888…}] aborted during DistSender.Send: context deadline exceeded
I200307 22:50:47.947328 23195800 internal/client/txn.go:618  [n1] async rollback failed: aborted during DistSender.Send: context deadline exceeded
W200307 22:50:49.447256 2474 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
I200307 22:50:49.447284 2474 storage/node_liveness.go:453  [n1,hb] heartbeat failed on epoch increment; retrying
W200307 22:50:49.447315 23194845 storage/node_liveness.go:523  [n1,s1,r40785/11:/System/tsd/cr.node.c{ha…-lo…}] slow heartbeat took 2.8s
E200307 22:50:49.447329 23194845 storage/replica_range_lease.go:292  [n1,s1,r40785/11:/System/tsd/cr.node.c{ha…-lo…}] heartbeat failed on epoch increment
W200307 22:50:49.447345 23195883 storage/node_liveness.go:523  [n1,s1,r40793/8:/System/tsd/cr.node.sql.mem.a…] slow heartbeat took 2.8s
E200307 22:50:49.447357 23195883 storage/replica_range_lease.go:292  [n1,s1,r40793/8:/System/tsd/cr.node.sql.mem.a…] heartbeat failed on epoch increment
W200307 22:50:49.447425 23195881 storage/node_liveness.go:523  [n1,s1,r9423/5:/System/tsd/cr.node.sql.dists…] slow heartbeat took 2.8s
W200307 22:50:49.447316 23195879 storage/node_liveness.go:523  [n1,s1,r40667/11:/System/tsd/cr.node.exec.late…] slow heartbeat took 2.8s
E200307 22:50:49.447458 23195879 storage/replica_range_lease.go:292  [n1,s1,r40667/11:/System/tsd/cr.node.exec.late…] heartbeat failed on epoch increment
E200307 22:50:49.447441 23195881 storage/replica_range_lease.go:292  [n1,s1,r9423/5:/System/tsd/cr.node.sql.dists…] heartbeat failed on epoch increment
W200307 22:50:49.447490 23195885 storage/node_liveness.go:523  [n1,s1,r40826/6:/System/tsd/cr.node.sql.mem.c…] slow heartbeat took 2.8s
E200307 22:50:49.447508 23195885 storage/replica_range_lease.go:292  [n1,s1,r40826/6:/System/tsd/cr.node.sql.mem.c…] heartbeat failed on epoch increment
W200307 22:50:49.447587 23195887 storage/node_liveness.go:523  [n1,s1,r40681/9:/System/tsd/cr.node.sql.mem.i…] slow heartbeat took 2.8s
E200307 22:50:49.447600 23195887 storage/replica_range_lease.go:292  [n1,s1,r40681/9:/System/tsd/cr.node.sql.mem.i…] heartbeat failed on epoch increment
W200307 22:50:49.447697 23195590 storage/node_liveness.go:523  [n1,s1,r52165/5:/System/tsd/cr.node.sql.mem.s…] slow heartbeat took 2.8s
E200307 22:50:49.447716 23195590 storage/replica_range_lease.go:292  [n1,s1,r52165/5:/System/tsd/cr.node.sql.mem.s…] heartbeat failed on epoch increment
W200307 22:50:49.447806 23195592 storage/node_liveness.go:523  [n1,s1,r269449/13:/System/tsd/cr.node.sql.mem.s…] slow heartbeat took 2.8s
E200307 22:50:49.447816 23195592 storage/replica_range_lease.go:292  [n1,s1,r269449/13:/System/tsd/cr.node.sql.mem.s…] heartbeat failed on epoch increment
W200307 22:50:49.447878 23195593 storage/node_liveness.go:523  [n1,s1,r29444/14:/System/tsd/cr.node.sql.…] slow heartbeat took 2.8s
E200307 22:50:49.447891 23195593 storage/replica_range_lease.go:292  [n1,s1,r29444/14:/System/tsd/cr.node.sql.…] heartbeat failed on epoch increment
W200307 22:50:49.447989 23195339 storage/node_liveness.go:523  [n1,s1,r40847/7:/System/tsd/cr.node.sql.optim…] slow heartbeat took 2.8s
E200307 22:50:49.448002 23195339 storage/replica_range_lease.go:292  [n1,s1,r40847/7:/System/tsd/cr.node.sql.optim…] heartbeat failed on epoch increment
W200307 22:50:49.448161 23194496 storage/node_liveness.go:523  [n1,s1,r63/6:/System/tsd/cr.node.sql.servi…] slow heartbeat took 2.8s
E200307 22:50:49.448180 23194496 storage/replica_range_lease.go:292  [n1,s1,r63/6:/System/tsd/cr.node.sql.servi…] heartbeat failed on epoch increment
W200307 22:50:49.448254 23195889 storage/node_liveness.go:523  [n1,s1,r40658/11:/System/tsd/cr.node.s{ql…-ys…}] slow heartbeat took 2.8s
E200307 22:50:49.448267 23195889 storage/replica_range_lease.go:292  [n1,s1,r40658/11:/System/tsd/cr.node.s{ql…-ys…}] heartbeat failed on epoch increment
W200307 22:50:49.448364 23194498 storage/node_liveness.go:523  [n1,s1,r40816/2:/System/tsd/cr.node.sys.…] slow heartbeat took 2.8s
E200307 22:50:49.448377 23194498 storage/replica_range_lease.go:292  [n1,s1,r40816/2:/System/tsd/cr.node.sys.…] heartbeat failed on epoch increment
W200307 22:50:49.448471 23195849 storage/node_liveness.go:523  [n1,s1,r40848/13:/System/tsd/cr.node.txn.…] slow heartbeat took 2.8s
E200307 22:50:49.448492 23195849 storage/replica_range_lease.go:292  [n1,s1,r40848/13:/System/tsd/cr.node.txn.…] heartbeat failed on epoch increment
W200307 22:50:49.448593 23195925 storage/node_liveness.go:523  [n1,s1,r40639/8:/System/tsd/cr.node.txn.resta…] slow heartbeat took 2.8s
E200307 22:50:49.448617 23195925 storage/replica_range_lease.go:292  [n1,s1,r40639/8:/System/tsd/cr.node.txn.resta…] heartbeat failed on epoch increment
W200307 22:50:49.448769 23194558 storage/node_liveness.go:523  [n1,s1,r112133/11:/System/tsd/cr.store.c{a…-o…}] slow heartbeat took 2.8s
E200307 22:50:49.448803 23194558 storage/replica_range_lease.go:292  [n1,s1,r112133/11:/System/tsd/cr.store.c{a…-o…}] heartbeat failed on epoch increment
W200307 22:50:49.448871 23195595 storage/node_liveness.go:523  [n1,s1,r13188/11:/System/tsd/cr.{node.…-store…}] slow heartbeat took 2.8s
E200307 22:50:49.448884 23195595 storage/replica_range_lease.go:292  [n1,s1,r13188/11:/System/tsd/cr.{node.…-store…}] heartbeat failed on epoch increment
W200307 22:50:49.448960 23194560 storage/node_liveness.go:523  [n1,s1,r40822/6:/System/tsd/cr.store.compacto…] slow heartbeat took 2.8s
E200307 22:50:49.448979 23194560 storage/replica_range_lease.go:292  [n1,s1,r40822/6:/System/tsd/cr.store.compacto…] heartbeat failed on epoch increment
W200307 22:50:49.449035 23195341 storage/node_liveness.go:523  [n1,s1,r243661/11:/System/tsd/cr.store.l{e…-i…}] slow heartbeat took 2.8s
E200307 22:50:49.449051 23195341 storage/replica_range_lease.go:292  [n1,s1,r243661/11:/System/tsd/cr.store.l{e…-i…}] heartbeat failed on epoch increment
W200307 22:50:49.449314 23195343 storage/node_liveness.go:523  [n1,s1,r61/9:/System/tsd/cr.store.queue.…] slow heartbeat took 2.8s
E200307 22:50:49.449343 23195343 storage/replica_range_lease.go:292  [n1,s1,r61/9:/System/tsd/cr.store.queue.…] heartbeat failed on epoch increment
W200307 22:50:49.449363 23195346 storage/node_liveness.go:523  [n1,s1,r40788/8:/System/tsd/cr.store.queue.…] slow heartbeat took 2.8s
E200307 22:50:49.449381 23195346 storage/replica_range_lease.go:292  [n1,s1,r40788/8:/System/tsd/cr.store.queue.…] heartbeat failed on epoch increment
W200307 22:50:49.449386 23194846 storage/node_liveness.go:523  [n1,s1,r138517/7:/System/tsd/cr.store.queue.gc…] slow heartbeat took 2.8s
E200307 22:50:49.449403 23194846 storage/replica_range_lease.go:292  [n1,s1,r138517/7:/System/tsd/cr.store.queue.gc…] heartbeat failed on epoch increment
W200307 22:50:49.449454 23194562 storage/node_liveness.go:523  [n1,s1,r40676/12:/System/tsd/cr.store.queue.r…] slow heartbeat took 2.8s
E200307 22:50:49.449466 23194562 storage/replica_range_lease.go:292  [n1,s1,r40676/12:/System/tsd/cr.store.queue.r…] heartbeat failed on epoch increment
W200307 22:50:49.449524 23195940 storage/node_liveness.go:523  [n1,s1,r97964/12:/System/tsd/cr.store.{qu…-ra…}] slow heartbeat took 2.8s
E200307 22:50:49.449537 23195940 storage/replica_range_lease.go:292  [n1,s1,r97964/12:/System/tsd/cr.store.{qu…-ra…}] heartbeat failed on epoch increment
W200307 22:50:49.449641 23195956 storage/node_liveness.go:523  [n1,s1,r40782/10:/System/tsd/cr.store.raft.…] slow heartbeat took 2.8s
E200307 22:50:49.449653 23195956 storage/replica_range_lease.go:292  [n1,s1,r40782/10:/System/tsd/cr.store.raft.…] heartbeat failed on epoch increment
W200307 22:50:49.449729 23195942 storage/node_liveness.go:523  [n1,s1,r137464/8:/System/tsd/cr.store.raft.pro…] slow heartbeat took 2.8s
E200307 22:50:49.449742 23195942 storage/replica_range_lease.go:292  [n1,s1,r137464/8:/System/tsd/cr.store.raft.pro…] heartbeat failed on epoch increment
W200307 22:50:49.449810 23195959 storage/node_liveness.go:523  [n1,s1,r40638/9:/System/tsd/cr.store.txnwaitq…] slow heartbeat took 2.8s
E200307 22:50:49.449822 23195959 storage/replica_range_lease.go:292  [n1,s1,r40638/9:/System/tsd/cr.store.txnwaitq…] heartbeat failed on epoch increment
W200307 22:50:49.449880 23194847 storage/node_liveness.go:523  [n1,s1,r253504/7:/System{/tsd/cr.s…-tse}] slow heartbeat took 2.8s
E200307 22:50:49.449896 23194847 storage/replica_range_lease.go:292  [n1,s1,r253504/7:/System{/tsd/cr.s…-tse}] heartbeat failed on epoch increment
W200307 22:50:49.450118 23194050 storage/node_liveness.go:523  [n1,s1,r115946/2:/Table/57/4/2018-09-14T16:1…] slow heartbeat took 2.8s
E200307 22:50:49.450130 23194050 storage/replica_range_lease.go:292  [n1,s1,r115946/2:/Table/57/4/2018-09-14T16:1…] heartbeat failed on epoch increment
I200307 22:50:52.447454 23196053 internal/client/txn.go:618  [n1] async rollback failed: aborted during DistSender.Send: context deadline exceeded
W200307 22:50:53.115332 161 gossip/gossip.go:1501  [n1] first range unavailable; trying remaining resolvers

---- And many more entries like:

W200307 22:50:58.447759 2474 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200307 22:50:58.447781 2474 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
I200307 22:50:58.499064 23193984 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200307 22:50:58.499843 23193984 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200307 22:50:59.499224 23195515 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200307 22:50:59.499941 23195515 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200307 22:51:00.499372 23195288 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200307 22:51:00.500226 23195288 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200307 22:51:01.448041 23195944 internal/client/txn.go:618  [n1] async rollback failed: aborted during DistSender.Send: context deadline exceeded
I200307 22:51:01.539656 23195222 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200307 22:51:01.541843 23195222 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200307 22:51:02.539561 23193906 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200307 22:51:02.540512 23193906 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200307 22:51:02.947910 2474 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s

Environment:

  • CockroachDB version [19.1]
  • Server OS: [Linux, centos7.2]
  • Client app [JDBC]

Additional context
we can't resolve it, the cluster we restart, but the impact also exist.
first range can't not found, gossip get nothing, rpc err, web can't not fetch any data.


yangxuanjia changed the title from "slow heart make the cluster crash, and can't not recover by restart" to "slow heart make the cluster crash, and can not recover by restart" on Mar 10, 2020

yangxuanjia commented Mar 11, 2020

More logs below.

I200309 08:32:44.252327 1 util/log/clog.go:1199  [config] file created at: 2020/03/09 08:32:44
I200309 08:32:44.252327 1 util/log/clog.go:1199  [config] running on machine: cdb-mysql-1227165616205467648
I200309 08:32:44.252327 1 util/log/clog.go:1199  [config] binary: CockroachDB CCL v1.18.0 (x86_64-redhat-linux, built 2020/01/08 07:51:37, go1.11.6)
I200309 08:32:44.252327 1 util/log/clog.go:1199  [config] arguments: [/cockroach/cockroach --advertise-host=11.42.200.223 --store=path=/export/node,attrs=SSD,rocksdb=write_buffer_size=134217728;compression_per_level=kNoCompression:kNoCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression;max_background_flushes=3;max_background_compactions=6 --http-addr 0.0.0.0:80 --locality country=zh,region=zh,datacenter=ht,rack=7-203-A-03,node=11.5.184.72 start --logtostderr --insecure --cache=25% --max-sql-memory=25% --join=cdb.cdb-mysql-kwtest.svc.htdb.n.jd.local:26257]
I200309 08:32:44.252327 1 util/log/clog.go:1199  line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓
I200309 08:32:44.252313 1 cli/start.go:1034  logging to directory /export/node/logs
W200309 08:32:44.252647 1 cli/start.go:1069  RUNNING IN INSECURE MODE!

- Your cluster is open for any client that can access <all your IP addresses>.
- Any user, even root, can log in without providing a password.
- Any user, connecting as root, can read or write any data in your cluster.
- There is no network encryption nor authentication, and thus no confidentiality.

Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v1.18/secure-a-cluster.html
I200309 08:32:44.253339 1 cli/start.go:1083  CockroachDB CCL v1.18.0 (x86_64-redhat-linux, built 2020/01/08 07:51:37, go1.11.6)
I200309 08:32:44.357954 1 server/config.go:386  system total memory: 32 GiB
I200309 08:32:44.358023 1 server/config.go:388  server configuration:
max offset             500000000
cache size             8.0 GiB
SQL memory pool size   8.0 GiB
scan interval          10m0s
scan min idle time     10ms
scan max idle time     1s
event log enabled      true
I200309 08:32:44.358048 1 cli/start.go:930  process identity: uid 0 euid 0 gid 0 egid 0
I200309 08:32:44.358062 1 cli/start.go:554  starting cockroach node
I200309 08:32:44.358750 24 storage/engine/rocksdb.go:613  opening rocksdb instance at "/export/node/cockroach-temp731530701"
I200309 08:32:44.376813 24 server/server.go:879  [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I200309 08:32:44.377290 24 storage/engine/rocksdb.go:613  opening rocksdb instance at "/export/node"
I200309 08:32:44.981852 24 server/config.go:494  [n?] 1 storage engine initialized
I200309 08:32:44.981873 24 server/config.go:497  [n?] RocksDB cache size: 8.0 GiB
I200309 08:32:44.981881 24 server/config.go:497  [n?] store 0: RocksDB, max size 0 B, max open file limit 650350
W200309 08:32:44.984044 24 gossip/gossip.go:1496  [n?] no incoming or outgoing connections
I200309 08:32:44.984881 24 gossip/gossip.go:392  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"11.42.200.223:26257" > attrs:<> locality:<tiers:<key:"country" value:"zh" > tiers:<key:"region" value:"zh" > tiers:<key:"datacenter" value:"ht" > tiers:<key:"rack" value:"7-203-A-03" > tiers:<key:"node" value:"11.5.184.72" > > ServerVersion:<major_val:19 minor_val:1 patch:0 unstable:0 > build_tag:"v1.18.0" started_at:1583742764984625439 
I200309 08:32:44.988447 180 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:32:44.991228 180 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to 4 (11.42.200.224:26257)
I200309 08:32:44.992381 30 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:45.985321 3239 gossip/client.go:128  [n1] started gossip client to 11.42.223.117:26257
I200309 08:32:46.984660 9532 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:46.985136 9532 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:32:47.985836 15608 gossip/client.go:128  [n1] started gossip client to 11.42.200.225:26257
I200309 08:32:48.985095 21021 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:32:48.985666 21021 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200309 08:32:49.985193 27217 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:49.985707 27217 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:32:50.985278 244 storage/closedts/provider/provider.go:156  [ct-closer] unable to move closed timestamp forward: node not in the liveness table
I200309 08:32:50.985282 31426 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:32:50.985852 31426 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200309 08:32:51.985446 39032 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:51.985863 39032 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:32:52.985580 44890 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:32:52.986090 44890 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200309 08:32:53.985748 50505 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:53.986231 50505 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:32:54.986601 1752 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:32:54.989007 1752 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200309 08:32:55.986342 62150 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:55.986795 62150 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:32:56.986497 67688 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:32:56.987015 67688 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200309 08:32:57.988908 25919 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:57.990425 25919 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:32:58.988638 79305 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:32:58.989215 79305 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200309 08:32:59.988751 85136 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:32:59.989180 85136 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:33:00.988936 91393 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:00.989921 91393 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200309 08:33:01.989128 97193 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:01.990033 97193 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:33:02.989170 102983 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:02.989765 102983 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:03.272184 105253 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:03.272264 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:03.329139 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:03.424963 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:03.523941 24 server/node.go:461  [n1] initialized store [n1,s1]: disk (capacity=504 GiB, available=224 GiB, used=254 GiB, logicalBytes=851 GiB), ranges=56655, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=7776092.00 p25=10150537.00 p50=14612976.00 p75=20068196.00 p90=26103085.00 pMax=66847621.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}
I200309 08:33:03.524140 24 storage/stores.go:244  [n1] read 4 node addresses from persistent storage
I200309 08:33:03.524402 24 server/node.go:699  [n1] connecting to gossip network to verify cluster ID...
I200309 08:33:03.524428 24 server/node.go:719  [n1] node connected via gossip and verified as part of cluster "507fa2c5-7e92-4e6b-af52-ba0048a0dd84"
I200309 08:33:03.524519 24 server/node.go:542  [n1] node=1: started with [SSD=/export/node] engine(s) and attributes []
I200309 08:33:03.527532 24 server/server.go:1602  [n1] starting http server at [::]:80 (use: 0.0.0.0:80)
I200309 08:33:03.527546 24 server/server.go:1604  [n1] starting grpc/postgres server at [::]:26257
I200309 08:33:03.527558 24 server/server.go:1605  [n1] advertising CockroachDB node at 11.42.200.223:26257
W200309 08:33:03.527584 24 jobs/registry.go:343  [n1] unable to get node liveness: node not in the liveness table
W200309 08:33:03.596392 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:03.989371 106847 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:03.990036 106847 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:04.028870 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:04.890455 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:04.989461 73374 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:04.990060 73374 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200309 08:33:05.885475 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:05.989941 73573 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:05.990464 73573 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:06.786053 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:06.989925 41800 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:06.990582 41800 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:07.667438 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:07.990084 106809 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
W200309 08:33:08.028080 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:08.028104 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:08.272151 104751 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:08.528074 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
W200309 08:33:08.704938 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:08.990168 1773 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:08.990729 1773 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:09.679434 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:09.990317 19818 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:09.990780 19818 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:10.811419 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:10.990443 106368 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:10.990942 106368 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200309 08:33:11.028341 106883 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:33:11.737218 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:11.990607 33990 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:11.991173 33990 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:12.528594 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:12.528619 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:12.619363 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:12.990742 19821 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:12.991706 19821 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:13.272201 105082 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:13.528871 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:13.532688 105266 server/status/runtime.go:500  [n1] runtime stats: 2.7 GiB RSS, 691 goroutines, 699 MiB/67 MiB/888 MiB GO alloc/idle/total, 1.8 GiB/2.1 GiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (24x), 46 MiB/28 MiB (r/w)net
W200309 08:33:13.663642 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:13.783646 105261 server/node.go:799  [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available
I200309 08:33:13.990935 105286 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:13.991376 105286 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:14.629212 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:14.984165 33952 cli/start.go:544  The server appears to be unable to contact the other nodes in the cluster. Please try:

- starting the other nodes, if you haven't already;
- double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
- running the 'cockroach init' command if you are trying to initialize a new cluster.

If problems persist, please see https://www.cockroachlabs.com/docs/v1.18/cluster-setup-troubleshooting.html.
I200309 08:33:14.991017 5149 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:14.991533 5149 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:15.525361 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:15.528945 106900 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
I200309 08:33:15.991171 9358 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:15.991852 9358 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:16.594235 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:16.991270 55663 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:16.992020 55663 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:17.029292 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:17.029333 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:17.462027 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:17.991364 9361 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:17.991835 9361 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:18.272200 105088 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:18.348504 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:18.529302 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:18.991538 105291 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:18.992274 105291 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:19.248858 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:19.991645 94789 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:19.992096 94789 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:33:20.029593 55666 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:33:20.396627 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:20.991804 94779 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:20.992336 94779 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:21.377020 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:21.529743 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:21.529765 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
I200309 08:33:21.992028 94782 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:21.992675 94782 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:22.249053 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:22.992157 1738 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:22.992824 1738 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200309 08:33:23.272210 104758 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:23.367768 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:23.530002 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:23.536133 105266 server/status/runtime.go:500  [n1] runtime stats: 2.9 GiB RSS, 691 goroutines, 671 MiB/188 MiB/1023 MiB GO alloc/idle/total, 1.8 GiB/2.1 GiB CGO alloc/total, 4815.7 CGO/sec, 218.2/35.9 %(u/s)time, 0.0 %gc (2x), 7.3 MiB/6.8 MiB (r/w)net
W200309 08:33:23.796280 105261 server/node.go:799  [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available
I200309 08:33:23.992290 73525 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:23.992869 73525 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:24.422224 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:24.530203 106948 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
I200309 08:33:24.992418 106797 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:24.993336 106797 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:25.549749 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:25.992627 41804 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:25.993243 41804 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:26.030116 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:26.030137 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:26.403150 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:26.992713 73528 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:26.993199 73528 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200309 08:33:27.508661 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:27.992872 9378 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:27.993416 9378 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:28.272214 104991 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:28.499238 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:28.530462 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:28.992993 106920 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:28.993766 106920 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
I200309 08:33:29.030355 14654 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:33:29.529864 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:29.993207 5152 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:29.993750 5152 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:30.530562 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:30.530587 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:30.575096 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:30.993357 1778 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:30.994237 1778 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:31.551296 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:31.993436 107043 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:31.993919 107043 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:32.701087 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:32.993589 106821 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:32.994327 106821 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:33.272243 105141 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:33.530898 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:33.530997 9409 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
I200309 08:33:33.539454 105266 server/status/runtime.go:500  [n1] runtime stats: 2.9 GiB RSS, 692 goroutines, 869 MiB/23 MiB/1.0 GiB GO alloc/idle/total, 1.8 GiB/2.1 GiB CGO alloc/total, 2770.9 CGO/sec, 188.7/27.6 %(u/s)time, 0.0 %gc (1x), 3.2 MiB/5.0 MiB (r/w)net
W200309 08:33:33.682150 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:33.813361 105261 server/node.go:799  [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available
I200309 08:33:33.993736 107107 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:33.994301 107107 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:34.669142 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:34.993822 41743 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:34.994527 41743 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:35.030967 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:35.030988 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:35.704782 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:35.993978 106869 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:35.994583 106869 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:36.598204 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:36.994155 56437 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:36.994857 56437 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:37.545387 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:37.994312 55874 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:37.995124 55874 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:33:38.031514 107139 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:33:38.272255 105188 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:38.527294 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:38.531336 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:38.994400 106812 gossip/client.go:128  [n1] started gossip client to 11.42.200.225:26257
W200309 08:33:39.531480 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:39.531526 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:39.560820 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:39.994544 107046 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:39.995115 107046 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:33:40.575716 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:40.994650 106872 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:40.995147 106872 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:41.666372 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:41.994883 14706 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:41.995561 14706 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
I200309 08:33:42.531902 14732 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:33:42.762582 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:42.996523 106626 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:42.997029 106626 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:43.272254 105186 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:43.531787 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:43.543093 105266 server/status/runtime.go:500  [n1] runtime stats: 3.2 GiB RSS, 692 goroutines, 553 MiB/457 MiB/1.3 GiB GO alloc/idle/total, 1.8 GiB/2.1 GiB CGO alloc/total, 4388.6 CGO/sec, 283.6/39.9 %(u/s)time, 0.0 %gc (4x), 14 MiB/4.1 MiB (r/w)net
W200309 08:33:43.665557 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:43.760970 105261 server/node.go:799  [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available
I200309 08:33:43.996672 106964 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:43.997579 106964 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200309 08:33:44.031915 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:44.031934 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:44.643472 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:44.996823 106997 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:44.997327 106997 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:45.734063 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:45.996997 56442 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:45.997583 56442 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:46.819999 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:46.997167 106827 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:46.997637 106827 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:33:47.032292 106934 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:33:47.796625 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:47.997253 94785 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:47.997950 94785 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:48.272257 105196 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:48.532230 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
W200309 08:33:48.532345 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:48.532365 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:48.837590 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:48.997416 1744 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:49.016373 1744 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:49.912832 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:49.997518 107238 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:49.998362 107238 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:50.983093 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:50.986200 244 storage/closedts/provider/provider.go:156  [ct-closer] unable to move closed timestamp forward: node not in the liveness table
I200309 08:33:50.997621 107095 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:50.998100 107095 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:33:51.532680 41808 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
I200309 08:33:51.997896 1791 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:51.998933 1791 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200309 08:33:52.105668 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:52.997996 106938 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:52.998507 106938 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:53.032805 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:53.032835 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:33:53.250137 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:53.272260 104823 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:53.532679 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:53.546717 105266 server/status/runtime.go:500  [n1] runtime stats: 3.2 GiB RSS, 693 goroutines, 702 MiB/390 MiB/1.3 GiB GO alloc/idle/total, 1.8 GiB/2.1 GiB CGO alloc/total, 1390.2 CGO/sec, 181.3/26.3 %(u/s)time, 0.0 %gc (1x), 9.3 MiB/5.1 MiB (r/w)net
W200309 08:33:53.788404 105261 server/node.go:799  [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available
I200309 08:33:53.998180 106489 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:53.998915 106489 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200309 08:33:54.346834 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:54.998326 107050 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
W200309 08:33:55.432125 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:55.998494 107172 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:55.999218 107172 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200309 08:33:56.033117 107080 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:33:56.308047 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:56.998583 107053 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:56.999014 107053 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:57.197046 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:57.533315 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:33:57.533340 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
I200309 08:33:57.998734 107347 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:57.999727 107347 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:33:58.267316 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:33:58.272269 105073 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:33:58.533161 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:33:58.998930 14715 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:33:58.999500 14715 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:33:59.169056 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:33:59.999011 107333 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:33:59.999884 107333 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200309 08:34:00.047726 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:34:00.533699 106876 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
I200309 08:34:00.999153 107113 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:34:00.999715 107113 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:34:01.010818 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:34:01.999298 106986 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:34:02.000026 106986 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:34:02.033702 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:34:02.033721 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:34:02.083317 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:34:02.265175 105259 storage/store_rebalancer.go:227  [n1,s1,store-rebalancer] StorePool missing descriptor for local store
I200309 08:34:02.999477 73535 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:34:03.000238 73535 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:34:03.070977 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:34:03.272240 105254 storage/replica_range_lease.go:982  [n1,s1,r2/1:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
W200309 08:34:03.272321 104792 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:34:03.272388 105252 storage/replica_range_lease.go:982  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 1m0s attempting to acquire lease
W200309 08:34:03.273526 1850 storage/replica_write.go:199  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 60.00s for proposing command RequestLease [/Min,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

	http://yourhost:80/#/reports/range/1

and the following Raft status: {"id":"1","term":662,"vote":"0","commit":1494161,"lead":"0","raftState":"StatePreCandidate","applied":1494161,"progress":{},"leadtransferee":"0"}
W200309 08:34:03.273565 104687 storage/replica_write.go:199  [n1,s1,r2/1:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command RequestLease [/System/NodeLiveness,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

	http://yourhost:80/#/reports/range/2

and the following Raft status: {"id":"1","term":638,"vote":"0","commit":2910309,"lead":"0","raftState":"StatePreCandidate","applied":2910309,"progress":{},"leadtransferee":"0"}
E200309 08:34:03.291633 105369 storage/queue.go:829  [n1,replicaGC,s1,r223930/1:/Table/57/2/1912{1981…-3586…}] aborted in distSender: context deadline exceeded
I200309 08:34:03.524832 105262 gossip/gossip.go:557  [n1] gossip status (stalled, 5 nodes)
gossip client (3/3 cur/max conns)
  4: 11.42.200.224:26257 (1m19s: infos 515/463 sent/received, bytes 98693B/111198B sent/received)
  2: 11.42.223.117:26257 (1m18s: infos 504/467 sent/received, bytes 96811B/110648B sent/received)
  3: 11.42.200.225:26257 (25s: infos 169/149 sent/received, bytes 33502B/36702B sent/received)
gossip server (1/3 cur/max conns, infos 2894/2292 sent/received, bytes 623408B/537172B sent/received)
  5: 11.43.107.41:26257 (1m16s)
gossip connectivity
  n1 -> n2; n1 -> n3; n1 -> n4; n2 -> n3; n2 -> n4; n2 -> n5; n3 -> n4; n3 -> n5; n4 -> n5; n5 -> n1;
W200309 08:34:03.528160 106846 storage/replica_range_lease.go:982  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 1m0s attempting to acquire lease
W200309 08:34:03.532197 106806 storage/replica_range_lease.go:982  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 1m0s attempting to acquire lease
W200309 08:34:03.533576 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:34:03.550162 105266 server/status/runtime.go:500  [n1] runtime stats: 3.2 GiB RSS, 692 goroutines, 561 MiB/436 MiB/1.3 GiB GO alloc/idle/total, 1.8 GiB/2.1 GiB CGO alloc/total, 1946.1 CGO/sec, 204.3/25.2 %(u/s)time, 0.0 %gc (2x), 7.9 MiB/3.2 MiB (r/w)net
W200309 08:34:03.782422 106792 storage/replica_range_lease.go:982  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 1m0s attempting to acquire lease
W200309 08:34:03.792337 105261 server/node.go:799  [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available
I200309 08:34:03.999676 107269 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:34:04.000491 107269 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:34:04.055728 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:34:04.555123 41819 storage/replica_range_lease.go:982  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 1m0s attempting to acquire lease
I200309 08:34:04.999761 107125 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:34:05.000273 107125 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
I200309 08:34:05.034013 106968 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
W200309 08:34:05.128343 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:34:06.000012 19825 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:34:06.000712 19825 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:34:06.147577 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:34:06.534120 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:34:06.534148 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
I200309 08:34:07.000088 106913 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:34:07.000829 106913 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:34:07.156467 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:34:08.000269 107021 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:34:08.001108 107021 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200309 08:34:08.257155 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W200309 08:34:08.272325 105008 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200309 08:34:08.534040 106842 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200309 08:34:09.000374 107443 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:34:09.000878 107443 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:34:09.247534 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:34:09.534381 106879 internal/client/txn.go:618  [n1] async rollback failed: aborted in distSender: context deadline exceeded
I200309 08:34:10.000949 56503 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200309 08:34:10.002585 56503 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200309 08:34:10.152092 105253 storage/store.go:1525  [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I200309 08:34:11.000865 106881 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200309 08:34:11.001287 106881 gossip/client.go:133  [n1] closing client to n4 (11.42.200.224:26257): rpc error: code = Unknown desc = duplicate connection from node at 11.42.200.223:26257
W200309 08:34:11.034506 106841 storage/node_liveness.go:523  [n1,hb] slow heartbeat took 4.5s
W200309 08:34:11.034537 106841 storage/node_liveness.go:463  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
W200309 08:34:11.191540 104688 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 3: EOF:
W200309 08:34:11.191596 15649 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {11.42.200.225:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...

@yangxuanjia
Author

hello?

@knz knz added the O-community Originated from the community label Mar 12, 2020
@ajwerner
Contributor

I'm sorry to hear that you're having a bad experience. Slow heartbeats and node liveness failures can occur when a cluster is overloaded. In 19.1 and earlier, CPU overload could result in this sort of node liveness failure. With #39172 we separated network traffic for these critical system ranges which seems to mitigate the problem somewhat. IO saturation can also cause problems.

It is generally bad practice to run a database at the point of resource saturation, though I'll admit that more graceful degradation would be great. My concrete recommendation would be to lower the data rate until things are happy, and then experiment to figure out the capacity of the cluster for your workload. In the face of overload, cockroach provides the ability to scale out.

We would love to continue to increase the robustness of the cluster to overload. Understanding your workload will help us to learn how to do that.

Can you provide more information about the hardware on which this cluster is deployed, the resource utilization on the servers while this is running, and the data rate and shape that you're pushing? What sort of storage device are you using? Would you be willing to collect screenshots of the admin UI dashboards while problems are occurring (Storage, Hardware, Runtime, Distributed, Replication)?
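
If any node can still serve requests, here is a minimal command-line sketch for gathering roughly the same information (assuming the standard cockroach CLI in insecure mode and common OS tools such as iostat/top; the host below is just one of the node addresses from this issue):

cockroach node status --insecure --host=11.42.200.223:26257              # per-node availability, range counts, and store usage
cockroach debug zip ./debug.zip --insecure --host=11.42.200.223:26257    # bundle logs, settings, and internal metrics for offline inspection
iostat -x 5                                                              # disk utilization and latency on each server
top                                                                      # CPU saturation on each server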

@yangxuanjia
Author

@ajwerner
Thank you for your reply. I now understand the cause you describe, but the most important and urgent question is how I can recover the cluster.
When I restart the cluster, the logs show that the node liveness heartbeat needs to reach the other nodes, which first requires knowing the Raft lease holder; but the lease holder for the system ranges is always unknown, so node liveness keeps expiring and DistSender keeps returning context deadline exceeded.
I understand the problem itself; is there any way to resolve it and let the cluster recover? All of the users' data is in this cluster, and right now I can do nothing.
Because the cluster cannot recover, I cannot connect to it with SQL and the web UI shows no data; worst of all, the user data is stuck in the cluster and I cannot migrate it to another cluster.
I will keep following this problem long term, because it leaves user data unsafe and at risk of loss.
Since the web UI cannot fetch any data from the cluster, I cannot capture the metric snapshots you asked for.

@ajwerner
Contributor

Sorry, I think I missed something earlier. There's something weird going on in those logs with "duplicate addresses". Have you swapped the nodes' storage or IP addresses in any way? Are the nodes being passed the correct join flags?
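
For reference, here is a minimal sketch of a consistent start invocation, using the same flags that already appear in this issue (the --join list contains the five node addresses from this cluster; <node-ip> is a placeholder for the node's own reachable address, which every node should advertise):

/cockroach/cockroach start --insecure --listen-addr=<node-ip>:26257 --advertise-host=<node-ip> --store=path=/export/node --join=11.42.200.223:26257,11.42.223.117:26257,11.42.200.224:26257,11.42.200.225:26257,11.43.107.41:26257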

@yangxuanjia
Author

yangxuanjia commented Mar 13, 2020

@ajwerner
I captured some logs when restarting one node of the cluster; maybe you can spot the problem in them.

I200313 04:49:23.151448 1 util/log/clog.go:1199  [config] file created at: 2020/03/13 04:49:23
I200313 04:49:23.151448 1 util/log/clog.go:1199  [config] running on machine: cdb-mysql-1227165616205467648
I200313 04:49:23.151448 1 util/log/clog.go:1199  [config] binary: CockroachDB CCL v1.18.0-26-g6bf2783-dirty (x86_64-redhat-linux, built 2020/03/12 13:29:06, go1.11.6)
I200313 04:49:23.151448 1 util/log/clog.go:1199  [config] arguments: [/cockroach/cockroach --advertise-host=11.42.200.223 --listen-addr=11.42.200.223:26257 --store=path=/export/node,attrs=SSD,rocksdb=write_buffer_size=134217728;compression_per_level=kNoCompression:kNoCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression;max_background_flushes=3;max_background_compactions=6 --http-addr 0.0.0.0:80 --locality country=zh,region=zh,datacenter=ht,rack=7-203-A-03,node=11.5.184.72 start --logtostderr --insecure --cache=25% --max-sql-memory=25% --join=cdb.cdb-mysql-kwtest.svc.htdb.n.jd.local:26257]
I200313 04:49:23.151448 1 util/log/clog.go:1199  line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓
I200313 04:49:23.151432 1 cli/start.go:1034  logging to directory /export/node/logs
W200313 04:49:23.151714 1 cli/start.go:1069  RUNNING IN INSECURE MODE!

- Your cluster is open for any client that can access 11.42.200.223.
- Any user, even root, can log in without providing a password.
- Any user, connecting as root, can read or write any data in your cluster.
- There is no network encryption nor authentication, and thus no confidentiality.

Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v1.18/secure-a-cluster.html
I200313 04:49:23.153817 1 cli/start.go:1083  CockroachDB CCL v1.18.0-26-g6bf2783-dirty (x86_64-redhat-linux, built 2020/03/12 13:29:06, go1.11.6)
I200313 04:49:23.258084 1 server/config.go:386  system total memory: 16 GiB
I200313 04:49:23.258145 1 server/config.go:388  server configuration:
max offset             500000000
cache size             4.0 GiB
SQL memory pool size   4.0 GiB
scan interval          10m0s
scan min idle time     10ms
scan max idle time     1s
event log enabled      true
I200313 04:49:23.258168 1 cli/start.go:930  process identity: uid 0 euid 0 gid 0 egid 0
I200313 04:49:23.258179 1 cli/start.go:554  starting cockroach node
I200313 04:49:23.258887 112 storage/engine/rocksdb.go:613  opening rocksdb instance at "/export/node/cockroach-temp110940235"
I200313 04:49:23.278495 112 server/server.go:879  [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I200313 04:49:23.279172 112 storage/engine/rocksdb.go:613  opening rocksdb instance at "/export/node"
I200313 04:49:26.889377 112 server/config.go:494  [n?] 1 storage engine initialized
I200313 04:49:26.889405 112 server/config.go:497  [n?] RocksDB cache size: 4.0 GiB
I200313 04:49:26.889413 112 server/config.go:497  [n?] store 0: RocksDB, max size 0 B, max open file limit 650350
W200313 04:49:26.895443 112 gossip/gossip.go:1496  [n?] no incoming or outgoing connections
I200313 04:49:26.895966 112 server/node.go:435  [n?] [DEBUG] n.storeCfg.Gossip.NodeID.Set(ctx, n.Descriptor.NodeID)
I200313 04:49:26.895984 112 server/node.go:438  [n1] [DEBUG] n.storeCfg.Gossip.SetNodeDescriptor(&n.Descriptor)
I200313 04:49:26.896252 112 gossip/gossip.go:392  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"11.42.200.223:26257" > attrs:<> locality:<tiers:<key:"country" value:"zh" > tiers:<key:"region" value:"zh" > tiers:<key:"datacenter" value:"ht" > tiers:<key:"rack" value:"7-203-A-03" > tiers:<key:"node" value:"11.5.184.72" > > ServerVersion:<major_val:19 minor_val:1 patch:0 unstable:0 > build_tag:"v1.18.0-26-g6bf2783-dirty" started_at:1584074966895917040
I200313 04:49:26.897787 112 server/node.go:443  [n1] [DEBUG] n.storeCfg.ClosedTimestamp.Start(n.Descriptor.NodeID)
I200313 04:49:26.898144 112 server/node.go:447  [n1] [DEBUG] before initStores
I200313 04:49:26.900413 153 gossip/client.go:128  [n1] started gossip client to 11.42.200.224:26257
I200313 04:49:26.904368 112 server/node.go:453  [n1] [DEBUG] after initStores
I200313 04:49:26.904702 112 server/node.go:456  [n1] [DEBUG] before store start: %!v(PANIC=runtime error: invalid memory address or nil pointer dereference)
I200313 04:49:27.896964 5357 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:27.897521 5357 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
I200313 04:49:28.897062 10944 gossip/client.go:128  [n1] started gossip client to 11.42.223.117:26257
I200313 04:49:31.098597 22913 gossip/client.go:128  [n1] started gossip client to 11.42.200.225:26257
W200313 04:49:32.898299 259 storage/closedts/provider/provider.go:156  [ct-closer] unable to move closed timestamp forward: node not in the liveness table
I200313 04:49:33.151727 33982 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:33.152264 33982 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
I200313 04:49:34.151847 39288 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:34.152351 39288 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200313 04:49:35.152032 44950 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:35.152484 44950 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
I200313 04:49:36.152134 50527 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:36.152814 50527 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200313 04:49:37.152251 55874 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:37.152924 55874 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200313 04:49:38.152554 61279 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:38.153247 61279 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
I200313 04:49:39.152561 66869 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:39.153092 66869 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200313 04:49:40.152858 72413 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:40.153535 72413 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
I200313 04:49:46.828357 112 server/node.go:460  [n1] [DEBUG] after store start: [n1,s1]
W200313 04:49:46.828406 109522 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
I200313 04:49:46.828434 108927 storage/replica.go:503  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] before maybeInitializeRaftGroup
I200313 04:49:46.828472 108824 storage/replica.go:503  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] before maybeInitializeRaftGroup
W200313 04:49:46.828489 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:49:46.829119 108824 storage/replica.go:505  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] after maybeInitializeRaftGroup
I200313 04:49:46.829165 108824 storage/replica.go:528  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, replica, useRaft:RequestLease [/Min,/Min)
I200313 04:49:46.829183 108824 storage/replica_write.go:69  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 11111111111
I200313 04:49:46.829184 108927 storage/replica.go:505  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] after maybeInitializeRaftGroup
I200313 04:49:46.829201 108824 storage/replica_write.go:73  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 22222222222222
I200313 04:49:46.829209 108927 storage/replica.go:528  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, replica, useRaft:RequestLease [/System/NodeLiveness,/Min)
I200313 04:49:46.829226 108824 storage/replica_write.go:100  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 333333333333
I200313 04:49:46.829235 108824 storage/replica_write.go:105  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 44444444444
I200313 04:49:46.829249 108824 storage/replica_write.go:116  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 6666666666666
I200313 04:49:46.829275 108824 storage/replica_write.go:126  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 77777777777
I200313 04:49:46.829287 108824 storage/replica_write.go:149  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 88888888888888
I200313 04:49:46.829290 108927 storage/replica_write.go:69  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 11111111111
I200313 04:49:46.829303 108927 storage/replica_write.go:73  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 22222222222222
I200313 04:49:46.829317 108927 storage/replica_write.go:100  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 333333333333
I200313 04:49:46.829343 108927 storage/replica_write.go:105  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 44444444444
I200313 04:49:46.829352 108927 storage/replica_write.go:116  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 6666666666666
I200313 04:49:46.829359 108927 storage/replica_write.go:126  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 77777777777
I200313 04:49:46.829367 108927 storage/replica_write.go:149  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 88888888888888
I200313 04:49:46.829633 108927 storage/replica_write.go:165  [n1,s1,r2/1:/System/NodeLiveness{-Max}] [DEBUG] yxj, 999999999999999999
I200313 04:49:46.829639 108824 storage/replica_write.go:165  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, 999999999999999999
I200313 04:49:46.848706 109602 storage/stores.go:182  [n1] [DEBUG] before GetStore ba:Scan [/Meta1/Table/57/4/2016-05-12T11:44:08Z/2337203/314043721/0/NULL/NULL,/Meta2/"")
I200313 04:49:46.848753 109602 storage/stores.go:187  [n1] [DEBUG] after with error GetStore ba:Scan [/Meta1/Table/57/4/2016-05-12T11:44:08Z/2337203/314043721/0/NULL/NULL,/Meta2/"") err:store 1 was not found
W200313 04:49:46.880600 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
W200313 04:49:46.994733 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:49:47.105722 112 server/node.go:469  [n1] initialized store [n1,s1]: disk (capacity=886 GiB, available=335 GiB, used=254 GiB, logicalBytes=850 GiB), ranges=56647, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=7773768.00 p25=10149660.00 p50=14609122.00 p75=20061550.00 p90=26090826.00 pMax=66847621.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}
I200313 04:49:47.106325 112 storage/stores.go:259  [n1] read 4 node addresses from persistent storage
I200313 04:49:47.106617 112 server/node.go:707  [n1] connecting to gossip network to verify cluster ID...
I200313 04:49:47.106654 112 server/node.go:727  [n1] node connected via gossip and verified as part of cluster "507fa2c5-7e92-4e6b-af52-ba0048a0dd84"
I200313 04:49:47.106723 112 server/node.go:550  [n1] node=1: started with [SSD=/export/node] engine(s) and attributes []
I200313 04:49:47.109324 112 server/server.go:1602  [n1] starting http server at [::]:80 (use: 0.0.0.0:80)
I200313 04:49:47.109342 112 server/server.go:1604  [n1] starting grpc/postgres server at 11.42.200.223:26257
I200313 04:49:47.109354 112 server/server.go:1605  [n1] advertising CockroachDB node at 11.42.200.223:26257
W200313 04:49:47.109388 112 jobs/registry.go:343  [n1] unable to get node liveness: node not in the liveness table
W200313 04:49:47.109479 111031 storage/node_liveness.go:449  StartHeartbeat, nl.mu.nodes: map[]
W200313 04:49:47.109502 111031 storage/node_liveness.go:450  StartHeartbeat, liveness: <nil>
W200313 04:49:47.109517 111031 storage/node_liveness.go:451  StartHeartbeat, incrementEpoch: true
W200313 04:49:47.109526 111031 storage/node_liveness.go:452  StartHeartbeat, err: node not in the liveness table
I200313 04:49:47.109586 111031 storage/node_liveness.go:576  [n1,hb] [DEBUG] before updateLiveness update:{Liveness:{NodeID:1 Epoch:1 Expiration:1584074996.609531423,0 Draining:false Decommissioning:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ignoreCache:false}
I200313 04:49:47.109607 111031 storage/node_liveness.go:787  [n1,hb] [DEBUG] before updateLivenessAttempt
W200313 04:49:47.109620 111031 storage/node_liveness.go:809  updateLivenessAttempt, update: {{1 1 1584074996.609531423,0 false false {} 0} false}, oldliveness: <nil>
I200313 04:49:47.109644 111031 storage/node_liveness.go:817  [n1,hb] [DEBUG] before txn
I200313 04:49:47.109708 111031 storage/node_liveness.go:819  [n1,hb] [DEBUG] in txn
I200313 04:49:47.109746 111031 storage/node_liveness.go:828  [n1,hb] [DEBUG] in txn key:/System/NodeLiveness/1
I200313 04:49:47.109745 111036 storage/stores.go:182  [n1] [DEBUG] before GetStore ba:Scan [/Meta2/System/"system-version/^@",/System/"")
I200313 04:49:47.109822 111036 storage/stores.go:194  [n1] [DEBUG] after GetStore ba:Scan [/Meta2/System/"system-version/^@",/System/"")
I200313 04:49:47.109833 111036 storage/stores.go:195  [n1] [DEBUG] before Send
I200313 04:49:47.109855 111036 storage/store.go:2788  [n1,s1] [DEBUG] before SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.109865 111036 storage/store.go:2793  [n1,s1] [DEBUG] after SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.109890 111036 storage/store.go:2962  [n1,s1] [DEBUG] before repl Send ba:Scan [/Meta2/System/"system-version/^@",/System/"")
I200313 04:49:47.109896 111031 storage/node_liveness.go:844  [n1,hb] [DEBUG] in txn before run b:&{0xc0147ec6c0 [0: /Min=nil ] {0.000000000,0 (n0,s0):? 0 normal <nil> CONSISTENT 0 false false 0 <nil> false false false {} 0} [{0xc019dca448 {} 0} {0xc019dca450 {} 0}] true <nil> <nil> [{1 <nil> [{[] <nil>}] [] {[] [] {} 0} 0 []} {1 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []}] [] [{[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>}] 1}
I200313 04:49:47.109906 111036 storage/replica.go:503  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] before maybeInitializeRaftGroup
I200313 04:49:47.109921 111036 storage/replica.go:505  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] after maybeInitializeRaftGroup
I200313 04:49:47.110000 111036 storage/replica.go:532  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, replica, isReadOnly:Scan [/Meta2/System/"system-version/^@",/System/"")
I200313 04:49:47.110329 110879 storage/stores.go:182  [n1] [DEBUG] before GetStore ba:Scan [/Meta2/System/NodeLiveness/1/NULL,/System/"")
I200313 04:49:47.110342 110879 storage/stores.go:194  [n1] [DEBUG] after GetStore ba:Scan [/Meta2/System/NodeLiveness/1/NULL,/System/"")
I200313 04:49:47.110349 110879 storage/stores.go:195  [n1] [DEBUG] before Send
I200313 04:49:47.110360 110879 storage/store.go:2788  [n1,s1] [DEBUG] before SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.110366 110879 storage/store.go:2793  [n1,s1] [DEBUG] after SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.110457 110879 storage/store.go:2962  [n1,s1] [DEBUG] before repl Send ba:Scan [/Meta2/System/NodeLiveness/1/NULL,/System/"")
I200313 04:49:47.110468 110879 storage/replica.go:503  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] before maybeInitializeRaftGroup
I200313 04:49:47.110473 110879 storage/replica.go:505  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] after maybeInitializeRaftGroup
I200313 04:49:47.110486 110879 storage/replica.go:532  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, replica, isReadOnly:Scan [/Meta2/System/NodeLiveness/1/NULL,/System/"")
I200313 04:49:47.133697 329 storage/stores.go:182  [n1] [DEBUG] before GetStore ba:Scan [/Meta2/System/tsd/cr.node.build.timestamp/1^@/10s/2020-03-13T04:00:00Z,/System/"")
I200313 04:49:47.133730 329 storage/stores.go:194  [n1] [DEBUG] after GetStore ba:Scan [/Meta2/System/tsd/cr.node.build.timestamp/1^@/10s/2020-03-13T04:00:00Z,/System/"")
I200313 04:49:47.133741 329 storage/stores.go:195  [n1] [DEBUG] before Send
I200313 04:49:47.133759 329 storage/store.go:2788  [n1,s1] [DEBUG] before SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.133770 329 storage/store.go:2793  [n1,s1] [DEBUG] after SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.133881 329 storage/store.go:2962  [n1,s1] [DEBUG] before repl Send ba:Scan [/Meta2/System/tsd/cr.node.build.timestamp/1^@/10s/2020-03-13T04:00:00Z,/System/"")
I200313 04:49:47.133900 329 storage/replica.go:503  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] before maybeInitializeRaftGroup
I200313 04:49:47.133906 329 storage/replica.go:505  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] after maybeInitializeRaftGroup
I200313 04:49:47.133921 329 storage/replica.go:532  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, replica, isReadOnly:Scan [/Meta2/System/tsd/cr.node.build.timestamp/1^@/10s/2020-03-13T04:00:00Z,/System/"")
I200313 04:49:47.153798 10971 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:47.154355 10971 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200313 04:49:47.174396 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:49:47.436164 110961 storage/stores.go:182  [n1] [DEBUG] before GetStore ba:Scan [/Meta2/System/StatusNode/1/NULL,/System/"")
I200313 04:49:47.436187 110961 storage/stores.go:194  [n1] [DEBUG] after GetStore ba:Scan [/Meta2/System/StatusNode/1/NULL,/System/"")
I200313 04:49:47.436192 110961 storage/stores.go:195  [n1] [DEBUG] before Send
I200313 04:49:47.436204 110961 storage/store.go:2788  [n1,s1] [DEBUG] before SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.436211 110961 storage/store.go:2793  [n1,s1] [DEBUG] after SetActiveTimestamp s:[n1,s1]
I200313 04:49:47.438793 110961 storage/store.go:2962  [n1,s1] [DEBUG] before repl Send ba:Scan [/Meta2/System/StatusNode/1/NULL,/System/"")
I200313 04:49:47.438809 110961 storage/replica.go:503  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] before maybeInitializeRaftGroup
I200313 04:49:47.438816 110961 storage/replica.go:505  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] after maybeInitializeRaftGroup
I200313 04:49:47.438834 110961 storage/replica.go:532  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, replica, isReadOnly:Scan [/Meta2/System/StatusNode/1/NULL,/System/"")
W200313 04:49:47.544624 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:49:48.112204 111046 storage/stores.go:182  [n1] [DEBUG] before GetStore ba:Scan [/Meta2/Table/3/1/NULL,/System/"")
I200313 04:49:48.112228 111046 storage/stores.go:194  [n1] [DEBUG] after GetStore ba:Scan [/Meta2/Table/3/1/NULL,/System/"")
I200313 04:49:48.112233 111046 storage/stores.go:195  [n1] [DEBUG] before Send
I200313 04:49:48.112246 111046 storage/store.go:2788  [n1,s1] [DEBUG] before SetActiveTimestamp s:[n1,s1]
I200313 04:49:48.112255 111046 storage/store.go:2793  [n1,s1] [DEBUG] after SetActiveTimestamp s:[n1,s1]
I200313 04:49:48.112331 111046 storage/store.go:2962  [n1,s1] [DEBUG] before repl Send ba:Scan [/Meta2/Table/3/1/NULL,/System/"")
I200313 04:49:48.153922 331 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:48.154604 331 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200313 04:49:48.324751 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:49:49.154018 43894 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:49.154536 43894 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200313 04:49:49.424597 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:49:56.155101 111156 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:56.155796 111156 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n4 (11.42.200.224:26257); already have active connection, skipping
W200313 04:49:56.446886 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
W200313 04:49:56.828341 109398 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200313 04:49:56.895622 111092 cli/start.go:544  The server appears to be unable to contact the other nodes in the cluster. Please try:

- starting the other nodes, if you haven't already;
- double-checking that the '--join' and '--listen'/'--advertise' flags are set up correctly;
- running the 'cockroach init' command if you are trying to initialize a new cluster.

If problems persist, please see https://www.cockroachlabs.com/docs/v1.18/cluster-setup-troubleshooting.html.
W200313 04:49:57.110384 111032 jobs/registry.go:343  unable to get node liveness: node not in the liveness table
I200313 04:49:57.113413 110608 server/status/runtime.go:500  [n1] runtime stats: 2.8 GiB RSS, 691 goroutines, 425 MiB/229 MiB/888 MiB GO alloc/idle/total, 1.8 GiB/2.2 GiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (26x), 46 MiB/27 MiB (r/w)net
I200313 04:49:57.155250 110881 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:49:57.155895 110881 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200313 04:49:57.334480 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:50:03.611281 7865 storage/replica.go:547  [n1,s1,r1/1:/{Min-System/NodeL…}] [DEBUG] yxj, replica, br:<nil>, pErr: [NotLeaseHolderError] r1: replica (n1,s1):1 not lease holder; lease holder unknown
I200313 04:50:03.611304 7865 storage/store.go:2968  [n1,s1] [DEBUG] after repl Send with error br:<nil> err:[NotLeaseHolderError] r1: replica (n1,s1):1 not lease holder; lease holder unknown
I200313 04:50:03.611320 7865 storage/stores.go:202  [n1] [DEBUG] after Send br:<nil> pErr:[NotLeaseHolderError] r1: replica (n1,s1):1 not lease holder; lease holder unknown
I200313 04:50:36.614719 111031 storage/node_liveness.go:849  [n1,hb] [DEBUG] after txn with error, err:aborted in distSender: context deadline exceeded
I200313 04:50:36.614727 111031 storage/node_liveness.go:789  [n1,hb] [DEBUG] error updateLivenessAttempt err:aborted in distSender: context deadline exceeded
W200313 04:50:36.614740 111031 storage/node_liveness.go:528  [n1,hb] slow heartbeat took 4.5s
W200313 04:50:36.614748 111031 storage/node_liveness.go:468  [n1,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
I200313 04:50:46.162538 12520 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:50:46.163129 12520 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200313 04:50:46.475640 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
W200313 04:50:46.828395 109110 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200313 04:50:46.828470 109523 storage/replica_range_lease.go:985  [n1,s1,r2/1:/System/NodeLiveness{-Max}] have been waiting 1m0s attempting to acquire lease
W200313 04:50:46.828547 109521 storage/replica_range_lease.go:985  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 1m0s attempting to acquire lease
W200313 04:50:46.829823 108824 storage/replica_write.go:203  [n1,s1,r1/1:/{Min-System/NodeL…}] have been waiting 60.00s for proposing command RequestLease [/Min,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

        http://yourhost:80/#/reports/range/1

and the following Raft status: {"id":"1","term":1579,"vote":"4","commit":1516549,"lead":"0","raftState":"StatePreCandidate","applied":1516549,"progress":{},"leadtransferee":"0"}
W200313 04:50:46.829862 108927 storage/replica_write.go:203  [n1,s1,r2/1:/System/NodeLiveness{-Max}] have been waiting 60.00s for proposing command RequestLease [/System/NodeLiveness,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

        http://yourhost:80/#/reports/range/2

and the following Raft status: {"id":"1","term":1791,"vote":"1","commit":2940622,"lead":"0","raftState":"StatePreCandidate","applied":2940622,"progress":{},"leadtransferee":"0"}
E200313 04:50:46.848521 109601 storage/queue.go:829  [n1,replicaGC,s1,r49624/2:/Table/57/4/2016-05-1{2T…-4T…}] aborted in distSender: context deadline exceeded

@yangxuanjia
Author

And the tail of the log is as follows:

I200313 04:52:29.704424 111522 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:52:29.705025 111522 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200313 04:52:30.024536 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:52:30.704688 112136 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:52:30.705296 112136 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n3 (11.42.200.225:26257); already have active connection, skipping
W200313 04:52:30.921525 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown
I200313 04:52:31.704717 112631 gossip/client.go:128  [n1] started gossip client to 11.43.107.41:26257
I200313 04:52:31.705285 112631 gossip/client.go:133  [n1] closing client to n5 (11.43.107.41:26257): received forward from n5 to n2 (11.42.223.117:26257); already have active connection, skipping
W200313 04:52:31.828601 109078 storage/replica_range_lease.go:506  can't determine lease status due to node liveness error: node not in the liveness table
W200313 04:52:31.997689 109522 storage/store.go:1525  [n1,s1,r6/6:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):6 not lease holder; lease holder unknown

@yangxuanjia
Author

And the start command, as shown by ps, is as follows:

[root@cdb-mysql-1227165616205467648 logs]# ps -ef|grep cock
root     21471     1 99 12:49 ?        00:18:39 /cockroach/cockroach --advertise-host=11.42.200.223 --listen-addr=11.42.200.223:26257 --store=path=/export/node,attrs=SSD,rocksdb=write_buffer_size=134217728;compression_per_level=kNoCompression:kNoCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression;max_background_flushes=3;max_background_compactions=6 --http-addr 0.0.0.0:80 --locality country=zh,region=zh,datacenter=ht,rack=7-203-A-03,node=11.5.184.72 start --logtostderr --insecure --cache=25% --max-sql-memory=25% --join=cdb.cdb-mysql-kwtest.svc.htdb.n.jd.local:26257

@yangxuanjia
Author

I restarted all 5 nodes in the cluster using join IPs instead of DNS, and the problem still occurs.
/cockroach/cockroach start --insecure --advertise-host=11.42.200.223 --listen-addr=11.42.200.223:26257 --http-addr 0.0.0.0:80 --locality country=zh,region=zh,datacenter=ht,rack=7-203-A-03,node=11.5.184.72 --logtostderr --cache=25% --max-sql-memory=25% --join=11.42.200.223:26257,11.43.107.41:26257,11.42.200.224:26257,11.42.200.225:26257,11.42.223.117:26257 --store=path=/export/node,attrs=SSD,rocksdb="write_buffer_size=134217728;compression_per_level=kNoCompression:kNoCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression;max_background_flushes=3;max_background_compactions=6"

I200313 08:16:12.271754 22058 storage/store.go:2962  [n5,s5] [DEBUG] before repl Send ba:Scan [/Meta2/Table/57/1/162425371/1079063421/NULL,/System/"")
I200313 08:16:12.278081 22258 storage/stores.go:182  [n5] [DEBUG] before GetStore ba:Scan [/Meta2/Table/57/3/"jd_aojingshuma"/2019-04-22T11:56:53Z/140120398/1067865363/0/NULL,/System/"")
I200313 08:16:12.278120 22258 storage/stores.go:194  [n5] [DEBUG] after GetStore ba:Scan [/Meta2/Table/57/3/"jd_aojingshuma"/2019-04-22T11:56:53Z/140120398/1067865363/0/NULL,/System/"")
I200313 08:16:12.278129 22258 storage/stores.go:195  [n5] [DEBUG] before Send
I200313 08:16:12.278144 22258 storage/store.go:2788  [n5,s5] [DEBUG] before SetActiveTimestamp s:[n5,s5]
I200313 08:16:12.279952 22258 storage/store.go:2793  [n5,s5] [DEBUG] after SetActiveTimestamp s:[n5,s5]
I200313 08:16:12.281820 22258 storage/store.go:2962  [n5,s5] [DEBUG] before repl Send ba:Scan [/Meta2/Table/57/3/"jd_aojingshuma"/2019-04-22T11:56:53Z/140120398/1067865363/0/NULL,/System/"")
I200313 08:16:12.499562 22016 internal/client/txn.go:618  [n5] async rollback failed: aborted during DistSender.Send: context deadline exceeded
W200313 08:16:13.434287 17811 storage/replica_range_lease.go:982  [n5,gc,s5,r255341/3:/Table/57/1/20338{4523…-6386…}] have been waiting 1m0s attempting to acquire lease
I200313 08:16:13.434414 17811 storage/replica_range_lease.go:987  [n5,gc,s5,r255341/3:/Table/57/1/20338{4523…-6386…}] slow lease acquisition finished after 1m0.000208326s with error [NotLeaseHolderError] r255341: replica (n5,s5):3 not lease holder; lease holder unknown after 1 attempts
I200313 08:16:13.999603 5537 storage/node_liveness.go:846  [n5,hb] [DEBUG] in txn after run b:&{0xc026bd3d40 [0: /Min=nil ] {0.000000000,0 (n0,s0):? 0 normal <nil> CONSISTENT 0 false false 0 <nil> false false false {} 0} [{0xc0284cb240 {} 0} {0xc0284cb248 {} 0}] true <nil> 0xc020c1c060 [{1 <nil> [{[] <nil>}] [] {[] [] {} 0} 0 []} {1 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []} {0 <nil> [] [] {[] [] {} 0} 0 []}] [] [{[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>} {[] <nil>}] 1} err:result is ambiguous (context done during DistSender.Send: context deadline exceeded)
I200313 08:16:13.999656 5537 storage/node_liveness.go:849  [n5,hb] [DEBUG] after txn with error, err:result is ambiguous (context done during DistSender.Send: context deadline exceeded)
I200313 08:16:13.999675 5537 storage/node_liveness.go:789  [n5,hb] [DEBUG] error updateLivenessAttempt err:storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded)
I200313 08:16:13.999686 5537 storage/node_liveness.go:792  [n5,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded)
W200313 08:16:13.999707 5537 storage/node_liveness.go:528  [n5,hb] slow heartbeat took 4.5s
W200313 08:16:13.999731 5537 storage/node_liveness.go:468  [n5,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
W200313 08:16:13.999769 5537 storage/node_liveness.go:449  StartHeartbeat, nl.mu.nodes: map[5:{5 16 1584087365.500409493,0 false false {} 0} 1:{1 14 1584085930.520997219,0 false false {} 0} 2:{2 19 1584086629.346032774,0 false false {} 0} 3:{3 16 1584087367.514125824,0 false false {} 0} 4:{4 18 1584087366.842963978,0 false false {} 0}]
W200313 08:16:16.953781 78 gossip/gossip.go:1501  [n5] first range unavailable; trying remaining resolvers
I200313 08:16:16.954011 21850 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:16.954592 21850 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
I200313 08:16:16.999981 21790 internal/client/txn.go:618  [n5] async rollback failed: aborted during DistSender.Send: context deadline exceeded
I200313 08:16:17.954209 21904 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:16:17.954768 21904 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
W200313 08:16:18.499902 5537 storage/node_liveness.go:528  [n5,hb] slow heartbeat took 4.5s
W200313 08:16:18.499979 5537 storage/node_liveness.go:468  [n5,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
W200313 08:16:18.500039 5537 storage/node_liveness.go:449  StartHeartbeat, nl.mu.nodes: map[1:{1 14 1584085930.520997219,0 false false {} 0} 2:{2 19 1584086629.346032774,0 false false {} 0} 3:{3 16 1584087367.514125824,0 false false {} 0} 4:{4 18 1584087366.842963978,0 false false {} 0} 5:{5 16 1584087365.500409493,0 false false {} 0}]
W200313 08:16:18.500089 5537 storage/node_liveness.go:450  StartHeartbeat, liveness: node_id:5 epoch:16 expiration:<wall_time:1584087365500409493 logical:0 >
W200313 08:16:18.500100 5537 storage/node_liveness.go:451  StartHeartbeat, incrementEpoch: false
W200313 08:16:18.500107 5537 storage/node_liveness.go:452  StartHeartbeat, err: <nil>
I200313 08:16:18.954646 21936 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:18.955341 21936 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
I200313 08:16:19.954726 22043 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:16:19.955764 22043 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
E200313 08:16:20.727373 22000 server/admin.go:1163  admin-getUIData: context canceled
E200313 08:16:20.727415 22000 server/admin.go:1234  rpc error: code = Internal desc = An internal server error has occurred. Please check your CockroachDB logs for more details.
E200313 08:16:20.787736 22306 server/status.go:966  [n5,status] aborted during DistSender.Send: context canceled
I200313 08:16:20.954798 22403 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:20.955478 22403 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
I200313 08:16:21.061195 5530 server/status/runtime.go:500  [n5] runtime stats: 5.9 GiB RSS, 470 goroutines, 455 MiB/244 MiB/883 MiB GO alloc/idle/total, 4.5 GiB/5.2 GiB CGO alloc/total, 3995.9 CGO/sec, 208.1/25.2 %(u/s)time, 0.0 %gc (3x), 3.8 MiB/3.5 MiB (r/w)net
E200313 08:16:21.844358 21921 server/admin.go:1163  admin-getUIData: context canceled
E200313 08:16:21.844415 21921 server/admin.go:1234  rpc error: code = Internal desc = An internal server error has occurred. Please check your CockroachDB logs for more details.
I200313 08:16:21.954900 22164 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:16:21.955555 22164 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
I200313 08:16:22.955041 22199 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:22.955633 22199 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
W200313 08:16:23.000116 5537 storage/node_liveness.go:528  [n5,hb] slow heartbeat took 4.5s
W200313 08:16:23.000190 5537 storage/node_liveness.go:468  [n5,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
W200313 08:16:23.000247 5537 storage/node_liveness.go:449  StartHeartbeat, nl.mu.nodes: map[4:{4 18 1584087366.842963978,0 false false {} 0} 5:{5 16 1584087365.500409493,0 false false {} 0} 1:{1 14 1584085930.520997219,0 false false {} 0} 2:{2 19 1584086629.346032774,0 false false {} 0} 3:{3 16 1584087367.514125824,0 false false {} 0}]
W200313 08:16:23.000298 5537 storage/node_liveness.go:450  StartHeartbeat, liveness: node_id:5 epoch:16 expiration:<wall_time:1584087365500409493 logical:0 >
W200313 08:16:23.000314 5537 storage/node_liveness.go:451  StartHeartbeat, incrementEpoch: false
W200313 08:16:23.000322 5537 storage/node_liveness.go:452  StartHeartbeat, err: <nil>
I200313 08:16:23.955270 22118 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:16:23.955922 22118 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
W200313 08:16:24.012205 18305 storage/replica_range_lease.go:982  [n5,timeSeriesMaintenance,s5,r40731/3:/System/tsd/cr.node.{cl…-di…}] have been waiting 1m0s attempting to acquire lease
I200313 08:16:24.012272 18305 storage/replica_range_lease.go:987  [n5,timeSeriesMaintenance,s5,r40731/3:/System/tsd/cr.node.{cl…-di…}] slow lease acquisition finished after 1m0.000177009s with error [NotLeaseHolderError] r40731: replica (n5,s5):3 not lease holder; lease holder unknown after 1 attempts
W200313 08:16:38.763587 5272 storage/store.go:3710  [n5,s5,r1/4:/{Min-System/NodeL…}] handle raft ready: 0.6s [processed=0]
I200313 08:16:38.957508 22189 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:38.958203 22189 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
I200313 08:16:39.957566 21951 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:16:39.958214 21951 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
I200313 08:16:40.957697 22214 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:40.958501 22214 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
I200313 08:16:40.994823 5526 gossip/gossip.go:557  [n5] gossip status (stalled, 3 nodes)
gossip client (0/3 cur/max conns)
gossip server (2/3 cur/max conns, infos 2236/2321 sent/received, bytes 696228B/711788B sent/received)
  3: 11.42.200.225:26257 (3m21s)
  4: 11.42.200.224:26257 (3m20s)
gossip connectivity
  n3 -> n4; n3 -> n5; n4 -> n5;
W200313 08:16:41.001006 5537 storage/node_liveness.go:528  [n5,hb] slow heartbeat took 4.5s
W200313 08:16:41.001060 5537 storage/node_liveness.go:468  [n5,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
W200313 08:16:41.001109 5537 storage/node_liveness.go:449  StartHeartbeat, nl.mu.nodes: map[1:{1 14 1584085930.520997219,0 false false {} 0} 2:{2 19 1584086629.346032774,0 false false {} 0} 3:{3 16 1584087367.514125824,0 false false {} 0} 4:{4 18 1584087366.842963978,0 false false {} 0} 5:{5 16 1584087365.500409493,0 false false {} 0}]
W200313 08:16:41.001151 5537 storage/node_liveness.go:450  StartHeartbeat, liveness: node_id:5 epoch:16 expiration:<wall_time:1584087365500409493 logical:0 >
W200313 08:16:41.001160 5537 storage/node_liveness.go:451  StartHeartbeat, incrementEpoch: false
W200313 08:16:41.001167 5537 storage/node_liveness.go:452  StartHeartbeat, err: <nil>
I200313 08:16:41.071239 5530 server/status/runtime.go:500  [n5] runtime stats: 5.9 GiB RSS, 482 goroutines, 683 MiB/126 MiB/950 MiB GO alloc/idle/total, 4.5 GiB/5.2 GiB CGO alloc/total, 4999.4 CGO/sec, 224.9/22.8 %(u/s)time, 0.0 %gc (3x), 3.6 MiB/2.3 MiB (r/w)net
I200313 08:16:41.957891 22311 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:16:41.958597 22311 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
I200313 08:16:42.958000 22393 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:42.958680 22393 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
I200313 08:16:43.958121 22063 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:16:43.958692 22063 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
I200313 08:16:44.958297 22080 gossip/client.go:128  [n5] started gossip client to 11.42.200.225:26257
I200313 08:16:44.959130 22080 gossip/client.go:133  [n5] closing client to n3 (11.42.200.225:26257): stopping outgoing client to n3 (11.42.200.225:26257); already have incoming
W200313 08:16:44.970961 5523 storage/store_rebalancer.go:227  [n5,s5,store-rebalancer] StorePool missing descriptor for local store
I200313 08:17:05.962211 22767 gossip/client.go:128  [n5] started gossip client to 11.42.200.224:26257
I200313 08:17:05.962793 22767 gossip/client.go:133  [n5] closing client to n4 (11.42.200.224:26257): stopping outgoing client to n4 (11.42.200.224:26257); already have incoming
I200313 08:17:06.387383 13220 storage/replica_range_lease.go:987  [n5,s5,r45192/4:/Table/57/1/3{03498/…-10926/…}] slow lease acquisition finished after 2m51.198104485s with error [NotLeaseHolderError] r45192: replica (n5,s5):4 not lease holder; lease holder unknown after 1 attempts
I200313 08:17:06.387459 13220 storage/store.go:2968  [n5,s5] [DEBUG] after repl Send with error br:<nil> err:[NotLeaseHolderError] r45192: replica (n5,s5):4 not lease holder; lease holder unknown
I200313 08:17:06.387490 13220 storage/stores.go:202  [n5] [DEBUG] after Send br:<nil> pErr:[NotLeaseHolderError] r45192: replica (n5,s5):4 not lease holder; lease holder unknown
I200313 08:17:06.387705 13286 storage/replica_write.go:218  [n5,s5,r45192/4:/Table/57/1/3{03498/…-10926/…}] slow command RequestLease [/Table/57/1/303498/134893362,/Min) finished after 171.20s with error result is ambiguous (context canceled)

@ajwerner
Copy link
Contributor

I’m not sure what’s going on and likely won’t have a lot of time to investigate your situation in the coming days.

I encourage you to look closely at all of the network addresses of all of the servers. Are you connecting anything through a load balancer? Consider mapping out all of the servers, their IP addresses, and the flags you are passing to each of them to start.
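A quick way to do that audit (a rough sketch; the addresses below are just the node IPs already mentioned in this thread, and 26257 is the port from the start commands):

# on every machine, record the exact flags each cockroach process was started with
ps -ef | grep '[c]ockroach'
# check that every machine can reach every other node's advertised address and port
for addr in 11.42.200.223 11.43.107.41 11.42.200.224 11.42.200.225 11.42.223.117; do nc -vz -w 3 "$addr" 26257; done
# ask the cluster itself which address it has recorded for each node
cockroach node status --insecure --host=11.42.200.223:26257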

@yangxuanjia
Copy link
Author

@ajwerner
I am sad to hear that, but I understand. We use the --join=DNS flag for the cluster, and the DNS name is a CNAME to the load balancer IP. We have used this setup for half a year and it has always worked.
Yesterday I tried the --join=ip1,ip2,ip3,ip4,ip5 flag for the cluster, but it has the same problem. It is a really ambiguous problem and I can't form a clear picture of it. I will try again.
Thanks for your time.

@yangxuanjia
Copy link
Author

I have a question: in what situation can the r1 range fail to get a lease holder?

I200314 04:12:06.030196 94777 storage/replica.go:547  [n2,s2,r1/5:/{Min-System/NodeL…}] [DEBUG] yxj, replica, br:<nil>, pErr: [NotLeaseHolderError] r1: replica (n2,s2):5 not lease holder; lease holder unknown

@ajwerner
Copy link
Contributor

This seems due to the nodes not being able to communicate at startup. Providing a load balancer as the join address can be dangerous. I’m sorry to hear that using the individual IPs still did not work. When you did this, were the logs the same about closing the connection?

Perhaps consider 1) backing up the store data for each node just to be safe, and 2) using the same servers to start a new cluster, but with a different store directory on each node, to try to debug the networking situation.
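For example (a rough sketch; /export/node is the store path from the start commands above, the backup and fresh store paths are arbitrary names, and it assumes the original cockroach processes are stopped first):

# 1) back up the existing store directory on each node
cp -a /export/node /export/node.bak
# 2) start a throwaway cluster on the same machines with a fresh store directory,
#    reusing the same --join/--advertise flags, to check that networking is healthy
cockroach start --insecure --advertise-host=<this-node-ip> --listen-addr=<this-node-ip>:26257 --join=<ip1>:26257,<ip2>:26257,<ip3>:26257,<ip4>:26257,<ip5>:26257 --store=path=/export/node-test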

@yangxuanjia
Copy link
Author

yangxuanjia commented Mar 16, 2020

@ajwerner
Today I did 2 experiments.
One: on the same 5 Docker containers, I moved the data directory to another location, which is effectively starting a new cluster, and the cluster runs fine. So I think it is not a network problem.
Two: I copied the original data directory back into use, and the cluster has the problem again; the web UI only stays alive for about half a minute, during which I can see some metric data, which I will paste for you. After half a minute,
the cluster becomes abnormal again.

We added debug logging, and it looks like the problem may be in raft or in the ranges.
We also set testing_knobs.DisableProcessRaft=true locally, and the logs look very similar (a sketch of how we wired up that knob is below). But we don't understand how to resolve the problem for the online cluster.
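For reference, this is roughly how we wired that knob up in a local test (a minimal sketch, assuming the DisableProcessRaft field mentioned above lives on storage.StoreTestingKnobs and is plumbed through base.TestingKnobs like the other store knobs; the test name is made up, and the usual TestMain/serverutils test-package setup is omitted):

import (
  "context"
  "testing"

  "github.com/cockroachdb/cockroach/pkg/base"
  "github.com/cockroachdb/cockroach/pkg/storage"
  "github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
)

func TestCompareLogsWithRaftProcessingDisabled(t *testing.T) {
  // Start a single test server whose store never processes raft ready state,
  // so its log output can be compared against the stuck production nodes.
  s, _, _ := serverutils.StartServer(t, base.TestServerArgs{
    Knobs: base.TestingKnobs{
      Store: &storage.StoreTestingKnobs{DisableProcessRaft: true},
    },
  })
  defer s.Stopper().Stop(context.Background())
}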

@yangxuanjia
Copy link
Author

Screenshot from 2020-03-16 17-11-37
Screenshot from 2020-03-16 17-11-48
Screenshot from 2020-03-16 17-11-53
Screenshot from 2020-03-16 17-12-04
Screenshot from 2020-03-16 17-12-12
Screenshot from 2020-03-16 17-12-19
Screenshot from 2020-03-16 17-12-25
Screenshot from 2020-03-16 17-12-38
Screenshot from 2020-03-16 17-12-45
Screenshot from 2020-03-16 17-12-51
Screenshot from 2020-03-16 17-13-38

@yangxuanjia
Copy link
Author

During the half minute that the cluster stays alive, I captured some range info. It looks like the ranges have a raft leader but cannot get a lease holder.
Screenshot from 2020-03-17 14-30-41
Screenshot from 2020-03-17 14-31-17
Screenshot from 2020-03-17 14-31-43
Screenshot from 2020-03-17 14-32-54
Screenshot from 2020-03-17 14-33-08
Screenshot from 2020-03-17 14-33-21

@yangxuanjia
Copy link
Author

yangxuanjia commented Mar 17, 2020

Do you have any method or knob to make the cluster use only the raft leader (and not the raft lease holder)?

W200317 06:39:16.128985 1410 storage/store.go:1525  [n2,s2,r1/5:/{Min-System/NodeL…}] could not gossip first range descriptor: [NotLeaseHolderError] r1: replica (n2,s2):5 not lease holder; lease holder unknown
W200317 06:38:37.229896 182 gossip/gossip.go:1501  [n2] first range unavailable; trying remaining resolvers
W200317 06:39:00.094036 182 gossip/gossip.go:1499  [n2] first range unavailable; resolvers exhausted
W200317 06:40:16.183135 1410 storage/replica_range_lease.go:982  [n2,s2,r1/5:/{Min-System/NodeL…}] have been waiting 1m0s attempting to acquire lease
W200317 06:40:16.191106 11194 storage/replica_write.go:199  [n2,s2,r1/5:/{Min-System/NodeL…}] have been waiting 60.00s for proposing command RequestLease [/Min,/Min).
This range is likely unavailable.
Please submit this message at

  https://github.com/cockroachdb/cockroach/issues/new/choose

along with

        https://yourhost:8080/#/reports/range/1

@yangxuanjia
Copy link
Author

I see a system range, r2.
Screenshot from 2020-03-17 15-14-54
Screenshot from 2020-03-17 15-15-48
Screenshot from 2020-03-17 15-15-55

@yangxuanjia
Copy link
Author

r1 is maybe not an ok range either, because I also can't see its lease holder.
Screenshot from 2020-03-17 15-45-38
Screenshot from 2020-03-17 15-45-51
Screenshot from 2020-03-17 15-46-00

@yangxuanjia
Copy link
Author

The other ranges shown as invalid or without a raft leader all look like they have a leader and followers, but the lease holder is missing: they show up as underreplicated (or slow), unavailable, or with an invalid lease.
Screenshot from 2020-03-17 15-41-29
Screenshot from 2020-03-17 15-41-39
Screenshot from 2020-03-17 15-42-17
Screenshot from 2020-03-17 15-42-24
Screenshot from 2020-03-17 15-42-34
Screenshot from 2020-03-17 15-42-42
Screenshot from 2020-03-17 15-42-51

@ajwerner
Copy link
Contributor

You need to get a quorum on the node liveness range to bootstrap leases to the other ranges. It seems like n4 is the leader and has the most up-to-date log for the range you showed. I’m not certain that this is the node liveness range. Try starting that node first.

If you collect a debug zip and provide access to the source code (it seems that there are some modifications), it is possible that I can try to help better understand the situation, but no promises.
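For reference, the debug zip can be collected from any node that still responds, for example (the host here is just one of the node addresses already listed in this thread):

cockroach debug zip ./cockroach-debug.zip --insecure --host=11.42.200.223:26257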

@yangxuanjia
Copy link
Author

yangxuanjia commented Mar 19, 2020

@ajwerner
I tried restarting n4 first, but it did not help.
After these days of testing, I have formed a guess. I think it may be a design problem.

  1. The node liveness update depends on the raft lease holder.
  2. Electing the raft lease holder depends on node liveness (through gossip).
    So node liveness and the raft lease holder depend on each other, and this becomes a dead loop.

I think the problem is:

  1. Lots of writes and reads make the network heavily loaded.
  2. Node liveness heartbeats become slow.
  3. The raft lease holders find that node liveness is lost, so the lease holders expire and the leases become invalid.
  4. The node liveness table range also starts to have an invalid lease.
  5. Node liveness heartbeats can't be updated via gossip, because the node liveness table range has no lease holder.
  6. Node liveness and the raft lease holders depend on each other, and this becomes a dead loop.
  7. More and more ranges lose their lease holder, because node liveness heartbeats can't be updated.
  8. More than half of the ranges end up with invalid leases, and the whole cluster can't serve.
    PS: these are the steps as I guess them.

When I restart the cluster, I hope it can recover by itself. But...

  1. The node liveness table range has an invalid lease, and node liveness heartbeats can't be updated via gossip, because the node liveness table range has no lease holder.
  2. Node liveness and the raft lease holders depend on each other, and this becomes a dead loop.
  3. More and more ranges lose their lease holder, because node liveness heartbeats can't be updated.
  4. More than half of the ranges end up with invalid leases, and the whole cluster can't serve.
    The cluster stays alive for about half a minute, during which we can read data and select from tables. After half a minute the cluster can't serve.
    PS: so the cluster cannot recover by itself.

@ajwerner
Copy link
Contributor

I’m not exactly sure what you mean by the node liveness raft lease depending on node liveness. Most ranges indeed use node-liveness-based leases, but the node liveness range itself uses a different, expiration-based leasing mechanism.
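Concretely, the difference looks roughly like this (a compressed paraphrase of the leaseStatus branch that is quoted in full later in this thread; nodeLiveness here stands in for r.store.cfg.NodeLiveness, and this is not the exact implementation):

var expiration hlc.Timestamp
if lease.Type() == roachpb.LeaseExpiration {
  // Expiration-based lease (used by the node liveness range and other
  // low-level system ranges): the expiration lives in the lease itself,
  // so no liveness record is needed to decide whether it is still valid.
  expiration = lease.GetExpiration()
} else {
  // Epoch-based lease (used by most ranges): validity is tied to the lease
  // holder's node liveness record, which is why a stuck liveness range can
  // cascade into lease problems everywhere else.
  liveness, err := nodeLiveness.GetLiveness(lease.Replica.NodeID)
  if err == nil && liveness.Epoch == lease.Epoch {
    expiration = hlc.Timestamp(liveness.Expiration)
  }
}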

@yangxuanjia
Copy link
Author

I mean that the node liveness update needs to go through a txn, the txn needs to go through raft, and raft needs a lease holder.
But it looks like the node liveness range does not have a lease holder.

@yangxuanjia
Copy link
Author

yangxuanjia commented Mar 19, 2020

@ajwerner
Good news: I finally recovered the cluster, and used the cockroach dump tool to dump all the user data. All ranges of the cluster look good.
Now I can select * from t1, create database d1, create table t2, insert into t2, select * from t2, and all of these operations work fine.

I hacked some code; the main idea has 2 points:

  1. let the node liveness system ranges get a lease holder
  2. make all range leases valid

hack code:

  1. src/github.com/cockroachdb/cockroach/pkg/storage/node_liveness.go
func (nl *NodeLiveness) heartbeatInternal(
 ctx context.Context, liveness *storagepb.Liveness, incrementEpoch bool,
) error {
 // ...... 
 {
 maxOffset := nl.clock.MaxOffset()
 if maxOffset == timeutil.ClocklessMaxOffset {
  maxOffset = 0
 }
 update.Expiration = hlc.LegacyTimestamp(
  nl.clock.Now().Add((nl.livenessThreshold + maxOffset).Nanoseconds(), 0))
 // This guards against the system clock moving backwards. As long
 // as the cockroach process is running, checks inside hlc.Clock
 // will ensure that the clock never moves backwards, but these
 // checks don't work across process restarts.
 if liveness != nil && update.Expiration.Less(liveness.Expiration) {
  // yxj hack
  // return errors.Errorf("proposed liveness update expires earlier than previous record")
 }
}
}

func (nl *NodeLiveness) GetLiveness(nodeID roachpb.NodeID) (*storagepb.Liveness, error) {
 nl.mu.Lock()
 defer nl.mu.Unlock()
 // yxj hack
 // return nl.getLivenessLocked(nodeID)
 return &storagepb.Liveness{
  NodeID:     nodeID,
  Epoch:      500,
  Expiration: hlc.LegacyTimestamp(hlc.MaxTimestamp),
 }, nil
}

2. src/github.com/cockroachdb/cockroach/pkg/storage/store_pool.go

func MakeStorePoolNodeLivenessFunc(nodeLiveness *NodeLiveness) NodeLivenessFunc {
 return func(nodeID roachpb.NodeID, now time.Time, threshold time.Duration) storagepb.NodeLivenessStatus {
  // yxj hack
  //liveness, err := nodeLiveness.GetLiveness(nodeID)
  //if err != nil {
  // return storagepb.NodeLivenessStatus_UNAVAILABLE
  //}
  //return liveness.LivenessStatus(now, threshold, nodeLiveness.clock.MaxOffset())
  return storagepb.NodeLivenessStatus_LIVE
 }
}

3. src/github.com/cockroachdb/cockroach/pkg/storage/storagepb/liveness.go

// yxj hack
// "github.com/cockroachdb/cockroach/pkg/util/timeutil"
 
func (l *Liveness) IsLive(now hlc.Timestamp, maxOffset time.Duration) bool {
 // yxj hack
 //if maxOffset == timeutil.ClocklessMaxOffset {
 // // When using clockless reads, we're live without a buffer period.
 // maxOffset = 0
 //}
 //expiration := hlc.Timestamp(l.Expiration).Add(-maxOffset.Nanoseconds(), 0)
 //return now.Less(expiration)
 return true
}

func (l *Liveness) IsDead(now hlc.Timestamp, threshold time.Duration) bool {
 // yxj hack
 //deadAsOf := hlc.Timestamp(l.Expiration).GoTime().Add(threshold)
 //return !now.GoTime().Before(deadAsOf)
 return false
}

func (l *Liveness) LivenessStatus(
 now time.Time, threshold, maxOffset time.Duration,
) NodeLivenessStatus {
 // yxj hack
 //nowHlc := hlc.Timestamp{WallTime: now.UnixNano()}
 //if l.IsDead(nowHlc, threshold) {
 // if l.Decommissioning {
 //  return NodeLivenessStatus_DECOMMISSIONED
 // }
 // return NodeLivenessStatus_DEAD
 //}
 //if l.Decommissioning {
 // return NodeLivenessStatus_DECOMMISSIONING
 //}
 //if l.Draining {
 // return NodeLivenessStatus_UNAVAILABLE
 //}
 //if l.IsLive(nowHlc, maxOffset) {
 // return NodeLivenessStatus_LIVE
 //}
 //return NodeLivenessStatus_UNAVAILABLE
 return NodeLivenessStatus_LIVE
}

4. src/github.com/cockroachdb/cockroach/pkg/storage/replica_range_lease.go

func (r *Replica) leaseStatus(
 lease roachpb.Lease, timestamp, minProposedTS hlc.Timestamp,
) storagepb.LeaseStatus {
 status := storagepb.LeaseStatus{Timestamp: timestamp, Lease: lease}
 var expiration hlc.Timestamp
 if lease.Type() == roachpb.LeaseExpiration {
  expiration = lease.GetExpiration()
  // yxj hack
  expiration = hlc.MaxTimestamp
 } else {
  var err error
  status.Liveness, err = r.store.cfg.NodeLiveness.GetLiveness(lease.Replica.NodeID)
  // yxj hack
  status.Liveness.Epoch = lease.Epoch
  if err != nil || status.Liveness.Epoch < lease.Epoch {
   // If lease validity can't be determined (e.g. gossip is down
   // and liveness info isn't available for owner), we can neither
   // use the lease nor do we want to attempt to acquire it.
   if err != nil {
    if leaseStatusLogLimiter.ShouldLog() {
     log.Warningf(context.TODO(), "can't determine lease status due to node liveness error: %s", err)
    }
   }
   status.State = storagepb.LeaseState_ERROR
   return status
  }
  if status.Liveness.Epoch > lease.Epoch {
   status.State = storagepb.LeaseState_EXPIRED
   return status
  }
  expiration = hlc.Timestamp(status.Liveness.Expiration)
 }
 maxOffset := r.store.Clock().MaxOffset()
 if maxOffset == timeutil.ClocklessMaxOffset {
  // No stasis when using clockless reads.
  maxOffset = 0
 }
 stasis := expiration.Add(-int64(maxOffset), 0)
 if timestamp.Less(stasis) {
  status.State = storagepb.LeaseState_VALID
  // If the replica owns the lease, additional verify that the lease's
  // proposed timestamp is not earlier than the min proposed timestamp.
 
  // yxj hack
  //if lease.Replica.StoreID == r.store.StoreID() &&
  // lease.ProposedTS != nil && lease.ProposedTS.Less(minProposedTS) {
  // status.State = storagepb.LeaseState_PROSCRIBED
  //}
 } else if timestamp.Less(expiration) {
  status.State = storagepb.LeaseState_STASIS
 } else {
  status.State = storagepb.LeaseState_EXPIRED
 }
 return status
}

Now, the world is quiet......

@yangxuanjia
Copy link
Author

yangxuanjia commented Mar 19, 2020

Screenshot from 2020-03-19 16-57-35
Screenshot from 2020-03-19 16-57-49
Screenshot from 2020-03-19 17-14-41
Screenshot from 2020-03-19 17-14-51

@yangxuanjia
Copy link
Author

@ajwerner
Finally, I will do 2 things:

  1. cockroach dump the user data, to protect it; the dumped SQL can be restored into another cluster (see the example below).
  2. replace the code with v19.1 and see whether the cluster works normally.
    Thanks for your time and your advice, it has been a huge help.
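For example (the database name and file names are placeholders; the source host is one of the node addresses from this thread):

# dump one database's schema and data as SQL, from any live node of the recovered cluster
cockroach dump mydb --insecure --host=11.42.200.223:26257 > mydb.sql
# replay it into a new cluster
cockroach sql --insecure --host=<new-cluster-node>:26257 -e 'CREATE DATABASE mydb'
cockroach sql --insecure --host=<new-cluster-node>:26257 --database=mydb < mydb.sql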

@ajwerner
Copy link
Contributor

Happy to hear you got it recovered. Not sure what happened. I don't think this issue is readily actionable so I'm going to close it. If you find an issue with an easier reproduction, please open a new issue.

@luiscosio
Copy link

@yangxuanjia How did you manage to bring the cluster back online? I am experiencing the same issue as you.
