storage: pick up upstream fix for Raft entries handling #28918

jordanlewis · 2018-08-21T22:09:03Z

Running a 3-node TPCC cluster on roachprod. Under no load, one of the nodes panicked:

E180821 22:00:16.110425 180 storage/replica.go:5367  [n1,s1,r604/1:/Table/54/{2/945/1…-3}] unable to update the state machine: applied index jumped from 26183 to 26185
E180821 22:00:16.110649 180 util/log/crash_reporting.go:477  [n1,s1,r604/1:/Table/54/{2/945/1…-3}] Reported as error bcb7784105ae42b9b06ca477937f8521
F180821 22:00:16.110679 180 storage/replica.go:5372  [n1,s1,r604/1:/Table/54/{2/945/1…-3}] applied index jumped from 26183 to 26185
goroutine 180 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4204a0500, 0xc4204a05a0, 0x3c3bf00, 0x12)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:965 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x453ba00, 0xc400000004, 0x3c3bf48, 0x12, 0x14fc, 0xc420c74370, 0x50)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:834 +0x804
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2e91a20, 0xc4208da600, 0xc400000004, 0x2, 0x0, 0x0, 0xc4287c90d8, 0x1, 0x1)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:154 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2e91a20, 0xc4208da600, 0x1, 0x4, 0x0, 0x0, 0xc4287c90d8, 0x1, 0x1)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2e91a20, 0xc4208da600, 0xc4287c90d8, 0x1, 0x1)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc420859180, 0x2e91a20, 0xc4208da600, 0xc43075e5f0, 0x8, 0x9, 0x6649, 0x300000003, 0x2, 0x7, ...)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:5372 +0xaa8
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc420859180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:4232 +0x12fa
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x2e91a20, 0xc42d378d50, 0xc420859180, 0x2e91a20)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3685 +0x109
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc420035400, 0x2e91a20, 0xc42d378d50, 0xc4296b3040, 0xc4287c9ed0, 0x0)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3300 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc420035400, 0x2e91a20, 0xc426982210, 0x25c)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3673 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc4201cfd40, 0x2e91a20, 0xc426982210)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:226 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x2e91a20, 0xc426982210)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:166 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc424ea8cf0, 0xc420032480, 0xc424ea8ce0)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

The text was updated successfully, but these errors were encountered:

jordanlewis · 2018-08-21T22:16:07Z

Oh wait, this was on @nvanbenschoten and my sketchy count * branch.

nvanbenschoten · 2018-08-21T22:25:14Z

Mind pointing to that branch? I don't think we're doing anything that would have an effect on a range's raft log in that change, but I'd like to be sure. It will also be helpful for lining up the stack trace.

jordanlewis · 2018-08-21T22:28:22Z

https://github.com/jordanlewis/cockroach/tree/count-rows

jordanlewis · 2018-08-22T00:03:24Z

I captured a snapshot of the data directory on the first node which died. It's 71 gigabytes, uncompressed. I can upload it, but I'm assuming nobody will actually want to look at that.

nvanbenschoten · 2018-08-22T01:17:57Z

@jordanlewis did anything strange happen on that node before this? Was it recently restarted? Was it about to run out of memory? This is on the jordan-tpcc cluster, right? I'd like to at least get a look at its raft log.

jordanlewis · 2018-08-22T01:21:13Z

Nothing weird happened on the node. It was the node I was executing those COUNT queries for the demo on, but nothing besides that. The logs are available - jordan-tpcc:1. But I'm running other load on the cluster at the moment. The tarred directory is available in /mnt/data1 - probably don't try to extract it all at once or you'll run the node out of disk space.

Let's follow up about this offline.

nvanbenschoten · 2018-08-22T01:25:01Z

@tschottdorf how do these log.Fatals beneath Raft interact with the raftMu? They don't unwind the stack and unlock raftMu, potentially allowing a second goroutine to enter handleRaftReadyRaftMuLocked and see inconsistent state, right? Would that be a risk if we ever panicked while processing a raft command?

nvanbenschoten · 2018-08-22T01:25:31Z

Let's follow up about this offline.

SGTM

benesch · 2018-08-22T01:25:33Z

FYI your nodes are running slightly different binaries:

$ roachprod run jordan-tpcc '~jordan/cockroach version'
jordan-tpcc: ~jordan/cockroach version 4/4
   1: Build Tag:    v2.1.0-alpha.20180702-1850-gf8620200b
Build Time:   2018/08/22 00:08:44
Distribution: CCL
Platform:     linux amd64 (x86_64-linux-gnu)
Go Version:   go1.10.3
C Compiler:   gcc 7.3.0
Build SHA-1:  f8620200b0ed08c868316365ba5ae76bbe4441ef
Build Type:   development
   2: Build Tag:    v2.1.0-alpha.20180702-1850-gf8620200b
Build Time:   2018/08/22 00:08:44
Distribution: CCL
Platform:     linux amd64 (x86_64-linux-gnu)
Go Version:   go1.10.3
C Compiler:   gcc 7.3.0
Build SHA-1:  f8620200b0ed08c868316365ba5ae76bbe4441ef
Build Type:   development
   3: Build Tag:    v2.1.0-alpha.20180702-1850-gf8620200b
Build Time:   2018/08/22 00:08:44
Distribution: CCL
Platform:     linux amd64 (x86_64-linux-gnu)
Go Version:   go1.10.3
C Compiler:   gcc 7.3.0
Build SHA-1:  f8620200b0ed08c868316365ba5ae76bbe4441ef
Build Type:   development
   4: Build Tag:    v2.1.0-alpha.20180702-1850-gf8620200b
Build Time:   2018/08/21 20:49:04
Distribution: CCL
Platform:     linux amd64 (x86_64-linux-gnu)
Go Version:   go1.10.3
C Compiler:   gcc 7.3.0
Build SHA-1:  f8620200b0ed08c868316365ba5ae76bbe4441ef
Build Type:   development

Note the different build time on n4. I realize the build SHA is the same, but I'm not super confident in our build SHA injection. Is it possible this is to blame?

benesch · 2018-08-22T01:27:34Z

I thought we'd removed most of the panics in storage in #16482. Have some snuck back in?

jordanlewis · 2018-08-22T01:32:53Z

@benesch, n4 isn't part of the cluster.

jordanlewis · 2018-08-22T01:33:44Z

(To be more clear, I was intending to use it as the load generator in some tests - and I never started Cockroach on it or joined it to the cluster.)

jordanlewis · 2018-08-22T01:34:36Z

Also, the "panic" I observed was a log.Fatal. Sorry for misleading!

benesch · 2018-08-22T01:48:23Z

@benesch, n4 isn't part of the cluster.

Oh good. By which I mean, oh no.

Also, the "panic" I observed was a log.Fatal. Sorry for misleading!

No worries—I think @nvanbenschoten was worried about other panics in the storage package. Of which there are many:

$ git grep 'panic(' -- ':!*_test.go' | wc -l
      15

@nvanbenschoten log.Fatal calls os.Exit so it will not release any locks or unwind the stack at all.

jordanlewis · 2018-08-22T02:39:42Z

This reproduced, on a different node, soon after startup. No longer on the count(*) branch, either.

180822 02:26:10.370774 185 storage/replica.go:5401  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] unable to update the state machine: applied index jumped from 2443716 to 2443718
E180822 02:26:10.374425 185 util/log/crash_reporting.go:477  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] Reported as error 19a3626e1f924077b8aef04045c9a6ee
F180822 02:26:10.374443 185 storage/replica.go:5406  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] applied index jumped from 2443716 to 2443718
goroutine 185 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4204cc500, 0xc4204cc5a0, 0x3d85d00, 0x12)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:965 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x468d3a0, 0xc400000004, 0x3d85dbf, 0x12, 0x151e, 0xc4298267e0, 0x52)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:834 +0x804
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x2f43f00, 0xc420c5a810, 0xc400000004, 0x2, 0x0, 0x0, 0xc42a3a8fe8, 0x1, 0x1)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:154 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x2f43f00, 0xc420c5a810, 0x1, 0x4, 0x0, 0x0, 0xc42a3a8fe8, 0x1, 0x1)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(0x2f43f00, 0xc420c5a810, 0xc42a3a8fe8, 0x1, 0x1)
        /home/jordan/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:181 +0x6c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc420856700, 0x2f43f00, 0xc420c5a810, 0xc4262483e0, 0x8, 0x1b, 0x2549c6, 0x200000002, 0x3, 0xc, ...)

tbg · 2018-08-22T06:04:39Z

Are there logs of this or a cluster that I can look at?

tbg · 2018-08-22T06:22:13Z

Ah, jordan-tpcc1. Taking a look.

roachprod ssh jordan-tpcc:1-3 "grep -hR 'r604/' /home/jordan/logs/cockroach.jordan*" | grep -E '^[IWEF]' | cut -c '2-' | sort | less

So first of all, jordan-tpcc:3 is actually node2. I mention this because I can't really say much about n1 (jordan-tpcc:1) -- the first entry in the logs for r604 is actually the crash! There's literally nothing else in them about that range. Nothing! Even though it has logs going back as much as the other nodes. But at the same time, n1 is running right now and the range is quiescent (though n2 is down), at lease index 40738. So whatever is happening seems to be able to rectify itself.

Now node2 (jordan-tpcc3) actually has had the same problem:

roachprod ssh jordan-tpcc:1-3 "grep -hR 'r137[^0-9]' /home/jordan/logs/cockroach.jordan*" | grep -E '^[IWEF]' | cut -c '2-' | sort | less
80822 00:43:50.321460 2853927 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 01:16:38.955973 178 storage/replica_proposal.go:214  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] new range lease repl=(n2,s2):3 seq=12 start=1534896775.261418534,0 epo=10 pro=1534900589.436144470,0 following repl=(n2,s2):3 seq=12 start=1534896775.261418534,0 epo=10 pro=1534898195.434831308,0
180822 01:18:07.173010 3795734 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: failed to resolve intents: the batch experienced mixed success and failure
180822 01:18:07.191900 3777809 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 01:18:07.195077 3667288 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 01:19:25.238767 3994102 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: failed to resolve intents: the batch experienced mixed success and failure
180822 01:19:25.251414 4011402 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Put, 1 BeginTxn
180822 01:19:25.252120 3998862 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Gc
180822 01:19:25.258729 3996348 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Gc
180822 01:19:25.262334 3996347 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Gc
180822 01:19:25.266664 3992477 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 ResolveIntent
180822 01:19:25.269881 4011587 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Put, 1 BeginTxn
180822 01:19:30.857840 3643185 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 01:19:30.863610 3643283 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Put, 1 BeginTxn
180822 01:19:30.977283 3635709 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Gc
180822 01:19:31.020586 3635647 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Gc
180822 01:20:51.433381 4182930 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: could not GC completed transaction anchored at /Table/55/1/516/0: context canceled
180822 01:21:44.314235 192 storage/replica_proposal.go:214  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] new range lease repl=(n2,s2):3 seq=12 start=1534896775.261418534,0 epo=10 pro=1534900890.931861257,0 following repl=(n2,s2):3 seq=12 start=1534896775.261418534,0 epo=10 pro=1534900589.436144470,0
180822 01:21:45.548651 4330105 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 01:21:45.571948 4330104 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 Put, 1 BeginTxn
180822 01:21:50.296284 4291189 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 01:22:25.002480 4357555 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 01:22:25.003191 4400051 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 EndTxn
180822 01:23:40.534361 4352969 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: could not GC completed transaction anchored at /Table/55/1/815/0: result is ambiguous (context canceled)
180822 01:29:42.860901 5249180 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: failed to resolve intents: the batch experienced mixed success and failure
180822 01:35:17.565893 5379639 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: could not GC completed transaction anchored at /Table/55/1/109/0: result is ambiguous (context canceled)
180822 01:41:28.072219 5712582 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: failed to resolve intents: the batch experienced mixed success and failure
180822 02:14:18.540236 5908730 storage/intent_resolver.go:738  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] failed to cleanup transaction intents: failed to resolve intents: the batch experienced mixed success and failure
180822 02:24:11.041111 6792220 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 02:24:11.041397 6791097 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 PushTxn
180822 02:24:11.042937 6787605 storage/replica.go:3511  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] context canceled before proposing: 1 ResolveIntent
180822 02:26:10.370774 185 storage/replica.go:5401  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] unable to update the state machine: applied index jumped from 2443716 to 2443718
180822 02:26:10.374425 185 util/log/crash_reporting.go:477  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] Reported as error 19a3626e1f924077b8aef04045c9a6ee
180822 02:26:10.374443 185 storage/replica.go:5406  [n2,s2,r137/3:/Table/55/1/{0/0-1000}] applied index jumped from 2443716 to 2443718
180822 02:26:11.042668 176 storage/replica_proposal.go:214  [n1,s1,r137/1:/Table/55/1/{0/0-1000}] new range lease repl=(n1,s1):1 seq=13 start=1534904770.198981755,0 epo=12 pro=1534904770.198987805,0 following repl=(n2,s2):3 seq=12 start=1534896775.261418534,0 epo=10 pro=1534900890.931861257,0

It's interesting that it's apparently able to get a lease later.

For r604, something that caught my eye were these rapid repeated lease acquisitions (or lease acquisition attempts). I don't know if that's related, but definitely needs to be understood as well.

180821 14:27:40.959644 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.942824492,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861660.930587554,0
180821 14:27:40.973122 178 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.960146739,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861660.942824492,0
180821 14:27:40.988157 178 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.974917324,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861660.960146739,0
180821 14:27:41.001183 179 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861660.988554372,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861660.974917324,0
180821 14:27:41.015964 179 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.001552995,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861660.988554372,0
180821 14:27:41.031030 178 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.016203856,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.001552995,0
180821 14:27:41.047547 126 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.031225361,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.016203856,0
180821 14:27:41.061015 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.047872992,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.031225361,0
180821 14:27:41.074587 179 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.061302724,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.047872992,0
180821 14:27:41.087465 174 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.074901102,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.061302724,0
180821 14:27:41.102171 177 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.088295754,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.074901102,0
180821 14:27:41.119045 177 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.102585015,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.088295754,0
180821 14:27:41.135279 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.119410020,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.102585015,0
180821 14:27:41.170818 122 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.135530480,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.119410020,0
180821 14:27:41.188962 128 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.171110550,0 following repl=(n2,s2):3 seq=6 start=1534
861646.829035154,0 epo=2 pro=1534861661.135530480,0
180821 14:27:41.801610 176 storage/replica_proposal.go:214  [n2,s2,r604/3:/Table/54/{2/945/1…-3}] new range lease repl=(n2,s2):3 seq=6 start=1534861646.829035154,0 epo=2 pro=1534861661.189245834,0 following repl=(n2,s2):3 seq=6 start=1534

@jordanlewis please don't touch or restart n3, I'm out for a bit but will take a look at the raft log there.

tbg · 2018-08-22T09:48:17Z

applied index jumped from 2443716 to 2443718

Raft log shows that the last entry is an empty one at

/Local/RangeID/137/u/RaftLog/logIndex:2444130: Term:28 Index:2444130 : EMPTY

The log around the jump doesn't have a gap (as expected, or the other nodes would've also crashed):

/Local/RangeID/137/u/RaftLog/logIndex:2443716: Term:27 Index:2443716  by lease #12
proposer_replica:<node_id:2 store_id:2 replica_id:3 > proposer_lease_sequence:12 max_lease_index:2439787 replicated_eval_result:<timestamp:<wall_time:1534904740554511745 > delta:<last_update_nanos:1534904740554511745 live_bytes:64 key_bytes:12 val_bytes:113 val_count:1 intent_bytes:61 intent_count:1 sys_bytes:160 sys_count:1 > > write_batch:<data:"\000\000\000\000\000\000\000\000\003\000\000\000\001\036\001k\022\277\211\357\210\000\001txn-C6\301\373\313\246F\301\243$\034-\262fw\\\000\202\001\022\004\010\000\020\000\030\000 \000(\0002t\366S\312\331\003\n*\n\020C6\301\373\313\246F\301\243$\034-\262fw\\\032\004\277\211\357\210*\n\010\201\253\302\206\253\367\304\246\0250\303\270\0228\002\022\007sql txn*\n\010\307\261\352\217\253\367\304\246\0252\n\010\201\253\302\206\253\367\304\246\025:\n\010\201\253\302\206\253\367\304\246\025B\016\010\002\022\n\010\201\253\302\206\253\367\304\246\025H\001r\000z\000\001\005\277\211\357\210\000@\n*\n\020C6\301\373\313\246F\301\243$\034-\262fw\\\032\004\277\211\357\210*\n\010\201\253\302\206\253\367\304\246\0250\303\270\0228\003\022\014\010\201\253\302\206\253\367\304\246\025\020\000\030\000 \014(1\001\016\277\211\357\210\000\025M\023\272\260\320\225\201\t1\036\025\255\271\n&\0016\026\00214\026\00212\026\00213\026\002XA\026\t648411111\025\004(\211\002v\025\0064\217\006d\201\311" >

/Local/RangeID/137/u/RaftLog/logIndex:2443717: Term:27 Index:2443717  by lease #12
proposer_replica:<node_id:2 store_id:2 replica_id:3 > proposer_lease_sequence:12 max_lease_index:2439788 replicated_eval_result:<timestamp:<wall_time:1534904735574382535 > delta:<last_update_nanos:1534904735574382535 live_bytes:-66 val_bytes:-66 intent_bytes:-60 intent_count:-1 sys_bytes:129 > > write_batch:<data:"\000\000\000\000\000\000\000\000\002\000\000\000\000\007\277\211\367\003\210\210\000\001 \001k\022\277\211\367\003\210\210\000\001txn-?\034R\375\271\317@\224\233/\326\242\200#\374\222\000\205\002\022\004\010\000\020\000\030\000 \000(\0002\366\001\256\271\362\215\003\n,\n\020?\034R\375\271\317@\224\233/\326\242\200#\374\222\032\006\277\211\367\003\210\210*\n\010\307\257\347\277\230\367\304\246\0250\346\212\0108\016\022\007sql txn \001*\n\010\207\325\371\306\234\367\304\246\0252\n\010\307\257\347\277\230\367\304\246\025:\n\010\307\257\347\277\230\367\304\246\025B\016\010\002\022\n\010\307\257\347\277\230\367\304\246\025B\016\010\003\022\n\010\361\224\230\216\231\367\304\246\025H\001Z\t\032\007\275\211\367\003\210\222\210Z\n\032\010\302\211\367\003\210\222\332\210Z\030\032\026\303\211\022D\215\337\253\357\367E\236\264B\356\360\314\324I\331\000\001\210Z\034\032\032\303\212\367\003\210\222\022D\215\337\253\357\367E\236\264B\356\360\314\324I\331\000\001\210Z\035\032\033\303\213\367\003\210\222\332\022D\215\337\253\357\367E\236\264B\356\360\314\324I\331\000\001\210r\000z\000" >

/Local/RangeID/137/u/RaftLog/logIndex:2443718: Term:27 Index:2443718  by lease #12
proposer_replica:<node_id:2 store_id:2 replica_id:3 > proposer_lease_sequence:12 max_lease_index:2439789 replicated_eval_result:<timestamp:<wall_time:1534904727138676651 > delta:<last_update_nanos:1534904727138676651 live_bytes:67 key_bytes:12 val_bytes:116 val_count:1 intent_bytes:61 intent_count:1 sys_bytes:164 sys_count:1 > > write_batch:<data:"\000\000\000\000\000\000\000\000\003\000\000\000\001\037\001k\022\277\211\366\232\210\000\001txn-\"\024\344\230\322}Nu\235\001\313\252\227\364\320.\000\205\001\022\004\010\000\020\000\030\000 \000(\0002w\024\337\t+\003\n-\n\020\"\024\344\230\322}Nu\235\001\313\252\227\364\320.\032\005\277\211\366\232\210*\014\010\302\225\361\370\372\366\304\246\025\020\0020\262\35128\002\022\007sql txn*\n\010\336\373\251\221\253\367\304\246\0252\n\010\253\337\254\211\371\366\304\246\025:\n\010\343\252\250\236\371\366\304\246\025B\016\010\002\022\n\010\343\252\250\236\371\366\304\246\025H\001r\000z\000\001\006\277\211\366\232\210\000C\n-\n\020\"\024\344\230\322}Nu\235\001\313\252\227\364\320.\032\005\277\211\366\232\210*\014\010\302\225\361\370\372\366\304\246\025\020\0020\262\35128\003\022\014\010\253\337\254\211\371\366\304\246\025\020\000\030\000 \014(1\001\017\277\211\366\232\210\000\025M\023\267\221+/\253\t1\220+\320_\n&\00210\026\00210\026\00216\026\00212\026\002BJ\026\t822911111\025\003)\007=\025\0064\217\006\030$5" >

tbg · 2018-08-23T14:36:36Z

[deleted] because wrong thread. Wanted #28995.

tbg · 2018-08-27T18:36:01Z

As expected, I wasn't able to repro this by manually importing TPCC and running some COUNT(*)s, starting and stopping the workload, restarting nodes, etc. I dropped one incarnation of TPCC and set the GC ttl to 60s. The cluster seems to work fine but spews

W180827 18:35:01.513790 1408 sql/exec_util.go:902  [n1,client=127.0.0.1:54412,user=root] error executing schema change: context canceled
W180827 18:35:02.596877 1408 sql/exec_util.go:902  [n1,client=127.0.0.1:54412,user=root] error executing schema change: context canceled
W180827 18:35:03.622761 1408 sql/exec_util.go:902  [n1,client=127.0.0.1:54412,user=root] error executing schema change: context canceled
W180827 18:35:04.596877 1408 sql/exec_util.go:902  [n1,client=127.0.0.1:54412,user=root] error executing schema change: context canceled
W180827 18:35:05.477075 1408 sql/exec_util.go:902  [n1,client=127.0.0.1:54412,user=root] error executing schema change: context canceled

but there's no indication of an ongoing schema change. I think this is likely the DROP operation, though the event log doesn't say anything about that not having completed:

upstream moved from github.com/etcd to go.etcd.io/etcd. We're going to have to pick up an update soon to fix cockroachdb#28918, so it's easier to switch now. We're not picking up any new commits except for the renames. Release note: None

29574: storage: update import path for Raft r=petermattis a=tschottdorf upstream moved from github.com/etcd to go.etcd.io/etcd. We're going to have to pick up an update soon to fix #28918, so it's easier to switch now. We're not picking up any new commits except for the renames. Release note: None 29585: gossip: avoid allocation of UnresolvedAddr in getNodeIDAddressLocked r=nvanbenschoten a=nvanbenschoten `getNodeIDAddressLocked` is called from `Dialer.ConnHealth` and `Dialer.DialInternalClient`. It was responsible for **1.71%** of all allocations (`alloc_objects`) on a 3-node long-running cluster that was running TPC-C 1K. Pointing into `nd.LocalityAddress` is safe because even if the `NodeDescriptor` itself is replaced in `Gossip`, the struct is never internally mutated. This is the same reason why taking the address of `nd.Address` was already safe. Release note (performance improvement): Avoid allocation when checking RPC connection health. Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

tbg · 2018-09-05T15:46:07Z

Downgrading, but leaving open. We also need roachtesting for this (I hope that this will fall out of Nikhil's upcoming merge queue work to reproduce #29252.

upstream moved from github.com/etcd to go.etcd.io/etcd. We're going to have to pick up an update soon to fix cockroachdb#28918, so it's easier to switch now. We're not picking up any new commits except for the renames. Release note: None

This works around the bug outlined in: etcd-io/etcd#10063 by matching Raft's internal implementation of commit pagination. Once the above PR lands, we can revert this commit (but I assume that it will take a little bit), and I think we should do that because the code hasn't gotten any nicer to look at. Fixes cockroachdb#28918. Release note: None

29579: storage: return one entry less in Entries r=petermattis a=tschottdorf This works around the bug outlined in: etcd-io/etcd#10063 by matching Raft's internal implementation of commit pagination. Once the above PR lands, we can revert this commit (but I assume that it will take a little bit), and I think we should do that because the code hasn't gotten any nicer to look at. Fixes #28918. Release note: None 29631: cli: handle merged range descriptors in debug keys r=petermattis a=tschottdorf Noticed during #29252. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

This works around the bug outlined in: etcd-io/etcd#10063 by matching Raft's internal implementation of commit pagination. Once the above PR lands, we can revert this commit (but I assume that it will take a little bit), and I think we should do that because the code hasn't gotten any nicer to look at. Fixes cockroachdb#28918. Release note: None

29332: roachtest: tpcc with chaos r=benesch a=tschottdorf This is an attempt to reproduce #28918. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

This works around the bug outlined in: etcd-io/etcd#10063 by matching Raft's internal implementation of commit pagination. Once the above PR lands, we can revert this commit (but I assume that it will take a little bit), and I think we should do that because the code hasn't gotten any nicer to look at. Fixes cockroachdb#28918. Release note: None # # Commit message recommendations: # # --- # <pkg>: <short description> # # <long description> # # Release note (category): <release note description> # --- # # Wrap long lines! 72 columns is best. # # The release note must be present if your commit has # user-facing changes. Leave the default above if not. # # Categories for release notes: # - cli change # - sql change # - admin ui change # - general change (e.g., change of required Go version) # - build change (e.g., compatibility with older CPUs) # - enterprise change (e.g., change to backup/restore) # - backwards-incompatible change # - performance improvement # - bug fix # # Commit message recommendations: # # --- # <pkg>: <short description> # # <long description> # # Release note (category): <release note description> # --- # # Wrap long lines! 72 columns is best. # # The release note must be present if your commit has # user-facing changes. Leave the default above if not. # # Categories for release notes: # - cli change # - sql change # - admin ui change # - general change (e.g., change of required Go version) # - build change (e.g., compatibility with older CPUs) # - enterprise change (e.g., change to backup/restore) # - backwards-incompatible change # - performance improvement # - bug fix

nvanbenschoten · 2019-08-02T05:41:57Z

@tbg can we close this issue?

jordanlewis added this to the 2.1 milestone Aug 21, 2018

jordanlewis assigned tbg Aug 21, 2018

benesch changed the title ~~core: applied index jumped from 26183 to 26185~~ storage: applied index jumped from 26183 to 26185 Aug 22, 2018

This was referenced Aug 22, 2018

storage: mvccstats inconsistencies during tpcc #28946

Closed

storage: very frequent near identical lease acquisition log messages #28947

Closed

tbg added S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting and removed S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. labels Aug 27, 2018

craig bot closed this as completed in #29574 Sep 5, 2018

tbg removed the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Sep 5, 2018

tbg changed the title ~~storage: applied index jumped from 26183 to 26185~~ storage: pick up upstream fix for Raft entries handling Sep 5, 2018

tbg modified the milestones: 2.1, 2.2 Sep 5, 2018

tbg reopened this Sep 5, 2018

craig bot closed this as completed in #29579 Sep 6, 2018

tbg reopened this Sep 6, 2018

petermattis mentioned this issue Sep 17, 2018

panic: tocommit(691243) is out of range [lastIndex(691183)] causes CrashLoopBackOff #30260

Closed

craig bot pushed a commit that referenced this issue Oct 3, 2018

Merge #29332

7cda4be

29332: roachtest: tpcc with chaos r=benesch a=tschottdorf This is an attempt to reproduce #28918. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

tbg mentioned this issue Oct 15, 2018

storage: divergent entry application can lead to request stalls on Range splits #31330

Closed

tbg mentioned this issue Oct 15, 2018

backport-2.0: storage: return one entry less in Entries #31351

Merged

tbg closed this as completed Aug 2, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: pick up upstream fix for Raft entries handling #28918

storage: pick up upstream fix for Raft entries handling #28918

jordanlewis commented Aug 21, 2018

jordanlewis commented Aug 21, 2018

nvanbenschoten commented Aug 21, 2018

jordanlewis commented Aug 21, 2018

jordanlewis commented Aug 22, 2018

nvanbenschoten commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

nvanbenschoten commented Aug 22, 2018

nvanbenschoten commented Aug 22, 2018

benesch commented Aug 22, 2018

benesch commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

benesch commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

tbg commented Aug 22, 2018

tbg commented Aug 22, 2018 •

edited

Loading

tbg commented Aug 22, 2018

tbg commented Aug 23, 2018 •

edited

Loading

tbg commented Aug 27, 2018

tbg commented Sep 5, 2018

nvanbenschoten commented Aug 2, 2019

storage: pick up upstream fix for Raft entries handling #28918

storage: pick up upstream fix for Raft entries handling #28918

Comments

jordanlewis commented Aug 21, 2018

jordanlewis commented Aug 21, 2018

nvanbenschoten commented Aug 21, 2018

jordanlewis commented Aug 21, 2018

jordanlewis commented Aug 22, 2018

nvanbenschoten commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

nvanbenschoten commented Aug 22, 2018

nvanbenschoten commented Aug 22, 2018

benesch commented Aug 22, 2018

benesch commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

benesch commented Aug 22, 2018

jordanlewis commented Aug 22, 2018

tbg commented Aug 22, 2018

tbg commented Aug 22, 2018 • edited Loading

tbg commented Aug 22, 2018

tbg commented Aug 23, 2018 • edited Loading

tbg commented Aug 27, 2018

tbg commented Sep 5, 2018

nvanbenschoten commented Aug 2, 2019

tbg commented Aug 22, 2018 •

edited

Loading

tbg commented Aug 23, 2018 •

edited

Loading