
One dead node/lost connection to entire cluster during tpcc #31415

Closed

awoods187 opened this issue Oct 16, 2018 · 8 comments

Comments

awoods187 (Contributor) commented Oct 16, 2018

Describe the problem

In the middle of finishing the restore for tpcc, I lost connection to my nodes:
[image]

--- FAIL: tpccbench/nodes=24/cpu=16/partition (12413.42s)
    test.go:570,cluster.go:1318,tpcc.go:662,tpcc.go:333: pq: setting updated but timed out waiting to read new value
FAIL

To Reproduce
Use roachtest from two days ago

Change test to:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 24,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 20000,
+                       EstimatedMax:   12000,
+                       LoadConfig:     singlePartitionedLoadgen,

Run:

bin/roachtest bench '^tpccbench/nodes=24/cpu=16/partition$' --wipe=false --user=andy

Expected behavior
Passing TPC-C

Additional data / screenshots
After a few minutes everything is lost:
[image]

Environment:

  • CockroachDB version 2.1 Beta 1008

Logs 1-12
Logs.zip

awoods187 (Contributor, Author) commented:

Logs 13-24
Logs 2.zip

awoods187 (Contributor, Author) commented:

Seeing lots of warnings:
W181015 23:57:23.541529 256905 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.142.0.46:26257: connect: connection refused". Reconnecting...

Seeing this error:
E181015 23:57:26.955261 256941 storage/queue.go:793 [n1,replicate,s1,r8554/1:/Table/57/1/115{2/8/-…-4/3/-…}] 3 matching stores are currently throttled

And this one:
E181015 23:05:57.681572 157652 storage/queue.go:793 [n1,replicate,s1,r18938/1:/Table/57/1/40{19/3/…-88/6/…}] change replicas of r18938 failed: descriptor changed: [expected] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0] != [actual] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=0]: unexpected value: raw_bytes:"\302\233\213t\003\010\372\223\001\022\013\301\211\367\017\263\213\206\371\027\212\210\032\013\301\211\367\017\370\216\206\366}\211\210\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\002\"\006\010\002\020\002\030\004(\005" timestamp:<wall_time:1539644757654205944 >

awoods187 (Contributor, Author) commented:

andy-1539640780-tpccbench-nodes-24-cpu-16-partition

@awoods187 awoods187 changed the title Lost connection to entire cluster during tpcc One dead node/lost connection to entire cluster during tpcc Oct 16, 2018
awoods187 (Contributor, Author) commented:

[screenshot: 2018-10-16, 7:56 AM]

tbg (Member) commented Oct 16, 2018

W181015 23:57:23.541529 256905 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.142.0.46:26257: connect: connection refused". Reconnecting...

This means that node 4 has crashed (or was shut down, but I'm pretty sure it crashed).

E181015 23:57:26.955261 256941 storage/queue.go:793 [n1,replicate,s1,r8554/1:/Table/57/1/115{2/8/-…-4/3/-…}] 3 matching stores are currently throttled

This is not desirable, but can happen when there is a lot of rebalancing activity. Not sure this really ought to be an error...
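For intuition, here's a minimal sketch in Go of the throttling idea (illustrative only, not the actual allocator code; storePool, throttle, and candidates are hypothetical names): the allocator temporarily marks a store that recently declined or failed work as throttled and skips it as a target, so under heavy rebalancing every candidate store can be throttled at once, producing the message above.

// Hypothetical sketch of store throttling, not CockroachDB's real code.
package main

import (
	"fmt"
	"time"
)

type storePool struct {
	throttledUntil map[int]time.Time // storeID -> throttle expiry
}

// throttle marks a store as unusable for the duration d,
// e.g. after it declined a snapshot.
func (sp *storePool) throttle(storeID int, d time.Duration) {
	sp.throttledUntil[storeID] = time.Now().Add(d)
}

// candidates filters out stores whose throttle has not expired.
func (sp *storePool) candidates(storeIDs []int) []int {
	var out []int
	now := time.Now()
	for _, id := range storeIDs {
		if now.Before(sp.throttledUntil[id]) {
			continue // still throttled, skip as a rebalance target
		}
		out = append(out, id)
	}
	return out
}

func main() {
	sp := &storePool{throttledUntil: map[int]time.Time{}}
	stores := []int{1, 2, 3}
	for _, id := range stores {
		sp.throttle(id, time.Minute) // each store recently declined work
	}
	if len(sp.candidates(stores)) == 0 {
		fmt.Println("3 matching stores are currently throttled")
	}
}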

E181015 23:05:57.681572 157652 storage/queue.go:793 [n1,replicate,s1,r18938/1:/Table/57/1/40{19/3/…-88/6/…}] change replicas of r18938 failed: descriptor changed: [expected] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0] != [actual] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=0]: unexpected value: raw_bytes:"\302\233\213t\003\010\372\223\001\022\013\301\211\367\017\263\213\206\371\027\212\210\032\013\301\211\367\017\370\216\206\366}\211\210\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\002\"\006\010\002\020\002\030\004(\005" timestamp:<wall_time:1539644757654205944 >

This is also expected under heavy rebalancing, but again I'm not sure it belongs in the logs in this way.
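Mechanically, the "descriptor changed" error is a failed compare-and-swap: the replica change commits only if the range descriptor still matches what the planner read, and a concurrent rebalance invalidates that expectation. A minimal sketch of the pattern (hypothetical types, not the real storage code):

// Hypothetical CAS sketch of change-replicas, not CockroachDB's real code.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type rangeDesc struct {
	replicas   []int // node IDs, simplified
	generation int
}

type kvStore struct {
	mu   sync.Mutex
	desc rangeDesc
}

var errDescriptorChanged = errors.New("descriptor changed")

// conditionalSwap mimics a conditional put: write newDesc only if the
// stored descriptor still equals expected. (Sprint comparison is a
// shortcut because rangeDesc contains a slice and isn't comparable.)
func (s *kvStore) conditionalSwap(expected, newDesc rangeDesc) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if fmt.Sprint(s.desc) != fmt.Sprint(expected) {
		return fmt.Errorf("%w: [expected] %v != [actual] %v",
			errDescriptorChanged, expected, s.desc)
	}
	s.desc = newDesc
	return nil
}

func main() {
	s := &kvStore{desc: rangeDesc{replicas: []int{1, 4, 3, 2}}}

	// The planner reads the descriptor...
	expected := rangeDesc{replicas: []int{1, 4, 3, 2}}

	// ...but a concurrent change removes n3 first.
	_ = s.conditionalSwap(expected, rangeDesc{replicas: []int{1, 4, 2}, generation: 1})

	// The original change now fails and must restart from the new descriptor.
	fmt.Println(s.conditionalSwap(expected, rangeDesc{replicas: []int{1, 4, 3, 2, 5}, generation: 1}))
}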

pq: setting updated but timed out waiting to read new value FAIL

The timeout used for setting a cluster setting was way too short. @nvanbenschoten increased it in the commit below, which you may not have in your build yet (it landed six days ago).

7a79a38
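Roughly, the failure mode looks like this (a toy sketch under assumed mechanics; waitForSetting is made up, not the real code): the SET CLUSTER SETTING write commits, then the session polls its locally cached value until it reflects the update. If propagation is slow on a big, busy cluster and the deadline is short, the write succeeds but the wait times out with exactly this error.

// Hypothetical sketch of the wait-for-new-value timeout, not the real code.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func waitForSetting(read func() string, want string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if read() == want {
			return nil // new value is visible locally
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("setting updated but timed out waiting to read new value")
}

func main() {
	var cached atomic.Value
	cached.Store("off")

	// Simulate slow propagation: the locally cached value only
	// reflects the committed update after 200ms.
	go func() {
		time.Sleep(200 * time.Millisecond)
		cached.Store("on")
	}()

	read := func() string { return cached.Load().(string) }

	// A too-short wait fails even though the write itself succeeded.
	fmt.Println(waitForSetting(read, "on", 50*time.Millisecond))

	// A more generous wait (the gist of the linked commit) succeeds.
	fmt.Println(waitForSetting(read, "on", time.Second))
}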

I think the cluster name you gave above is incorrect. I see n4 still running on that cluster.

@awoods187
Copy link
Contributor Author

No, it's definitely not incorrect. I have two 24-node clusters up--this one has one dead node and the other one (in #31458) has two dead nodes.

tbg (Member) commented Oct 16, 2018

Ah, they're interchanged. Look here:

failed to connect to {andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004:26257 0

andy-1539640780-tpccbench-nodes-24-cpu-16-partition

Time to take a look at andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004.

tbg (Member) commented Oct 16, 2018

n4 OOMed. The heap profile looks exactly like #31409 (comment). Folding into that issue.

@tbg tbg closed this as completed Oct 16, 2018