
One dead node/lost connection to entire cluster during tpcc #31415

Closed

awoods187 opened this issue Oct 16, 2018 · 8 comments

Comments

awoods187 (Contributor) commented Oct 16, 2018

Describe the problem

In the middle of finishing the restore for tpcc, I lost connection to my nodes:
[image]

--- FAIL: tpccbench/nodes=24/cpu=16/partition (12413.42s)
    test.go:570,cluster.go:1318,tpcc.go:662,tpcc.go:333: pq: setting updated but timed out waiting to read new value
FAIL

To Reproduce
Use roachtest from two days ago

Change test to:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 24,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 20000,
+                       EstimatedMax:   12000,
+                       LoadConfig:     singlePartitionedLoadgen,

Run:

bin/roachtest bench '^tpccbench/nodes=24/cpu=16/partition$' --wipe=false --user=andy

Expected behavior
Passing TPC-C

Additional data / screenshots
After a few minutes everything is lost:
[image]

Environment:

  • CockroachDB version 2.1 Beta 1008

Logs 1-12
Logs.zip

awoods187 (Contributor, Author) commented:

Logs 13-24
Logs 2.zip

awoods187 (Contributor, Author) commented:

Seeing lots of warnings:
W181015 23:57:23.541529 256905 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.142.0.46:26257: connect: connection refused". Reconnecting...

Seeing this error:
E181015 23:57:26.955261 256941 storage/queue.go:793 [n1,replicate,s1,r8554/1:/Table/57/1/115{2/8/-…-4/3/-…}] 3 matching stores are currently throttled

And this one:
E181015 23:05:57.681572 157652 storage/queue.go:793 [n1,replicate,s1,r18938/1:/Table/57/1/40{19/3/…-88/6/…}] change replicas of r18938 failed: descriptor changed: [expected] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0] != [actual] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=0]: unexpected value: raw_bytes:"\302\233\213t\003\010\372\223\001\022\013\301\211\367\017\263\213\206\371\027\212\210\032\013\301\211\367\017\370\216\206\366}\211\210\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\002\"\006\010\002\020\002\030\004(\005" timestamp:<wall_time:1539644757654205944 >

awoods187 (Contributor, Author) commented:

andy-1539640780-tpccbench-nodes-24-cpu-16-partition

@awoods187 awoods187 changed the title Lost connection to entire cluster during tpcc One dead node/lost connection to entire cluster during tpcc Oct 16, 2018
awoods187 (Contributor, Author) commented:

[screenshot: 2018-10-16, 7:56 AM]

tbg (Member) commented Oct 16, 2018

W181015 23:57:23.541529 256905 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.142.0.46:26257: connect: connection refused". Reconnecting...

This means that node 4 has crashed (or was shut down, but I'm pretty sure it crashed).

E181015 23:57:26.955261 256941 storage/queue.go:793 [n1,replicate,s1,r8554/1:/Table/57/1/115{2/8/-…-4/3/-…}] 3 matching stores are currently throttled

This is not desirable, but can happen when there is a lot of rebalancing activity. Not sure this really ought to be an error...
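For intuition, here's a minimal sketch in Go of the throttling idea (illustrative only, not the actual allocator code; storePool, throttle, and candidates are hypothetical names): the allocator temporarily marks a store that recently declined or failed work as throttled and skips it as a target, so under heavy rebalancing every candidate store can be throttled at once, producing the message above.

// Hypothetical sketch of store throttling, not CockroachDB's real code.
package main

import (
	"fmt"
	"time"
)

type storePool struct {
	throttledUntil map[int]time.Time // storeID -> throttle expiry
}

// throttle marks a store as unusable for the duration d,
// e.g. after it declined a snapshot.
func (sp *storePool) throttle(storeID int, d time.Duration) {
	sp.throttledUntil[storeID] = time.Now().Add(d)
}

// candidates filters out stores whose throttle has not expired.
func (sp *storePool) candidates(storeIDs []int) []int {
	var out []int
	now := time.Now()
	for _, id := range storeIDs {
		if now.Before(sp.throttledUntil[id]) {
			continue // still throttled, skip as a rebalance target
		}
		out = append(out, id)
	}
	return out
}

func main() {
	sp := &storePool{throttledUntil: map[int]time.Time{}}
	stores := []int{1, 2, 3}
	for _, id := range stores {
		sp.throttle(id, time.Minute) // each store recently declined work
	}
	if len(sp.candidates(stores)) == 0 {
		fmt.Println("3 matching stores are currently throttled")
	}
}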

E181015 23:05:57.681572 157652 storage/queue.go:793 [n1,replicate,s1,r18938/1:/Table/57/1/40{19/3/…-88/6/…}] change replicas of r18938 failed: descriptor changed: [expected] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n3,s3):3, (n2,s2):4, next=5, gen=0] != [actual] r18938:/Table/57/1/40{19/3/-1769/2/0-88/6/-2435/1/0} [(n1,s1):1, (n4,s4):2, (n2,s2):4, next=5, gen=0]: unexpected value: raw_bytes:"\302\233\213t\003\010\372\223\001\022\013\301\211\367\017\263\213\206\371\027\212\210\032\013\301\211\367\017\370\216\206\366}\211\210\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\002\"\006\010\002\020\002\030\004(\005" timestamp:<wall_time:1539644757654205944 >

This is also expected under heavy rebalancing, but again I'm not sure it belongs in the logs in this way.
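Mechanically, the "descriptor changed" error is a failed compare-and-swap: the replica change commits only if the range descriptor still matches what the planner read, and a concurrent rebalance invalidates that expectation. A minimal sketch of the pattern (hypothetical types, not the real storage code):

// Hypothetical CAS sketch of change-replicas, not CockroachDB's real code.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type rangeDesc struct {
	replicas   []int // node IDs, simplified
	generation int
}

type kvStore struct {
	mu   sync.Mutex
	desc rangeDesc
}

var errDescriptorChanged = errors.New("descriptor changed")

// conditionalSwap mimics a conditional put: write newDesc only if the
// stored descriptor still equals expected. (Sprint comparison is a
// shortcut because rangeDesc contains a slice and isn't comparable.)
func (s *kvStore) conditionalSwap(expected, newDesc rangeDesc) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if fmt.Sprint(s.desc) != fmt.Sprint(expected) {
		return fmt.Errorf("%w: [expected] %v != [actual] %v",
			errDescriptorChanged, expected, s.desc)
	}
	s.desc = newDesc
	return nil
}

func main() {
	s := &kvStore{desc: rangeDesc{replicas: []int{1, 4, 3, 2}}}

	// The planner reads the descriptor...
	expected := rangeDesc{replicas: []int{1, 4, 3, 2}}

	// ...but a concurrent change removes n3 first.
	_ = s.conditionalSwap(expected, rangeDesc{replicas: []int{1, 4, 2}, generation: 1})

	// The original change now fails and must restart from the new descriptor.
	fmt.Println(s.conditionalSwap(expected, rangeDesc{replicas: []int{1, 4, 3, 2, 5}, generation: 1}))
}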

pq: setting updated but timed out waiting to read new value FAIL

The timeout used for setting a cluster setting was way too short. @nvanbenschoten increased it in the commit below, which you may not have in your build yet (it landed six days ago).

7a79a38
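Roughly, the failure mode looks like this (a toy sketch under assumed mechanics; waitForSetting is made up, not the real code): the SET CLUSTER SETTING write commits, then the session polls its locally cached value until it reflects the update. If propagation is slow on a big, busy cluster and the deadline is short, the write succeeds but the wait times out with exactly this error.

// Hypothetical sketch of the wait-for-new-value timeout, not the real code.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func waitForSetting(read func() string, want string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if read() == want {
			return nil // new value is visible locally
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("setting updated but timed out waiting to read new value")
}

func main() {
	var cached atomic.Value
	cached.Store("off")

	// Simulate slow propagation: the locally cached value only
	// reflects the committed update after 200ms.
	go func() {
		time.Sleep(200 * time.Millisecond)
		cached.Store("on")
	}()

	read := func() string { return cached.Load().(string) }

	// A too-short wait fails even though the write itself succeeded.
	fmt.Println(waitForSetting(read, "on", 50*time.Millisecond))

	// A more generous wait (the gist of the linked commit) succeeds.
	fmt.Println(waitForSetting(read, "on", time.Second))
}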

I think the cluster name you gave above is incorrect. I see n4 still running on that cluster.

@awoods187
Copy link
Contributor Author

No, it's definitely not incorrect. I have two 24-node clusters up--this one has one dead node and the other one (in #31458) has two dead nodes.

tbg (Member) commented Oct 16, 2018

Ah, they're interchanged. Look here:

failed to connect to {andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004:26257 0

andy-1539640780-tpccbench-nodes-24-cpu-16-partition

Time to take a look at andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004.

tbg (Member) commented Oct 16, 2018

n4 OOMed. The heap profile looks exactly like #31409 (comment). Folding into that issue.

@tbg tbg closed this as completed Oct 16, 2018