One dead node/lost connection to entire cluster during tpcc #31415
Comments
Logs 13-24 |
Seeing lots of warnings: Seeing this error: And this one: |
andy-1539640780-tpccbench-nodes-24-cpu-16-partition |
This means that node 4 has crashed (or was shut down, but pretty sure it crashed)
This is not desirable, but can happen when there is a lot of rebalancing activity. Not sure this really ought to be an error...
This is also expected under heavy rebalancing, but again I'm not sure it belongs in the logs in this way.
The timeout used for setting a cluster setting was way too short. @nvanbenschoten increased it in the commit below, which you maybe don't have in your version yet (it landed 6 days ago). I think the cluster name you gave above is incorrect. I see n4 still running on that cluster. |
No, it's definitely not incorrect. I have two 24-node clusters up--this one has one dead node and the other one (in #31458) has two dead nodes. |
Ah, they're interchanged. Look here:
Time to take a look at andy-1539640859-tpccbench-nodes-6-cpu-16-partition-0004. |
n4 oomed. The heap profile looks exactly like #31409 (comment). Folding into that issue. |
Describe the problem
In the middle of finishing the restore for tpcc, I lost connection to my nodes:
--- FAIL: tpccbench/nodes=24/cpu=16/partition (12413.42s)
    test.go:570,cluster.go:1318,tpcc.go:662,tpcc.go:333: pq: setting updated but timed out waiting to read new value
FAIL
To Reproduce
Use roachtest from two days ago
Change test to:
Run:
bin/roachtest bench '^tpccbench/nodes=24/cpu=16/partition$$' --wipe=false --user=andy
Expected behavior
Passing tpc-c
Additional data / screenshots
After a few minutes everything is lost:
Environment:
Logs 1-12
Logs.zip