Benchmarking is causing cluster to become unresponsive #7471
Comments
Can you double-check with other client requests?
We use this tool quite a lot, but I've never had this issue.
@davissp14 Can you provide us with detailed steps to reproduce this issue (including your platform information)?
I am running 3 nodes in a containerized environment (LXC) fronted with 2 etcd proxies; each node is on a separate EC2 host running Ubuntu 14.04. The 3 member nodes are running with 256MB memory (limit set by cgroups) and have no CPU / disk / IOPS limits in place. The only node that appears to lock up is the leader. Commands run just fine on all non-leader nodes, while commands on the leader hang indefinitely. I am able to reproduce the issue consistently by simply running the benchmark command against an empty etcd cluster on the setup described above. **~ Boot Info ~** **~ Strace of hanging leader node ~**
I went ahead and updated to 3.1.2 to see if it helped. While it's not hanging like it was in 3.1.0, I am now seeing a panic.
Which commands? Do you mean benchmark commands or etcdctl commands? Trying to figure out whether it's the etcd server or the benchmark tool that is hanging.
@davissp14 Is this cluster totally fresh? That path would only crash if auth is enabled... Is the hang sensitive to the arguments? Does
Commands sent to the leader node, or commands running on the leader node? Thanks!
@gyuho Both the benchmark commands and the etcdctl commands hang on the leader node in 3.1.0. I am able to write to the non-leader nodes without issues, which seems strange. It appears that the server-to-server communication on 2380 is working just fine; it's only client requests on 2379 that are being blocked.
The last benchmark I ran actually managed to throw 2 of the 3 nodes into an unresponsive state (leader included). I am still able to read and write to the responsive node without problems.
After reporting that the one responsive node is healthy, the
@heyitsanthony This was a fresh cluster with auth enabled. I will play with the
UPDATE
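For anyone trying to narrow this down, checking each member's client endpoint individually is a quick way to see which nodes still answer on 2379 while peer traffic on 2380 keeps flowing. A rough sketch (endpoint addresses are placeholders, not the reporter's actual hosts; etcd 3.1's etcdctl defaults to the v2 API unless `ETCDCTL_API=3` is set):

```sh
# v2 API (the default for etcdctl shipped with etcd 3.1)
etcdctl --endpoints=http://10.0.0.1:2379 cluster-health

# v3 API: probe one member at a time to see which endpoints hang
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.0.1:2379 endpoint health
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.0.2:2379 endpoint health
```

A member whose peer port is healthy but whose client port times out matches the behavior described above.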
Here is the output of
@davissp14 Seems like a bug in the Auth layer. Can you try disabling auth to see if it happens again?
There's a deadlock in |
@xiang90 Confirmed it works just fine when disabling Auth. |
Just use the mutex instead. Fixes etcd-io#7471
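The one-line fix ("Just use the mutex instead") points at a classic Go pitfall: `sync.RWMutex` is not re-entrant, and once a writer is queued, new readers block too, so a goroutine that re-acquires a read lock it already holds deadlocks the whole store. A minimal sketch of the safe pattern, assuming hypothetical names (this is not etcd's actual auth-store code):

```go
package main

import (
	"fmt"
	"sync"
)

// store sketches the fixed locking discipline: a plain sync.Mutex plus
// the convention that "*Locked" helpers never lock again. With an
// RWMutex, a nested RLock from a method already holding RLock can
// deadlock as soon as a writer is waiting, which matches the reported
// symptom of client requests hanging only when auth is enabled.
type store struct {
	mu    sync.Mutex // plain mutex: no reader re-entrancy to get wrong
	users map[string]bool
}

// isEnabledLocked assumes the caller already holds s.mu, so it must not
// acquire the lock itself.
func (s *store) isEnabledLocked() bool {
	return len(s.users) > 0
}

func (s *store) Authenticate(user string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.isEnabledLocked() { // safe: no second acquisition of s.mu
		return true
	}
	return s.users[user]
}

func main() {
	s := &store{users: map[string]bool{"root": true}}
	fmt.Println(s.Authenticate("root"))  // true
	fmt.Println(s.Authenticate("alice")) // false
}
```

The trade-off is losing reader concurrency, but for a small auth-metadata store the simplicity (and the impossibility of the reader/writer deadlock) is worth it.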
Running into a very strange issue when benchmarking my etcd 3.1.0 3-node cluster.
There have been a few times where this completed without issue, but most of the time it puts my cluster into an unresponsive state and the benchmarking progress comes to a halt. There are no logs that indicate a problem; it simply becomes unresponsive until I start restarting nodes.
The one thing that seems to be consistent is the set of logs emitted just before it becomes unresponsive.
Please let me know if I can provide you any additional information.
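For context, a typical invocation of etcd's benchmark tool (shipped in the etcd source tree under `tools/benchmark`) looks roughly like the following; the endpoint addresses, connection counts, and sizes here are illustrative placeholders, not the exact command the reporter ran:

```sh
# build the benchmark tool from the etcd source tree
# (import path as of etcd 3.1; later releases moved to go.etcd.io)
go get github.com/coreos/etcd/tools/benchmark

# drive puts against the cluster; flag values are illustrative
benchmark \
  --endpoints=http://10.0.0.1:2379,http://10.0.0.2:2379,http://10.0.0.3:2379 \
  --conns=100 --clients=1000 \
  put --key-size=8 --sequential-keys --total=100000 --val-size=256
```

With auth enabled on 3.1.0, a run like this is what reportedly wedged the leader's client port.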