Leadership thrashing and heartbeat near election timeout warnings #868
@marc-barry
While typing this up, the leadership changed many times to different machines, so keep this in mind when reading the results from the self stats. Each machine has 32 cores and 96 GB of RAM. Here is the CPU utilization as reported by
Here is the output from the
Here are the results of collecting these stats from each node. Note that there were leadership changes while I was collecting the following results.
node01
{
"name":"node01",
"state":"follower",
"startTime":"2014-06-23T17:08:44.530204794Z",
"leaderInfo":{
"leader":"node03",
"uptime":"1m29.278701218s",
"startTime":"2014-06-25T12:22:41.775389245Z"
},
"recvAppendRequestCnt":3009522,
"recvPkgRate":20.100565290971556,
"recvBandwidthRate":777.8918767605992,
"sendAppendRequestCnt":406670
}
node02
{
"name":"node02",
"state":"follower",
"startTime":"2014-06-23T17:07:12.11602086Z",
"leaderInfo":{
"leader":"node01",
"uptime":"46.824809849s",
"startTime":"2014-06-25T12:26:41.045445911Z"
},
"recvAppendRequestCnt":2875177,
"recvPkgRate":20.100681162164545,
"recvBandwidthRate":777.896360975768,
"sendAppendRequestCnt":963892
}
node03
{
"name":"node03",
"state":"follower",
"startTime":"2014-06-23T17:05:52.575371888Z",
"leaderInfo":{
"leader":"node01",
"uptime":"407.642669ms",
"startTime":"2014-06-25T12:30:12.124416457Z"
},
"recvAppendRequestCnt":2693330,
"recvPkgRate":19.677688628106743,
"recvBandwidthRate":763.4943187705417,
"sendAppendRequestCnt":1714171
}
node04
{
"name":"node04",
"state":"follower",
"startTime":"2014-06-23T17:02:29.035340443Z",
"leaderInfo":{
"leader":"node05",
"uptime":"32.372670147s",
"startTime":"2014-06-25T12:30:42.875519676Z"
},
"recvAppendRequestCnt":2037376,
"recvPkgRate":20.100540445009845,
"recvBandwidthRate":777.9914179241059,
"sendAppendRequestCnt":4358807
}
node05
{
"name":"node05",
"state":"leader",
"startTime":"2014-06-23T17:11:27.907702227Z",
"leaderInfo":{
"leader":"node05",
"uptime":"1m25.597956104s",
"startTime":"2014-06-25T12:30:42.875152498Z"
},
"recvAppendRequestCnt":1863075,
"sendAppendRequestCnt":5016584,
"sendPkgRate":81.62929076739682,
"sendBandwidthRate":3299.4559328181795
}
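For reference, the per-node figures above look like the output of etcd's v2 stats API. Below is a minimal sketch for collecting them from each peer; the host names and the client port 4001 are assumptions based on this thread, since the exact command used isn't shown.

// collect_self_stats.go - a minimal sketch for pulling per-node self stats
// like the dumps above. Host names and client port 4001 are assumptions
// taken from the node names in this thread.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	nodes := []string{"node01", "node02", "node03", "node04", "node05"}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, n := range nodes {
		resp, err := client.Get(fmt.Sprintf("http://%s:4001/v2/stats/self", n))
		if err != nil {
			fmt.Printf("%s: %v\n", n, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s %s\n", n, body)
	}
}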
You're welcome. Thanks for the help.
The network and disk I/O could hardly be the bottleneck of this problem; the utilization of both is low enough. Could you also try to do
?
Sure. Here are the results:
{
"leader":"node03",
"followers":{
"node01":{
"latency":{
"current":0.650869,
"average":0.9658280053054744,
"standardDeviation":3.2272089958544545,
"minimum":0.291784,
"maximum":398.90756
},
"counts":{
"fail":2,
"success":502876
}
},
"node02":{
"latency":{
"current":0.891281,
"average":1.059197478397645,
"standardDeviation":3.599525384820501,
"minimum":0.294793,
"maximum":337.68265
},
"counts":{
"fail":6,
"success":502816
}
},
"node04":{
"latency":{
"current":1.084962,
"average":1.01607957382241,
"standardDeviation":3.4257407723845104,
"minimum":0.272535,
"maximum":337.845014
},
"counts":{
"fail":0,
"success":502842
}
},
"node05":{
"latency":{
"current":0.963331,
"average":3.223443277597172,
"standardDeviation":7.651197977242439,
"minimum":0.319269,
"maximum":337.696842
},
"counts":{
"fail":0,
"success":502804
}
}
}
}
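As a rough sanity check, a sketch like the one below could parse a leader-stats dump of this shape and flag any follower whose observed maximum latency gets close to the election timeout. The field names come from the dump above; the 250 ms timeout and the 80% "near" threshold are assumptions based on the settings quoted elsewhere in this issue.

// check_latency.go - reads a /v2/stats/leader style dump (as above) from
// stdin and flags followers whose maximum latency approaches the election
// timeout. The 250 ms value and the 0.8 "near" factor are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type latencyStats struct {
	Current           float64 `json:"current"`
	Average           float64 `json:"average"`
	StandardDeviation float64 `json:"standardDeviation"`
	Minimum           float64 `json:"minimum"`
	Maximum           float64 `json:"maximum"`
}

type leaderStats struct {
	Leader    string `json:"leader"`
	Followers map[string]struct {
		Latency latencyStats `json:"latency"`
	} `json:"followers"`
}

func main() {
	const electionTimeoutMs = 250.0 // from -peer-election-timeout=250
	const nearFactor = 0.8          // arbitrary "near the timeout" threshold

	var stats leaderStats
	if err := json.NewDecoder(os.Stdin).Decode(&stats); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for name, f := range stats.Followers {
		if f.Latency.Maximum >= nearFactor*electionTimeoutMs {
			fmt.Printf("%s: max latency %.1f ms exceeds %.0f%% of the %.0f ms election timeout\n",
				name, f.Latency.Maximum, nearFactor*100, electionTimeoutMs)
		}
	}
}

It could be fed, for example, with the JSON returned by the leader's /v2/stats/leader endpoint.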
node01
{
"getsSuccess":1917817,
"getsFail":3581221,
"setsSuccess":905023,
"setsFail":730,
"deleteSuccess":4737,
"deleteFail":4882,
"updateSuccess":597859,
"updateFail":339,
"createSuccess":3450,
"createFail":43551,
"compareAndSwapSuccess":0,
"compareAndSwapFail":0,
"compareAndDeleteSuccess":1945,
"compareAndDeleteFail":26,
"expireCount":15,
"watchers":0
}
node02
{
"getsSuccess":1599914,
"getsFail":2475617,
"setsSuccess":905023,
"setsFail":730,
"deleteSuccess":4737,
"deleteFail":4882,
"updateSuccess":597871,
"updateFail":339,
"createSuccess":3450,
"createFail":43551,
"compareAndSwapSuccess":0,
"compareAndSwapFail":0,
"compareAndDeleteSuccess":1945,
"compareAndDeleteFail":26,
"expireCount":15,
"watchers":1
}
node03
{
"getsSuccess":1907889,
"getsFail":2981538,
"setsSuccess":905023,
"setsFail":730,
"deleteSuccess":4737,
"deleteFail":4882,
"updateSuccess":597898,
"updateFail":339,
"createSuccess":3450,
"createFail":43551,
"compareAndSwapSuccess":0,
"compareAndSwapFail":0,
"compareAndDeleteSuccess":1945,
"compareAndDeleteFail":26,
"expireCount":487,
"watchers":0
}
node04
{
"getsSuccess":1914088,
"getsFail":2872526,
"setsSuccess":905023,
"setsFail":730,
"deleteSuccess":4737,
"deleteFail":4882,
"updateSuccess":597913,
"updateFail":339,
"createSuccess":3450,
"createFail":43551,
"compareAndSwapSuccess":0,
"compareAndSwapFail":0,
"compareAndDeleteSuccess":1945,
"compareAndDeleteFail":26,
"expireCount":487,
"watchers":2
}
node05
{
"getsSuccess":1697718,
"getsFail":2567181,
"setsSuccess":905023,
"setsFail":730,
"deleteSuccess":4737,
"deleteFail":4882,
"updateSuccess":597925,
"updateFail":339,
"createSuccess":3450,
"createFail":43551,
"compareAndSwapSuccess":0,
"compareAndSwapFail":0,
"compareAndDeleteSuccess":1945,
"compareAndDeleteFail":26,
"expireCount":472,
"watchers":3
}
I am also pretty sure that our watches have stopped firing. We use
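Since the watcher counts in the per-node store stats above disagree (0, 1, 0, 2, 3), a quick cross-node comparison could help confirm whether watches are actually being dropped. A minimal sketch, again assuming the v2 stats API on client port 4001 and the node names used above:

// compare_store_stats.go - fetches /v2/stats/store from each node and prints
// the watcher and expire counters, which are the fields that diverge above.
// Host names and the client port are assumptions based on this thread.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	nodes := []string{"node01", "node02", "node03", "node04", "node05"}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, n := range nodes {
		resp, err := client.Get(fmt.Sprintf("http://%s:4001/v2/stats/store", n))
		if err != nil {
			fmt.Printf("%s: %v\n", n, err)
			continue
		}
		var stats map[string]int64
		if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
			fmt.Printf("%s: decode error: %v\n", n, err)
		} else {
			fmt.Printf("%s: watchers=%d expireCount=%d\n", n, stats["watchers"], stats["expireCount"])
		}
		resp.Body.Close()
	}
}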
I don't think Marc has mentioned that etcd CPU usage appears to grow steadily. Since he posted this yesterday, our etcd processes on the 5 nodes have grown to 8.0%, 8.4%, 7.7%, 7.4%, and 7.8% CPU usage. Each number is quite stable and doesn't fluctuate much on a second-to-second basis. When we upgraded from 0.4.1 to 0.4.3, I noticed at least one etcd process with over 21% CPU usage.
Burke is correct. Check out this CPU usage. All nodes are over 30% CPU usage.
node01:
node02:
node03:
node04:
node05:
Here is a
{
"leader":"node04",
"followers":{
"node01":{
"latency":{
"current":1.054622,
"average":1.0004305487068865,
"standardDeviation":4.018218436601287,
"minimum":0.264388,
"maximum":472.419893
},
"counts":{
"fail":67,
"success":1794038
}
},
"node02":{
"latency":{
"current":0.638996,
"average":1.0337069312046558,
"standardDeviation":5.082345637648887,
"minimum":0.240994,
"maximum":472.119836
},
"counts":{
"fail":61,
"success":1793304
}
},
"node03":{
"latency":{
"current":0.702603,
"average":0.9834211504224556,
"standardDeviation":3.366540155673035,
"minimum":0.273594,
"maximum":452.885206
},
"counts":{
"fail":17,
"success":1794799
}
},
"node05":{
"latency":{
"current":0.863074,
"average":3.3656152790531073,
"standardDeviation":8.571406992641448,
"minimum":0.28737,
"maximum":476.095643
},
"counts":{
"fail":69,
"success":1793210
}
}
}
}
We are now seeing a new leader elected more than once per minute. Actually, I would say it is happening 2-3 times per minute.
@marc-barry @burke The maximum latency is surprisingly high, and this is the root cause of the frequent leader changes. The latency is calculated in this way: take time points before and after
@xiangli-cmu @philips Ideas?
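Concretely, that measurement pattern looks roughly like the sketch below (illustrative only, not etcd's actual code; sendAppendRequest is a hypothetical stand-in for the real peer call):

// A sketch of measuring per-request latency by taking time points before and
// after each call, as described above. The aggregation mirrors the average
// and maximum fields in the leader stats, but this is not etcd's code.
package main

import (
	"fmt"
	"time"
)

// sendAppendRequest is a placeholder for the real append-entries RPC.
func sendAppendRequest() {
	time.Sleep(2 * time.Millisecond) // stand-in for network and disk work
}

func main() {
	var maxMs, totalMs float64
	const n = 100
	for i := 0; i < n; i++ {
		start := time.Now() // time point before
		sendAppendRequest()
		elapsedMs := float64(time.Since(start)) / float64(time.Millisecond) // time point after
		totalMs += elapsedMs
		if elapsedMs > maxMs {
			maxMs = elapsedMs
		}
	}
	fmt.Printf("average %.3f ms, maximum %.3f ms over %d requests\n", totalMs/n, maxMs, n)
}

Because the interval spans everything between those two time points, anything that stalls the process in between (GC pauses, a blocked event loop, disk syncs) would show up as latency even if packet captures look clean at the network layer.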
We really don't know why this is the case. The etcd clients have remained constant. There are a few watches with
The performance issues (including CPU utilization) also happened with 0.4.1. I was hoping an upgrade from 0.4.1 to 0.4.3 would make the problem go away, but it didn't.
On one hand, with time split between
I grabbed a list of running goroutines about a minute apart:
...and a diff:
17d16
< 349234 waiting runtime.park
45a45,59
> 348428 waiting runtime.park
> 287810 waiting runtime.park
> 323905 waiting runtime.park
> 253131 waiting runtime.park
> 323556 waiting runtime.park
> 273845 waiting runtime.park
> 253307 waiting runtime.park
> 340095 waiting runtime.park
> 146799 waiting runtime.park
> 282403 waiting runtime.park
> 146800 waiting runtime.park
> 212194 waiting runtime.park
> 302501 waiting runtime.park
> 242746 waiting runtime.park
> 279429 waiting runtime.park
60d73
< 347734 waiting runtime.park
65d77
< 349378 waiting runtime.park
77c89
< 343936 waiting runtime.park
---
> 349566 waiting runtime.park
80a93
> 349547 waiting runtime.park
...so there doesn't appear to be an unreasonable number of goroutines running concurrently, and the IDs aren't high enough to indicate a problem with really fast goroutine creation/termination.
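The thread doesn't say how these goroutine lists were captured; one generic way to take comparable snapshots from a Go process is the runtime/pprof goroutine profile, sketched below (file names and timing are arbitrary, and this is not something etcd 0.4 necessarily exposes out of the box):

// goroutine_snapshots.go - a generic sketch of dumping the live goroutines of
// a Go program twice, a minute apart, so the two files can be diffed much
// like the comparison above. Illustrative only; not etcd's own mechanism.
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func dump(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=1 prints one entry per unique stack with a goroutine count,
	// which keeps the output easy to diff.
	return pprof.Lookup("goroutine").WriteTo(f, 1)
}

func main() {
	if err := dump("goroutines-1.txt"); err != nil { // hypothetical file name
		log.Fatal(err)
	}
	time.Sleep(time.Minute)
	if err := dump("goroutines-2.txt"); err != nil {
		log.Fatal(err)
	}
	// Compare the two snapshots with: diff goroutines-1.txt goroutines-2.txt
}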
I take it back; it doesn't always grow stably: last restart was June 23, and CPU usage currently ranges from 12.6-14.0% across nodes.
Here is the CPU usage across all five nodes in the cluster since we started monitoring. It looks like it is increasing linearly over time.
I restarted the etcd service on one of the machines last night and this is the CPU utilization profile (over about 12 hours). This doesn't seem normal.
Here is some profile data for one of the nodes. The profiling happened over a longer period (i.e. a few hours):
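For the next profiling round, a CPU profile of a Go process can be captured with the standard runtime/pprof package; the sketch below shows that general approach only (the output path and the shortened duration are made up, and the thread doesn't say how this particular profile was collected):

// cpu_profile.go - a generic sketch of capturing a CPU profile over a window
// of time with runtime/pprof. The output path and duration are arbitrary;
// applying this to etcd would require a build with profiling wired in.
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	f, err := os.Create("cpu.prof") // hypothetical output path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	// The profile mentioned above ran for a few hours; a couple of minutes
	// keeps this sketch quick to try.
	time.Sleep(2 * time.Minute)
	pprof.StopCPUProfile()
	// Inspect the result with: go tool pprof <binary> cpu.prof
}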
We are going to try bumping
@marc-barry We are working on this issue. Can you give me your email address? You can send me an email at xiang.li@coreos.com. Thanks!
@xiangli-cmu We are seeing the same issue with 0.4.2. CPU usage is 100% and the leader elections are occurring several times per minute. Let me know if you need any more data on this.
Same problem here. Brand new 3-node v0.4.5 EC2 cluster; no actual clients / usage. Seems related: #893
Cool. We'll try compiling latest master tomorrow and give it a spin!
@xiangli-cmu @arohter @klizhentas #900 seems to have solved this issue for us. CPU utilization no longer grows linearly with time. Leadership does change a few times (about 4) per day. We are currently running bdeb96b until this makes its way into a release.
Any ideas on when there will be a new release with this fix? Our CoreOS clusters are being badly affected by it.
@jeffbaier It should show up in alpha a few days after I ship etcd 0.4.6 today.
I am going to close this out based on @marc-barry's testing of #900. This fix will be in etcd 0.4.6.
While the fix in #900 definitely helped reduce the (re)election rate (and CPU time), we're still seeing quite a bit of leader fragility and flapping. #915 is very similar. We're seeing hundreds of election cycles when doing relatively low amounts of writes, or really anything that puts a bit of real load on the system. It kind of feels like the heartbeating is being blocked/stalled waiting for the etcd process to complete some work.
I apologize for my lack of comments as I have been on vacation. As @arohter mentioned, #900 has definitely helped. Our etcd cluster peers have been running the code from #900 since July 21 without restarting. We too are seeing fragility and flapping in leadership. In addition, the CPU utilization has still been growing linearly over time. Here is a graph of the cluster's CPU utilization from July 21 to August 13. The trend is pretty evident. I am going to dump the CPU profile tomorrow and see what etcd has been doing. I think #915 is the same as the issue we are seeing now. We have
@marc-barry We were seeing a similar situation where we would have a steady CPU usage increase until the entire cluster was pegging the CPU on the boxes. What fixed it for us was increasing the
I have also increased the heartbeat timeout and the peer election timeout, but I didn't increase them as drastically (from the defaults) as you did. I imagine that doing what you did will in fact stop the leadership elections and may postpone the CPU increase. But there is still something not quite right. I am unsure how 5 peers on very fast machines within the same rack can see such high delays. We have already verified with packet capture that the delays aren't network related. I need to better understand some comments in the tuning guide (https://github.com/coreos/etcd/blob/master/Documentation/tuning.md) such as
Does this mean that increasing the heartbeat as you have will result in a cluster with longer periods of time during which a peer (follower) can be inconsistent with the leader?
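A rough way to reason about that question, assuming the usual Raft-style behaviour the tuning guide describes (the leader replicates on each heartbeat, and a follower only gives up on the leader after a full election timeout of silence), is sketched below; the larger pair of values is hypothetical, just to show the effect of raising the settings.

// A back-of-the-envelope sketch (not etcd code) of how the heartbeat interval
// and election timeout bound how far a follower can fall behind. Assumes the
// leader sends entries with each heartbeat and that a follower waits a full
// election timeout before reacting to a silent leader.
package main

import (
	"fmt"
	"time"
)

func describe(heartbeat, electionTimeout time.Duration) {
	fmt.Printf("heartbeat=%v election-timeout=%v\n", heartbeat, electionTimeout)
	// Healthy case: a follower is at most roughly one heartbeat behind.
	fmt.Printf("  typical follower lag:  up to ~%v\n", heartbeat)
	// Degraded case: a follower can go this long without hearing from the
	// leader before anything forces a change, so raising the timeout widens
	// the window in which it may silently lag.
	fmt.Printf("  undetected lag window: up to ~%v\n", electionTimeout)
}

func main() {
	describe(50*time.Millisecond, 250*time.Millisecond) // settings from this issue
	describe(300*time.Millisecond, 3*time.Second)       // hypothetical larger settings
}

So, loosely speaking, yes: a larger election timeout trades faster failure detection (and a tighter bound on how stale a follower can silently get) for fewer spurious elections.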
I have the same issue with 440 alpha in a cluster of 3 right now.
I have the same issue on v0.4.6 with a 9-node peer cluster with lots of standbys (~35). We are running a cross data-centre cluster with heartbeat_timeout=400 and election_timeout=6000. etcd hits 100% CPU on the leader node.
We are running etcd in production on a cluster of 5 machines. These are fairly powerful machines which aren't heavily loaded. 4 of the machines are in the same rack; the other is in the same data center. Ping times between the 5 nodes don't exceed 0.3 ms. We are currently running version v0.4.3 on all nodes.
We have been seeing issues whereby leadership changes occur many times in an hour. Other times, leadership remains constant for a longer period of time. But generally, leadership changes more than once per day.
The following is a log from one of our nodes. A message like the following appears quite often, and it is occurring on all nodes:
warning: heartbeat near election timeout: 230.012467ms
Note that the log capture below even shows a leadership change event. In terms of the interval and timeouts, these are the settings we are using (on all nodes):
-peer-heartbeat-interval=50
-peer-election-timeout=250
We have performed packet capture between the nodes and they are (at the networking layer) sending and receiving the heartbeats at a 50 ms interval.
As of right now etcd has not been stable for us. This has caused issues for our etcd clients. I am trying to figure out why this is happening and what next steps I can take to troubleshoot the issue and stabilize our etcd production cluster.