Seeing election rate ramp up with time #904
Comments
@onsi Have you ever seen this in the previous version of etcd?
No, we have not. And I'm quite confident that we've successfully seen etcd run for longer than 7 days with the older versions we've deployed.
hey @xiangli-cmu It looks like it's starting to ramp up on that third environment I mentioned (the really large one). Is there anything diagnostic you'd like me to look at? I'm going to get memory and CPU usage metrics this time (I failed to do that last time, durp).
sadly we aren't running with …
Another update. Our third (large) environment has hit the point (7 days!) where this became an issue. This one's really interesting. As you can see, there is a long time period (over a month) where etcd did not show this issue. This is etcd 9af9438 (basically 0.3). The dashed blue line is where we deployed v0.4.3; the upgrade goes fine. However, now with v0.4.3 we see that after 7 days of uptime the # of elections starts to skyrocket. So -- it looks like this was an issue introduced between ~0.3 and v0.4.3. Some more data:
I can send you these …
hey @unihorn we're not using the standby feature. The instances were swapping leadership roles (so leader <=> follower). I've uploaded the logs and trace dumps. The memory leak may well be due to #900 -- but:
Perhaps it's garbage collection in the presence of millions of leaking timers that's causing the performance issue? Unclear...
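To make the timer-leak theory concrete, here's a minimal Go sketch of the general pattern (this is not etcd's actual code; the loop shape and the 10-minute duration are made up for illustration). Calling `time.After` inside a select loop allocates a fresh runtime timer on every iteration, and each timer stays on the runtime's timer heap until it expires, so a hot loop can pile up millions of them:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakyLoop: time.After allocates a new timer on every iteration. When the
// hot path fires far more often than the timeout, those timers accumulate
// on the runtime timer heap until they eventually expire.
func leakyLoop(events <-chan struct{}, done <-chan struct{}) {
	for {
		select {
		case <-events:
			// hot path
		case <-time.After(10 * time.Minute): // fresh timer every iteration
			// timeout path
		case <-done:
			return
		}
	}
}

// fixedLoop: reuse a single timer and reset it, so the loop allocates one
// timer for its whole lifetime instead of one per iteration.
func fixedLoop(events <-chan struct{}, done <-chan struct{}) {
	timeout := time.NewTimer(10 * time.Minute)
	defer timeout.Stop()
	for {
		select {
		case <-events:
			if !timeout.Stop() {
				<-timeout.C // drain a timer that already fired
			}
			timeout.Reset(10 * time.Minute)
		case <-timeout.C:
			timeout.Reset(10 * time.Minute)
		case <-done:
			return
		}
	}
}

func main() {
	events, done := make(chan struct{}), make(chan struct{})
	go leakyLoop(events, done)
	for i := 0; i < 1000000; i++ {
		events <- struct{}{}
	}
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Println("heap objects after 1e6 iterations:", m.HeapObjects)
	close(done)
}
```

If something along these lines is what #900 fixed, millions of pending timers would explain both the heap growth and the extra GC/scheduler work that could delay heartbeats long enough to trigger elections.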
@onsi I check the …
I agree with your intuition. I'll try to validate it tomorrow by … Were you expecting to see a leaking goroutine for each timer? Do you know …

Onsi
thanks @marc-barry. Has it been validated that the timer leak fix fixes the election issue? We …

onsi
Just ran an experiment to verify that the timer leak is causing this. I rigged etcd to leak a timer very frequently and saw elections set in after about 2.5e6 timer allocations. Based on the constants in etcd, it's currently leaking 2.5 timers a second, so it takes 1e6 seconds (~11 days) to get to leader elections. Given my laptop is faster than the AWS instances we run on, I think this is in the right ballpark of what we're seeing (7 days). Gonna mark this as closed.
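For anyone following along, the back-of-envelope math in the comment above works out like this (just the quoted numbers, nothing etcd-specific):

```go
package main

import "fmt"

func main() {
	const (
		timersUntilElections = 2.5e6 // leaked timer allocations before elections set in
		leakRatePerSecond    = 2.5   // observed leak rate: timers per second
	)
	seconds := timersUntilElections / leakRatePerSecond
	days := seconds / (60 * 60 * 24)
	fmt.Printf("%.0f seconds ≈ %.1f days\n", seconds, days) // 1000000 seconds ≈ 11.6 days
}
```

1e6 seconds is about 11.6 days, which is indeed in the same ballpark as the ~7 days observed on slower hardware.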
@onsi So glad this could be solved!
This is etcd v0.4.3 with a leader election timeout of 1s.
We had a cluster of 3 etcd nodes under relatively light (but constant) load with an uptime of 14 days.
We started to see some etcd performance issues and investigated. Turned out that etcd was performing an election every ~minute and was on its ~19,000th term.
None of the stats from the etcd stats endpoints seemed concerning. Our latencies looked like:
When we hit each node with a SIGQUIT we saw very many (~1000) goroutines that looked like:
The SIGQUIT subsequently killed each node (grrr...). When the nodes were restarted, the system recovered and the elections were back under control.
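Side note for anyone else debugging this: SIGQUIT makes the Go runtime print all goroutine stacks and then exit, which is why it took the nodes down. If the process exposes `net/http/pprof` you can get the same dump without killing it. I don't know whether etcd 0.4.x wires this up, so the snippet below is just a generic Go sketch (the 6060 port is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// A full goroutine dump (roughly what SIGQUIT prints) is then available,
	// without killing the process, at:
	//   curl "http://127.0.0.1:6060/debug/pprof/goroutine?debug=2"
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
```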
We decided to dig deeper. We took our etcd logs and made a plot of election term vs time:
The x-axis is time in days. The y-axis is the term number. The vertical lines correspond to etcd cluster restart events (where we restarted etcd because we were redeploying our system). You'll notice two increasingly steep rises: one at day ~7, the other at day ~26. Interestingly, the gap between each restart (at day 0 and day 19) and the subsequent rise in elections is roughly the same: 7 days.
The load on the system in question is not particularly high and is generally quite constant. In fact, we're seeing similar behavior (and time constants) on two other environments, one with dramatically less load and one with substantially more.
One last thing. We didn't run `lsof` before SIGQUIT so, unfortunately, we don't know if we were leaking file descriptors. However, one of our environments has just hit the 7-day limit (and is just starting to show signs of election churn) and `lsof` on that box looked normal. So, we don't think this is related to #880 -- but it might be, since we are on v0.4.3.
Anyone seen this? Anything else we can mine for data?
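For reference, this is roughly how we've been pulling the stats mentioned above. Treat it as a sketch: the `/v2/stats/*` paths and the default client port 4001 are my recollection of the 0.4-era setup, so adjust for your deployment.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed default client address for an etcd 0.4.x node; change as needed.
	base := "http://127.0.0.1:4001"
	for _, path := range []string{"/v2/stats/self", "/v2/stats/leader", "/v2/stats/store"} {
		resp, err := http.Get(base + path)
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s:\n%s\n\n", path, body)
	}
}
```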