config: increase default election timeout to 1200 ms #893

Closed
wants to merge 1 commit into from

marineam
Contributor

This is the value we have been using on EC2, OpenStack, and Rackspace.
Part of the reason why I have been able to trigger a bug in go-raft
seems to be the leader thrashing that is happening on my GCE test
clusters right now.

Mostly I am posting this to start a discussion on what the default
timeouts really should be. The current values are clearly not well
suited for virtual machines, which are our primary target. The tuning
document describes determining values in terms of ping time, but the ping
time between these GCE systems is under 1 ms.

So what should the values actually be? What should the tuning document
suggest as a metric for choosing more precise values?

@marineam
Contributor Author

@kelseyhightower pointed out that he was not able to get GCE f1-micro instances to work with any value, and those are the instances I'm triggering my problems on. So this change won't help those instances, but I would like to know whether the 1200 ms value we are using on many systems is actually reasonable or not.

@kelseyhightower
Contributor

@marineam That's correct. Now that I'm using the g1-small on GCE, normal operation works without issue and the last upgrade went smoothly. All with defaults for etcd.

@marineam marineam closed this Jul 16, 2014
@philips
Contributor

philips commented Jul 16, 2014

@kelseyhightower @marineam We should email the GCE folks to figure out what might be the problem.

@marineam
Contributor Author

Moving the 1200ms question to coreos/bugs#76

@marineam marineam deleted the timeout branch July 16, 2014 04:38
@marineam
Contributor Author

@philips I dunno, maybe we want GCE to keep those instances shitty since they have been a good test case for poking holes in go-raft and locksmith ;-)

3 participants