config: increase default election timeout to 1200 ms #893

marineam · 2014-07-16T04:11:17Z

This is the value we have been using on EC2, OpenStack, and Rackspace.
Part of the reason why I have been able to trigger a bug in go-raft
seems to be the leader thrashing that is happening on my GCE test
clusters right now.

Mostly I am posting this to start a discussion on what the default
timeouts really should be. The current values are clearly not well
suited for virtual machines, which are our primary target. The tuning
document describes determining values in terms of ping time, but the ping
time between these GCE systems is under 1 ms.

So what should the values actually be? What should the tuning document
suggest as a metric for choosing more precise values?


marineam · 2014-07-16T04:22:04Z

@kelseyhightower pointed out that he was not able to get GCE f1-micro instances to work with any value, and that's what I'm triggering my problems on. So this change won't help those instances, but I would like to know whether the 1200 ms value we are using on many systems is actually reasonable or not.

kelseyhightower · 2014-07-16T04:26:03Z

@marineam That's correct. Now that I'm using the g1-small on GCE, normal operation works without issue and the last upgrade went smoothly. All with defaults for etcd.

philips · 2014-07-16T04:37:09Z

@kelseyhightower @marineam We should email the GCE folks to figure out what might be the problem.

marineam · 2014-07-16T04:38:21Z

Moving the 1200ms question to coreos/bugs#76

marineam · 2014-07-16T04:42:33Z

@philips I dunno, maybe we want GCE to keep those instances shitty since they have been a good test case for poking holes in go-raft and locksmith ;-)

marineam closed this Jul 16, 2014

marineam deleted the timeout branch July 16, 2014 04:38

arohter mentioned this pull request Jul 17, 2014

Leadership thrashing and heartbeat near election timeout warnings #868

Closed
