Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make raft snapshot commit threshold configurable #4105

Merged
merged 5 commits into from
May 11, 2018

Conversation

preetapan
Copy link
Contributor

This PR adds raft_snapshot_threshold as a new configurable param to pass down to the raft layer. This was previously always set to 8192 and not modifiable by Consul operators. This change should help operators of Consul in larger installations with lots of writes to have better control over how often snapshots are taken.

@preetapan preetapan added this to the 1.1.0 milestone May 10, 2018
@preetapan preetapan requested review from banks and kyhavlov May 10, 2018 15:25
@@ -448,9 +448,12 @@ func DefaultConfig() *Config {
// Disable shutdown on removal
conf.RaftConfig.ShutdownOnRemove = false

// Check every 5 seconds to see if there are enough new entries for a snapshot
// Check every 5 seconds to see if there are enough new entries for a snapshot, can be overridden
conf.RaftConfig.SnapshotInterval = 5 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should bump this back up again to something like 30-60s just to make things a little nicer in the case where a busy cluster hasn't configured this and gets really frequent snapshots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plan was to set this to 30 seconds and the threshold to 16384

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't want to block this if there is consensus already but some thoughts inline.

@@ -267,6 +269,7 @@ type Consul struct {
ElectionTimeout *string `json:"election_timeout,omitempty" hcl:"election_timeout" mapstructure:"election_timeout"`
HeartbeatTimeout *string `json:"heartbeat_timeout,omitempty" hcl:"heartbeat_timeout" mapstructure:"heartbeat_timeout"`
LeaderLeaseTimeout *string `json:"leader_lease_timeout,omitempty" hcl:"leader_lease_timeout" mapstructure:"leader_lease_timeout"`
SnapshotThreshold *int `json:"snapshot_threshold,omitempty" hcl:"snapshot_threshold" mapstructure:"snapshot_threshold"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure you tested this so there is some other place the interval is passed to the server but just thought I'd mention in case we missed updating this to add Interval too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this can be removed. This specific struct is for config values that are controlled/changed via raftMultiplier


* <a name="_raft_snapshot_interval"></a><a href="#_raft_snapshot_interval">`-raft-snapshot-interval`</a> - This
controls how often servers check if they need to save a snapshot to disk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should actually not document these immediately.

I know Armon was concerned about them being kinda low level and as soon as we document them we are really committed to keeping them working/available. Once we have proper WAL we shouldn't really need to allow user to tune these.

I think if we do keep them though we should be more explicit about when you'd need to change them and what effect they have.

Something like:

  • -raft-snapshot-threshold - This controls the minimum number of raft commit entries between snapshots that are saved to disk. This is a low-level parameter that should rarely need to be changed. Very busy clusters experiencing excessive disk IO because the servers are constantly snapshotting may increase this to reduce disk IO and increase the chance that multiple servers are not snapshotting at the same time. Increasing this trades off disk IO for disk space since the log will grow much larger and the space in the raft.db file can never be reclaimed. Servers may take longer to recover from crashes or failover if this is increased significantly as more logs will need to be replayed.

  • -raft-snapshot-interval - This controls how often servers check if they need to save a snapshot to disk. This is a low-level parameter that should rarely need to be changed. Very busy clusters experiencing excessive disk IO because the servers are constantly snapshotting may increase this to reduce disk IO and increase the chance that multiple servers are not snapshotting at the same time. Increasing this trades off disk IO for disk space since the log will grow much larger and the space in the raft.db file can never be reclaimed. Servers may take longer to recover from crashes or failover if this is increased significantly as more logs will need to be replayed

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@preetapan preetapan force-pushed the f-raft-threshold-config branch from 0fcf701 to ca67094 Compare May 11, 2018 15:46
@preetapan preetapan merged commit 4c2c4c8 into master May 11, 2018
@preetapan preetapan deleted the f-raft-threshold-config branch May 11, 2018 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants