-
Notifications
You must be signed in to change notification settings - Fork 4.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make raft snapshot commit threshold configurable #4105
Conversation
agent/consul/config.go
Outdated
@@ -448,9 +448,12 @@ func DefaultConfig() *Config { | |||
// Disable shutdown on removal | |||
conf.RaftConfig.ShutdownOnRemove = false | |||
|
|||
// Check every 5 seconds to see if there are enough new entries for a snapshot | |||
// Check every 5 seconds to see if there are enough new entries for a snapshot, can be overridden | |||
conf.RaftConfig.SnapshotInterval = 5 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should bump this back up again to something like 30-60s just to make things a little nicer in the case where a busy cluster hasn't configured this and gets really frequent snapshots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plan was to set this to 30 seconds and the threshold to 16384
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't want to block this if there is consensus already but some thoughts inline.
agent/config/config.go
Outdated
@@ -267,6 +269,7 @@ type Consul struct { | |||
ElectionTimeout *string `json:"election_timeout,omitempty" hcl:"election_timeout" mapstructure:"election_timeout"` | |||
HeartbeatTimeout *string `json:"heartbeat_timeout,omitempty" hcl:"heartbeat_timeout" mapstructure:"heartbeat_timeout"` | |||
LeaderLeaseTimeout *string `json:"leader_lease_timeout,omitempty" hcl:"leader_lease_timeout" mapstructure:"leader_lease_timeout"` | |||
SnapshotThreshold *int `json:"snapshot_threshold,omitempty" hcl:"snapshot_threshold" mapstructure:"snapshot_threshold"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure you tested this so there is some other place the interval is passed to the server but just thought I'd mention in case we missed updating this to add Interval too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this can be removed. This specific struct is for config values that are controlled/changed via raftMultiplier
|
||
* <a name="_raft_snapshot_interval"></a><a href="#_raft_snapshot_interval">`-raft-snapshot-interval`</a> - This | ||
controls how often servers check if they need to save a snapshot to disk. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should actually not document these immediately.
I know Armon was concerned about them being kinda low level and as soon as we document them we are really committed to keeping them working/available. Once we have proper WAL we shouldn't really need to allow user to tune these.
I think if we do keep them though we should be more explicit about when you'd need to change them and what effect they have.
Something like:
-raft-snapshot-threshold
- This controls the minimum number of raft commit entries between snapshots that are saved to disk. This is a low-level parameter that should rarely need to be changed. Very busy clusters experiencing excessive disk IO because the servers are constantly snapshotting may increase this to reduce disk IO and increase the chance that multiple servers are not snapshotting at the same time. Increasing this trades off disk IO for disk space since the log will grow much larger and the space in the raft.db file can never be reclaimed. Servers may take longer to recover from crashes or failover if this is increased significantly as more logs will need to be replayed.
-raft-snapshot-interval
- This controls how often servers check if they need to save a snapshot to disk. This is a low-level parameter that should rarely need to be changed. Very busy clusters experiencing excessive disk IO because the servers are constantly snapshotting may increase this to reduce disk IO and increase the chance that multiple servers are not snapshotting at the same time. Increasing this trades off disk IO for disk space since the log will grow much larger and the space in the raft.db file can never be reclaimed. Servers may take longer to recover from crashes or failover if this is increased significantly as more logs will need to be replayed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
0fcf701
to
ca67094
Compare
This PR adds
raft_snapshot_threshold
as a new configurable param to pass down to the raft layer. This was previously always set to 8192 and not modifiable by Consul operators. This change should help operators of Consul in larger installations with lots of writes to have better control over how often snapshots are taken.