
storage: tune raft.Config.{MaxSizePerMsg,MaxInflightMsgs} #10929

Merged

Conversation

@petermattis (Collaborator) commented on Nov 22, 2016

The previous settings allowed up to 256 MB of Raft log entries to be
in flight to a follower, resulting in a single Replica.handleRaftReady
call processing thousands or tens of thousands of commands.

Log the number of commands processed when Replica.handleRaftReady takes
too long.

Fixes #10917


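For context, here is a minimal sketch of how these two knobs interact, assuming the etcd/raft Config fields of the time; the surrounding setup is illustrative and not part of this PR:

    package main

    import (
        "fmt"

        "github.com/coreos/etcd/raft"
    )

    func main() {
        // MaxSizePerMsg caps the byte size of a single append message, and
        // MaxInflightMsgs caps how many such messages may be unacknowledged,
        // so their product bounds the Raft log data in flight to a follower.
        cfg := raft.Config{
            ID:              1,
            ElectionTick:    10,
            HeartbeatTick:   1,
            Storage:         raft.NewMemoryStorage(),
            MaxSizePerMsg:   16 << 10, // 16 KB per message
            MaxInflightMsgs: 4,        // at most 4 unacknowledged messages
        }
        fmt.Printf("max in-flight log data per follower: %d bytes\n",
            cfg.MaxSizePerMsg*uint64(cfg.MaxInflightMsgs))
    }

With the values this PR settles on (16 KB and 4), the bound works out to 64 KB per follower, versus the previous 256 MB.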

@petermattis (Collaborator, Author)

I'm not sure what the downsides to doing this are. The upsides are pretty significant in testing: this completely eliminates long Replica.handleRaftReady operations in the scenario described in #10917.

It might be worth exploring whether there is another option. raft.RawNode.Advance() takes a raft.Ready, and we currently always pass back the raft.Ready we received from raft.RawNode.Ready(), but I'm not sure that is required. It looks like we could choose to execute only a prefix of raft.Ready.CommittedEntries, trimming off the suffix we didn't execute.

@petermattis (Collaborator, Author)

Ok, I looked into the alternate suggestion: it is raft.Ready.HardState.Commit that needs to be adjusted, rather than truncating raft.Ready.CommittedEntries. That does appear to work, though it feels a bit fragile. The nice bit about this approach is that we can precisely control how long we're willing to let Replica.handleRaftReady take. Specifically, the committed entries loop looks like:

	for i, e := range rd.CommittedEntries {
		// If we've already spent a full tick interval applying entries,
		// lower HardState.Commit to the last applied entry's index so the
		// remainder can be re-delivered later, and reschedule the range.
		if d := timeutil.Since(start); d >= r.store.cfg.RaftTickInterval {
			rd.HardState.Commit = rd.CommittedEntries[i-1].Index
			stats.reschedule = true
			break
		}
		... // apply entry e as before
	}

@bdarnell (Contributor)

First commit :lgtm:

Tweaking the Ready struct we pass to Advance is scary. However, raft doesn't care whether we've processed all committed entries before calling Advance. We could pass the committed entries off to another worker queue and free up the raft loop. The one caveat is that I think any config changes force synchronization here - if a batch includes EntryConfChanges, you must process those and call ApplyConfChange before Advance.
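A minimal sketch of that caveat, assuming the etcd/raft RawNode API; the helper name and error handling here are illustrative, not CockroachDB code:

    import (
        "github.com/coreos/etcd/raft"
        "github.com/coreos/etcd/raft/raftpb"
    )

    // applyConfChanges is a hypothetical helper: even if regular committed
    // entries are handed off to another worker, any conf changes in the
    // batch must be applied before calling rn.Advance(rd).
    func applyConfChanges(rn *raft.RawNode, rd raft.Ready) error {
        for _, ent := range rd.CommittedEntries {
            if ent.Type != raftpb.EntryConfChange {
                continue // regular entries could go to an async apply queue
            }
            var cc raftpb.ConfChange
            if err := cc.Unmarshal(ent.Data); err != nil {
                return err
            }
            rn.ApplyConfChange(cc) // must happen before rn.Advance(rd)
        }
        return nil
    }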


Reviewed 2 of 2 files at r1.
Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 2447 at r1 (raw file):

				return stats, err
			}
			if pErr := r.processRaftCommand(

Also increment stats.processed here.


pkg/storage/store.go, line 177 at r1 (raw file):

		// a follower without hearing a response. The total number of Raft log
		// entries is a combination of this setting and MaxSizePerMsg. The current
		// settings provide for up to 64 KB of raft log to be sent without

That's really going to hurt performance of hot ranges in geographically-distributed deployments. Maybe after this change we should figure out how to release the lock in between processing commands instead of limiting throughput so much.



@petermattis (Collaborator, Author)

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 2447 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…
> Also increment stats.processed here.

Done.

pkg/storage/store.go, line 177 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…
> That's really going to hurt performance of hot ranges in geographically-distributed deployments. Maybe after this change we should figure out how to release the lock in between processing commands instead of limiting throughput so much.
Releasing the lock between processing commands is problematic as we're still running on the Raft scheduler which guarantees only a single worker is processing a range. Tossing the execution of `CommittedEntries` to another worker is going to be mildly complicated as we need to ensure they are executed in order. Whacking `raft.Ready.HardState.Commit` as I do in the second commit is ugly, but it accomplishes that effect without significant complexity. Maybe we should see about an upstream change to raft to provide a better signal that only a fraction of the `CommittedEntries` have been executed.


@petermattis force-pushed the pmattis/tune-max-size-per-msg branch from 01b0fa6 to 89e8fe3 on November 22, 2016 15:04
@petermattis (Collaborator, Author)

Ok, I removed the second commit and I've also added env vars to make MaxSizePerMsg and MaxInflightMsgs configurable. I've defaulted the values to 16 KB and 4 respectively, matching what worked well in testing.
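A sketch of what that configuration might look like, assuming CockroachDB's envutil helpers; the environment-variable names and surrounding wiring are illustrative, not necessarily the exact code merged:

    import "github.com/cockroachdb/cockroach/pkg/util/envutil"

    var (
        // Defaults of 16 KB and 4 match what worked well in testing; both can
        // be overridden via the (assumed) environment variable names below.
        raftMaxSizePerMsg = envutil.EnvOrDefaultInt(
            "COCKROACH_RAFT_MAX_SIZE_PER_MSG", 16<<10)
        raftMaxInflightMsgs = envutil.EnvOrDefaultInt(
            "COCKROACH_RAFT_MAX_INFLIGHT_MSGS", 4)
    )

    // When building the replica's raft.Config:
    //   cfg.MaxSizePerMsg = uint64(raftMaxSizePerMsg)
    //   cfg.MaxInflightMsgs = raftMaxInflightMsgs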


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.



@petermattis merged commit 30e75cc into cockroachdb:master on Nov 22, 2016
@petermattis deleted the pmattis/tune-max-size-per-msg branch on November 22, 2016 15:43
@tamird (Contributor) commented on Nov 22, 2016

Reviewed 2 of 2 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/store.go, line 2977 at r2 (raw file):

	} else {
		// Force the replica to deal with this snapshot right now.
		if _, err := r.handleRaftReadyRaftMuLocked(inSnap); err != nil {

`stats` is discarded here because we believe it's always 1? Or do we just not care?



@petermattis (Collaborator, Author)

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/store.go, line 2977 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…
> `stats` is discarded here because we believe it's always 1? or do we just not care?
Both. It's always `1` and we don't log anything on this code path anyways.

