
storage: tune raft.Config.{MaxSizePerMsg,MaxInflightMsgs} #10929

Merged

Conversation

@petermattis (Collaborator) commented on Nov 22, 2016

The previous settings allowed up to 256 MB of Raft log entries to be
in flight to a follower, resulting in a single Replica.handleRaftReady
call processing thousands or tens of thousands of commands.

Log the number of commands processed when Replica.handleRaftReady takes
too long.

Fixes #10917


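For context, here is a minimal sketch of how these two knobs interact, assuming the etcd/raft Config fields of the time; the surrounding setup is illustrative and not part of this PR:

    package main

    import (
        "fmt"

        "github.com/coreos/etcd/raft"
    )

    func main() {
        // MaxSizePerMsg caps the byte size of a single append message, and
        // MaxInflightMsgs caps how many such messages may be unacknowledged,
        // so their product bounds the Raft log data in flight to a follower.
        cfg := raft.Config{
            ID:              1,
            ElectionTick:    10,
            HeartbeatTick:   1,
            Storage:         raft.NewMemoryStorage(),
            MaxSizePerMsg:   16 << 10, // 16 KB per message
            MaxInflightMsgs: 4,        // at most 4 unacknowledged messages
        }
        fmt.Printf("max in-flight log data per follower: %d bytes\n",
            cfg.MaxSizePerMsg*uint64(cfg.MaxInflightMsgs))
    }

With the values this PR settles on (16 KB and 4), the bound works out to 64 KB per follower, versus the previous 256 MB.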

@petermattis (Collaborator, Author)

I'm not sure what the downsides to doing this are. The upsides are pretty significant in testing: this completely eliminates long Replica.handleRaftReady operations in the scenario described in #10917.

It might be worth exploring whether there is another option. raft.RawNode.Advance() takes a raft.Ready, and we currently always pass back the raft.Ready we received from raft.RawNode.Ready(), but I'm not sure that is required. It looks like we could choose to execute only a prefix of raft.Ready.CommittedEntries, trimming off the suffix we didn't execute.

@petermattis (Collaborator, Author)

Ok, I looked into the alternate suggestion: it is raft.Ready.HardState.Commit that needs to be adjusted, rather than truncating raft.Ready.CommittedEntries. That does appear to work, though it feels a bit fragile. The nice bit about this approach is that we can precisely control how long we're willing to let Replica.handleRaftReady take. Specifically, the committed entries loop looks like:

	for i, e := range rd.CommittedEntries {
		// If we've already spent a full tick interval applying entries,
		// lower HardState.Commit to the last applied entry's index so the
		// remainder can be re-delivered later, and reschedule the range.
		if d := timeutil.Since(start); d >= r.store.cfg.RaftTickInterval {
			rd.HardState.Commit = rd.CommittedEntries[i-1].Index
			stats.reschedule = true
			break
		}
		... // apply entry e as before
	}

@bdarnell (Contributor)

First commit :lgtm:

Tweaking the Ready struct we pass to Advance is scary. However, raft doesn't care whether we've processed all committed entries before calling Advance. We could pass the committed entries off to another worker queue and free up the raft loop. The one caveat is that I think any config changes force synchronization here - if a batch includes EntryConfChanges, you must process those and call ApplyConfChange before Advance.
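A minimal sketch of that caveat, assuming the etcd/raft RawNode API; the helper name and error handling here are illustrative, not CockroachDB code:

    import (
        "github.com/coreos/etcd/raft"
        "github.com/coreos/etcd/raft/raftpb"
    )

    // applyConfChanges is a hypothetical helper: even if regular committed
    // entries are handed off to another worker, any conf changes in the
    // batch must be applied before calling rn.Advance(rd).
    func applyConfChanges(rn *raft.RawNode, rd raft.Ready) error {
        for _, ent := range rd.CommittedEntries {
            if ent.Type != raftpb.EntryConfChange {
                continue // regular entries could go to an async apply queue
            }
            var cc raftpb.ConfChange
            if err := cc.Unmarshal(ent.Data); err != nil {
                return err
            }
            rn.ApplyConfChange(cc) // must happen before rn.Advance(rd)
        }
        return nil
    }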


Reviewed 2 of 2 files at r1.
Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 2447 at r1 (raw file):

				return stats, err
			}
			if pErr := r.processRaftCommand(

Also increment stats.processed here.


pkg/storage/store.go, line 177 at r1 (raw file):

		// a follower without hearing a response. The total number of Raft log
		// entries is a combination of this setting and MaxSizePerMsg. The current
		// settings provide for up to 64 KB of raft log to be sent without

That's really going to hurt performance of hot ranges in geographically-distributed deployments. Maybe after this change we should figure out how to release the lock in between processing commands instead of limiting throughput so much.



@petermattis (Collaborator, Author)

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 2447 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…
> Also increment stats.processed here.

Done.

pkg/storage/store.go, line 177 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…
> That's really going to hurt performance of hot ranges in geographically-distributed deployments. Maybe after this change we should figure out how to release the lock in between processing commands instead of limiting throughput so much.
Releasing the lock between processing commands is problematic as we're still running on the Raft scheduler which guarantees only a single worker is processing a range. Tossing the execution of `CommittedEntries` to another worker is going to be mildly complicated as we need to ensure they are executed in order. Whacking `raft.Ready.HardState.Commit` as I do in the second commit is ugly, but it accomplishes that effect without significant complexity. Maybe we should see about an upstream change to raft to provide a better signal that only a fraction of the `CommittedEntries` have been executed.


@petermattis force-pushed the pmattis/tune-max-size-per-msg branch from 01b0fa6 to 89e8fe3 on November 22, 2016 15:04
@petermattis (Collaborator, Author)

Ok, I removed the second commit and I've also added env vars to make MaxSizePerMsg and MaxInflightMsgs configurable. I've defaulted the values to 16 KB and 4 respectively, matching what worked well in testing.
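A sketch of what that configuration might look like, assuming CockroachDB's envutil helpers; the environment-variable names and surrounding wiring are illustrative, not necessarily the exact code merged:

    import "github.com/cockroachdb/cockroach/pkg/util/envutil"

    var (
        // Defaults of 16 KB and 4 match what worked well in testing; both can
        // be overridden via the (assumed) environment variable names below.
        raftMaxSizePerMsg = envutil.EnvOrDefaultInt(
            "COCKROACH_RAFT_MAX_SIZE_PER_MSG", 16<<10)
        raftMaxInflightMsgs = envutil.EnvOrDefaultInt(
            "COCKROACH_RAFT_MAX_INFLIGHT_MSGS", 4)
    )

    // When building the replica's raft.Config:
    //   cfg.MaxSizePerMsg = uint64(raftMaxSizePerMsg)
    //   cfg.MaxInflightMsgs = raftMaxInflightMsgs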


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.



@petermattis merged commit 30e75cc into cockroachdb:master on Nov 22, 2016
@petermattis deleted the pmattis/tune-max-size-per-msg branch on November 22, 2016 15:43
@tamird (Contributor) commented on Nov 22, 2016

Reviewed 2 of 2 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/store.go, line 2977 at r2 (raw file):

	} else {
		// Force the replica to deal with this snapshot right now.
		if _, err := r.handleRaftReadyRaftMuLocked(inSnap); err != nil {

`stats` is discarded here because we believe it's always 1? Or do we just not care?



@petermattis (Collaborator, Author)

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/store.go, line 2977 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…
> `stats` is discarded here because we believe it's always 1? or do we just not care?
Both. It's always `1` and we don't log anything on this code path anyways.

