storage: examine raft log truncation heuristics #10917

Closed
petermattis opened this issue Nov 21, 2016 · 3 comments
Comments

@petermattis (Collaborator)

In experimentation on blue, I've noticed that the longer a node is down, the more instances I see of long lock hold warnings. The down times I experimented with were in the 30s-2m range, and the warnings are coming out of Replica.handleRaftReady. I also noticed that ranges were being caught up via log appends, not snapshots. The raft log truncation heuristics allow the raft log to grow up to the size of the Replica before it is truncated, but perhaps that isn't aggressive enough.
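For reference, that heuristic is roughly of this shape (a simplified sketch with made-up names, not the actual code):

package raftsketch // illustrative package name

// shouldTruncateRaftLog sketches the current heuristic: truncation only
// kicks in once the raft log is roughly as large as the replica's data,
// so a busy range can accumulate a lot of log before anything is truncated.
func shouldTruncateRaftLog(raftLogSize, replicaDataSize int64) bool {
	return raftLogSize > replicaDataSize
}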

A related item to investigate is raft.Config.MaxSizePerMsg. We currently have this set to 1MB. A bit of instrumentation on local load similar to blue shows that 1MB of raft log held ~26k entries. That translates to ~40 bytes per entry, which doesn't seem right. The problem with setting MaxSizePerMsg high is that it is the unit Raft uses to catch up a behind node via log appends. Processing ~26k entries in a single Raft ready message would be excessive. One thought that comes to mind is to limit the number of entries we return from Replica.Entries. We could either reinterpret MaxSizePerMsg (the maxBytes parameter) as a number of entries, or add a separate entry-count limit (though that would require care, as raft internally needs to retrieve all of the entries when a replica becomes the leader).
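Capping the entry count in Replica.Entries might look something like this (a sketch only; maxEntriesPerMsg is a made-up knob applied alongside the existing maxBytes cap, not the real implementation):

package raftsketch // illustrative

import "github.com/coreos/etcd/raft/raftpb"

// maxEntriesPerMsg is a made-up cap on entries per MsgApp.
const maxEntriesPerMsg = 1024

// limitEntries sketches the idea: apply the existing byte cap plus a
// hypothetical entry-count cap so a single MsgApp can't carry tens of
// thousands of commands. It always keeps at least one entry so an
// oversized entry can still make progress.
func limitEntries(ents []raftpb.Entry, maxBytes uint64) []raftpb.Entry {
	var size uint64
	for i := range ents {
		size += uint64(ents[i].Size())
		if i > 0 && (size > maxBytes || i >= maxEntriesPerMsg) {
			return ents[:i]
		}
	}
	return ents
}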

@bdarnell thoughts?

@petermattis (Collaborator, Author)

Ah, I messed up my instrumentation. The raft log entries are actually ~640 bytes per entry. Still, that can translate into ~1638 raft entries in a single MsgApp. If the MsgApp is catching a replica up, we might then perform 1638 operations while holding Replica.raftMu.
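(Just a back-of-envelope check of that number:)

package main

import "fmt"

func main() {
	const maxSizePerMsg = 1 << 20 // raft.Config.MaxSizePerMsg = 1MB
	const bytesPerEntry = 640     // observed average raft log entry size
	fmt.Println(maxSizePerMsg / bytesPerEntry) // 1048576 / 640 ≈ 1638 entries per MsgApp
}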

@petermattis (Collaborator, Author)

Btw, if you want to reproduce this locally:

~ killall -9 cockroach; rm -fr cockroach-data/ && roachdemo -n 3 -- --logtostderr
...
~ block_writer -tolerate-errors -min-block-bytes=8 -max-block-bytes=128
# let run for 60 secs
# kill node 3
# let run for another 20-30 secs
# restart node 3

@petermattis (Collaborator, Author)

With some instrumentation added to Replica.handleRaftReady, the above script produced the following:

W161121 15:43:09.394888 140 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.4s appended=1659 committed=1659 processed=1659
W161121 15:43:11.156264 140 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 1.8s appended=7538 committed=7536 processed=7536
W161121 15:43:13.097163 142 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 1.9s appended=1661 committed=1663 processed=1663
W161121 15:43:13.517186 142 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.4s appended=986 committed=977 processed=977
W161121 15:43:13.853351 134 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.3s appended=1653 committed=1662 processed=1662
W161121 15:43:14.600005 134 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.7s appended=2771 committed=2769 processed=2769
W161121 15:43:15.078689 133 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.5s appended=1663 committed=1663 processed=1663

appended indicates how many entries were appended to the local raft log. committed is the number of committed entries and processed is the number of raft commands we processed (usually equal to committed).

I think the above is pretty good evidence that at least some of our long raftMu hold times are due to these large MsgApps. Are these actually a problem, though? The Replica is behind, so perhaps it isn't a problem that we're holding Replica.raftMu for a while. The downside to holding Replica.raftMu for so long is that the replica won't respond to heartbeats and local requests to the replica will be blocked. Hmm, that might affect local lease proposals.
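For reference, the instrumentation is roughly of this shape (a sketch, not the actual diff; the type name and the 250ms threshold are illustrative):

package raftsketch // illustrative

import (
	"log"
	"time"
)

// handleRaftReadyStats mirrors the counters in the warnings above
// (the type name is made up).
type handleRaftReadyStats struct {
	appended, committed, processed int
}

// warnOnSlowRaftReady times a handleRaftReady invocation and logs the
// counters when it runs long, which is how the lines above were produced.
func warnOnSlowRaftReady(handleRaftReady func() handleRaftReadyStats) {
	start := time.Now()
	stats := handleRaftReady()
	if elapsed := time.Since(start); elapsed >= 250*time.Millisecond {
		log.Printf("handle raft ready: %.1fs appended=%d committed=%d processed=%d",
			elapsed.Seconds(), stats.appended, stats.committed, stats.processed)
	}
}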

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 22, 2016
The previous settings allowed up to 256 MB of Raft log entries to be inflight to a follower, resulting in a single Replica.handleRaftReady call processing thousands or tens of thousands of commands.

Log the number of commands processed when Replica.handleRaftReady takes
too long.

Fixes cockroachdb#10917
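For context, these are the knobs involved in etcd's raft package (a sketch; the 256-message inflight value is inferred from the 256 MB figure rather than stated above, and the commit's new values aren't shown here):

package main

import (
	"fmt"

	"github.com/coreos/etcd/raft"
)

func main() {
	// Sketch only: the two knobs that bound how much raft log data can be in
	// flight to a follower. The values below are the old ones implied by the
	// discussion (1 MB per MsgApp; 256 inflight messages inferred from the
	// 256 MB figure), not the values the commit switches to.
	cfg := raft.Config{
		MaxSizePerMsg:   1 << 20, // bytes of entries per MsgApp
		MaxInflightMsgs: 256,     // outstanding MsgApps per follower
	}
	fmt.Printf("max inflight log data: %d MB\n",
		cfg.MaxSizePerMsg*uint64(cfg.MaxInflightMsgs)/(1<<20))
}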