storage: examine raft log truncation heuristics #10917

Closed
petermattis opened this issue Nov 21, 2016 · 3 comments
Comments

@petermattis (Collaborator)

In experimentation on blue, I've noticed that the longer a node is down, the more instances I see of long lock hold warnings. The down times I experimented with were in the 30s-2m range, and the warnings are coming out of Replica.handleRaftReady. I also noticed that ranges were being caught up via log appends, not snapshots. The raft log truncation heuristics allow the raft log to grow up to the size of the Replica before it is truncated, but perhaps that isn't aggressive enough.
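For reference, that heuristic is roughly of this shape (a simplified sketch with made-up names, not the actual code):

package raftsketch // illustrative package name

// shouldTruncateRaftLog sketches the current heuristic: truncation only
// kicks in once the raft log is roughly as large as the replica's data,
// so a busy range can accumulate a lot of log before anything is truncated.
func shouldTruncateRaftLog(raftLogSize, replicaDataSize int64) bool {
	return raftLogSize > replicaDataSize
}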

A related item to investigate is raft.Config.MaxSizePerMsg. We currently have this set to 1MB. A bit of instrumentation on local load similar to blue shows that 1MB of raft log held ~26k entries. That translates to ~40 bytes per entry, which doesn't seem right. The problem with setting MaxSizePerMsg high is that it is the unit Raft uses to catch up a behind node via log appends. Processing ~26k entries in a single Raft ready message would be excessive. One thought that comes to mind is to limit the number of entries we return from Replica.Entries. We could either reinterpret MaxSizePerMsg (the maxBytes parameter) as a number of entries, or add a separate entry-count limit (though that would require care, as raft internally needs to retrieve all of the entries when a replica becomes the leader).
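Capping the entry count in Replica.Entries might look something like this (a sketch only; maxEntriesPerMsg is a made-up knob applied alongside the existing maxBytes cap, not the real implementation):

package raftsketch // illustrative

import "github.com/coreos/etcd/raft/raftpb"

// maxEntriesPerMsg is a made-up cap on entries per MsgApp.
const maxEntriesPerMsg = 1024

// limitEntries sketches the idea: apply the existing byte cap plus a
// hypothetical entry-count cap so a single MsgApp can't carry tens of
// thousands of commands. It always keeps at least one entry so an
// oversized entry can still make progress.
func limitEntries(ents []raftpb.Entry, maxBytes uint64) []raftpb.Entry {
	var size uint64
	for i := range ents {
		size += uint64(ents[i].Size())
		if i > 0 && (size > maxBytes || i >= maxEntriesPerMsg) {
			return ents[:i]
		}
	}
	return ents
}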

@bdarnell thoughts?

@petermattis (Collaborator, Author)

Ah, I messed up my instrumentation. The raft log entries are actually ~640 bytes per entry. Still, that can translate into ~1638 raft entries in a single MsgApp. If the MsgApp is catching a replica up, we might then perform 1638 operations while holding Replica.raftMu.
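(Just a back-of-envelope check of that number:)

package main

import "fmt"

func main() {
	const maxSizePerMsg = 1 << 20 // raft.Config.MaxSizePerMsg = 1MB
	const bytesPerEntry = 640     // observed average raft log entry size
	fmt.Println(maxSizePerMsg / bytesPerEntry) // 1048576 / 640 ≈ 1638 entries per MsgApp
}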

@petermattis (Collaborator, Author)

Btw, if you want to reproduce this locally:

~ killall -9 cockroach; rm -fr cockroach-data/ && roachdemo -n 3 -- --logtostderr
...
~ block_writer -tolerate-errors -min-block-bytes=8 -max-block-bytes=128
# let run for 60 secs
# kill node 3
# let run for another 20-30 secs
# restart node 3

@petermattis (Collaborator, Author)

With some instrumentation added to Replica.handleRaftReady, the above script produced the following:

W161121 15:43:09.394888 140 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.4s appended=1659 committed=1659 processed=1659
W161121 15:43:11.156264 140 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 1.8s appended=7538 committed=7536 processed=7536
W161121 15:43:13.097163 142 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 1.9s appended=1661 committed=1663 processed=1663
W161121 15:43:13.517186 142 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.4s appended=986 committed=977 processed=977
W161121 15:43:13.853351 134 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.3s appended=1653 committed=1662 processed=1662
W161121 15:43:14.600005 134 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.7s appended=2771 committed=2769 processed=2769
W161121 15:43:15.078689 133 storage/store.go:3218  [n3,s3,r7/2:/{Table/51-Max}] handle raft ready: 0.5s appended=1663 committed=1663 processed=1663

appended indicates how many entries were appended to the local raft log. committed is the number of committed entries and processed is the number of raft commands we processed (usually equal to committed).

I think the above is pretty good evidence that at least some of our long raftMu hold times are due to these large MsgApps. Are these actually a problem, though? The Replica is behind, so perhaps it isn't a problem that we're holding Replica.raftMu for a while. The downside to holding Replica.raftMu for so long is that the replica won't respond to heartbeats and local requests to the replica will be blocked. Hmm, that might affect local lease proposals.
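For reference, the instrumentation is roughly of this shape (a sketch, not the actual diff; the type name and the 250ms threshold are illustrative):

package raftsketch // illustrative

import (
	"log"
	"time"
)

// handleRaftReadyStats mirrors the counters in the warnings above
// (the type name is made up).
type handleRaftReadyStats struct {
	appended, committed, processed int
}

// warnOnSlowRaftReady times a handleRaftReady invocation and logs the
// counters when it runs long, which is how the lines above were produced.
func warnOnSlowRaftReady(handleRaftReady func() handleRaftReadyStats) {
	start := time.Now()
	stats := handleRaftReady()
	if elapsed := time.Since(start); elapsed >= 250*time.Millisecond {
		log.Printf("handle raft ready: %.1fs appended=%d committed=%d processed=%d",
			elapsed.Seconds(), stats.appended, stats.committed, stats.processed)
	}
}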

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 22, 2016
The previous settings allowed up to 256 MB of Raft log entries to be inflight to a follower, resulting in a single Replica.handleRaftReady call processing thousands or tens of thousands of commands.

Log the number of commands processed when Replica.handleRaftReady takes
too long.

Fixes cockroachdb#10917
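For context, these are the knobs involved in etcd's raft package (a sketch; the 256-message inflight value is inferred from the 256 MB figure rather than stated above, and the commit's new values aren't shown here):

package main

import (
	"fmt"

	"github.com/coreos/etcd/raft"
)

func main() {
	// Sketch only: the two knobs that bound how much raft log data can be in
	// flight to a follower. The values below are the old ones implied by the
	// discussion (1 MB per MsgApp; 256 inflight messages inferred from the
	// 256 MB figure), not the values the commit switches to.
	cfg := raft.Config{
		MaxSizePerMsg:   1 << 20, // bytes of entries per MsgApp
		MaxInflightMsgs: 256,     // outstanding MsgApps per follower
	}
	fmt.Printf("max inflight log data: %d MB\n",
		cfg.MaxSizePerMsg*uint64(cfg.MaxInflightMsgs)/(1<<20))
}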