Add more Raft metrics regarding log committal. #5488
Labels: theme/telemetry (anything related to telemetry or observability), type/enhancement (proposed improvement or new feature)
Currently there are a few relevant metrics that Consul will emit regarding performance of Raft log entry committal.
- `consul.raft.commitTime` - This is a pretty coarse-grained metric. It tracks the time from when a batch of logs was "dispatched" up until each individual log was ready to be submitted for FSM application. This also encompasses everything that `consul.raft.leader.dispatchLogs` does for the leader. Submitting a log to be applied to the FSM just sends it on a `chan`. That chan is hardcoded to be buffered to 128 log entries, and it will block and prevent processing more logs until we can submit the log into the chan.
- `consul.raft.fsm.apply` - This records how long it takes to apply each log entry in the Consul FSM.
- `consul.raft.leader.dispatchLogs` - This mostly just records how long it takes to store a batch of logs to disk. It encompasses a little more than just that, though. Mostly this is meant to record how long StoreLogs operations take. For Consul that is going to involve BoltDB.

One big issue with these metrics is that they can be misleading. For example, if Raft is committing logs in batches of 10 and it's taking 10ms to commit them (the `consul.raft.commitTime` metric), and then it starts taking 20ms to commit batches once the batch size increases to 100, it could lead an operator to believe that it is now taking longer to commit logs when in fact it's taking less. In the first scenario it is taking roughly 1ms per log to commit, and in the second scenario it's taking 0.2ms per log. So while efficiency has actually improved, the only metrics we have to go off look like things are taking a turn for the worse, because they do not reflect the batch sizes.

Another issue is that the metrics we do have are too coarse. One example of this is the `consul.raft.commitTime` metric. After logs are stored on disk it could take some time before we start to apply them to the FSM. We have no way of knowing whether things are slowing down due to FSM applies vs latency to start applying anything to the FSM.

Therefore I think we should add some new metrics:
- `consul.raft.leader.dispatchNumLogs` - This would be the batch size of logs being dispatched.
- `consul.raft.commitNumLogs` - This would be the number of logs processed for application to the FSM in a single round. This could be the same as `dispatchNumLogs`, but it is also possible that logs get dispatched multiple times before we start the FSM application process.
- `consul.raft.fsm.enqueue` - This would track the amount of time it takes to enqueue a batch of logs for FSM application.

With these new metrics we could:
All of this would need to be done within the hashicorp/raft repository and revendored into Consul.