Add more Raft metrics regarding log committal. #5488

Closed
mkeeler opened this issue Mar 14, 2019 · 0 comments

mkeeler commented Mar 14, 2019

Currently there are a few relevant metrics that Consul will emit regarding performance of Raft log entry committal.

  • consul.raft.commitTime - This is a pretty coarse-grained metric. It tracks the time from when a batch of logs was "dispatched" until each individual log was ready to be submitted for FSM application. This also encompasses everything that consul.raft.leader.dispatchLogs does for the leader. Submitting a log to be applied to the FSM just sends it on a chan. That chan is hardcoded to be buffered to 128 log entries, and the send will block and prevent processing more logs until the log can be submitted into the chan.
  • consul.raft.fsm.apply - This records how long it takes to apply each log entry in the Consul FSM.
  • consul.raft.leader.dispatchLogs - This mostly records how long it takes to store a batch of logs to disk, although it encompasses a little more than just that. It is mainly meant to record how long StoreLogs operations take. For Consul that is going to involve BoltDB.
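
To illustrate why a full FSM chan shows up in the commit timing, here is a minimal standalone sketch. It is not the hashicorp/raft code: `measureEnqueue`, `fsmBuffer`, and `consumerDelay` are names made up for this example, and the sleeping goroutine is only a stand-in for a busy FSM. It times how long it takes to send one more entry than the buffer holds while the consumer is slow to start draining.

```go
package main

import (
	"fmt"
	"time"
)

const (
	fsmBuffer     = 128                   // buffer size mentioned above
	consumerDelay = 50 * time.Millisecond // hypothetical FSM stall
)

// measureEnqueue sends n entries into a chan buffered to bufSize and returns
// how long the sends took. The consumer waits for delay before draining, so
// any send beyond the buffer capacity blocks until the consumer wakes up.
func measureEnqueue(bufSize, n int, delay time.Duration) time.Duration {
	ch := make(chan int, bufSize)
	start := time.Now()

	go func() {
		time.Sleep(delay) // stand-in for an FSM that is busy applying
		for range ch {
			// drain until the chan is closed
		}
	}()

	for i := 0; i < n; i++ {
		ch <- i // blocks once the buffer is full
	}
	elapsed := time.Since(start)
	close(ch)
	return elapsed
}

func main() {
	elapsed := measureEnqueue(fsmBuffer, fsmBuffer+1, consumerDelay)
	fmt.Printf("enqueueing %d entries took %v\n", fsmBuffer+1, elapsed)
}
```

The measured time is at least `consumerDelay` even though the sender did no real work, which is the same kind of wait that gets folded into a single coarse timer today.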

One big issue with these metrics is that they can be misleading. For example, suppose Raft is committing logs in batches of 10 and it is taking 10ms to commit each batch (the consul.raft.commitTime metric), and then the batch size increases to 100 and it starts taking 20ms per batch. That could lead an operator to believe that it is now taking longer to commit logs when in fact it is taking less time per log. In the first scenario it takes roughly 1ms per log to commit, and in the second scenario it takes 0.2ms per log. So while efficiency has actually improved, the only metrics we have to go on make it look like things are taking a turn for the worse, because they do not reflect the batch sizes.
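
The arithmetic in that example can be written out as a small sketch. `perLogCommitTime` is a hypothetical helper for illustration only; nothing like it exists in Raft or Consul today.

```go
package main

import (
	"fmt"
	"time"
)

// perLogCommitTime divides the commit time of a batch by the number of logs
// in that batch. It returns 0 for an empty batch to avoid dividing by zero.
func perLogCommitTime(batchTime time.Duration, numLogs int) time.Duration {
	if numLogs <= 0 {
		return 0
	}
	return batchTime / time.Duration(numLogs)
}

func main() {
	small := perLogCommitTime(10*time.Millisecond, 10)  // batches of 10
	large := perLogCommitTime(20*time.Millisecond, 100) // batches of 100
	fmt.Println(small, large)                           // prints: 1ms 200µs
}
```

The batch-level timer doubles (10ms to 20ms) while the per-log cost drops by a factor of five, which is exactly what the current metrics cannot show.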

Another issue is that the metrics we do have are too coarse. One example of this is the consul.raft.commitTime metric. After logs are stored on disk, it could take some time before we start to apply them to the FSM. We have no way of knowing whether things are slowing down due to the FSM applies themselves or due to latency before anything starts being applied to the FSM.

Therefore I think we should add some new metrics:

  • consul.raft.leader.dispatchNumLogs - This would be the batch size of logs being dispatched.
  • consul.raft.commitNumLogs - This would be the number of logs processed for application to the FSM in a single round. This could be the same as dispatchNumLogs, but it is also possible that logs get dispatched multiple times before we start the FSM application process.
  • consul.raft.fsm.enqueue - This would track the amount of time it takes to enqueue a batch of logs for FSM application.
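
As a rough sketch of where these samples could be recorded, assuming a simplified dispatch path: the `recorder` type below is a hypothetical in-memory stand-in for a metrics sink, `dispatchLogs` is not the real hashicorp/raft function, and the sketch treats one dispatch as one commit round, which the real code would not necessarily do.

```go
package main

import (
	"fmt"
	"time"
)

// recorder is a hypothetical in-memory stand-in for a metrics sink.
type recorder struct {
	samples map[string][]float32
}

func newRecorder() *recorder {
	return &recorder{samples: make(map[string][]float32)}
}

// AddSample stores one observation under key.
func (r *recorder) AddSample(key string, val float32) {
	r.samples[key] = append(r.samples[key], val)
}

// MeasureSince records the milliseconds elapsed since start under key.
func (r *recorder) MeasureSince(key string, start time.Time) {
	ms := float32(time.Since(start)) / float32(time.Millisecond)
	r.AddSample(key, ms)
}

// dispatchLogs is a simplified, assumed dispatch path: it records the batch
// size, enqueues every log for FSM application, and times the enqueue.
func dispatchLogs(r *recorder, batch []string, fsmCh chan<- string) {
	r.AddSample("consul.raft.leader.dispatchNumLogs", float32(len(batch)))

	start := time.Now()
	for _, l := range batch {
		fsmCh <- l // blocks while the FSM chan is full
	}
	r.MeasureSince("consul.raft.fsm.enqueue", start)

	// Simplification: treat this single dispatch as one commit round.
	r.AddSample("consul.raft.commitNumLogs", float32(len(batch)))
}

func main() {
	r := newRecorder()
	fsmCh := make(chan string, 128)
	dispatchLogs(r, []string{"a", "b", "c"}, fsmCh)
	fmt.Println(r.samples["consul.raft.leader.dispatchNumLogs"],
		r.samples["consul.raft.commitNumLogs"]) // prints: [3] [3]
}
```

Recording the batch size as a sample next to the existing timers is what would let the two be divided later, and timing the enqueue separately isolates the wait on the FSM chan.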

With these new metrics we could:

  • Calculate the average time it takes to commit a single log. (Using the commitTime and the commitNumLogs metrics)
  • Gain insight into the batching process.
  • Track any backup due to the FSM application process slowing down.

All of this would need to be done within the hashicorp/raft repository and revendored into Consul.

@pearkes pearkes added this to the 1.5.0 milestone Mar 22, 2019
@pearkes pearkes added type/enhancement Proposed improvement or new feature theme/telemetry Anything related to telemetry or observability labels Mar 22, 2019
@freddygv freddygv self-assigned this Apr 4, 2019
@freddygv freddygv closed this as completed Apr 9, 2019