kv,storage: request evaluation metrics #65414
Comments
This all sounds useful to me.
The one premature nit I have is that whatever we do should go through a structured interface, not through magic context info.
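A rough sketch of what such a structured interface could look like, purely illustrative (the type and method names here are made up, not an existing API):

```go
// Illustrative only: a stats sink passed explicitly through evaluation,
// instead of being smuggled in via context values.
package evalstats

// RequestStats is a hypothetical per-request accumulator filled in during
// evaluation.
type RequestStats struct {
	BytesRead    int64
	BytesWritten int64
	KeysIterated int64
}

// Recorder is the structured interface a caller hands to evaluation.
type Recorder interface {
	Record(method string, stats RequestStats)
}

// Evaluate shows the shape of the plumbing: the recorder is an explicit
// argument, so callers can see (and test) exactly what gets reported.
func Evaluate(rec Recorder, method string, eval func(*RequestStats) error) error {
	var s RequestStats
	err := eval(&s)
	rec.Record(method, s)
	return err
}
```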
I agree that request execution metrics would be useful, i.e. metrics collected roughly like this:

```go
// in batcheval.EvaluateFoo:
stats := somepool.Get().(*pebble.IterStats)
it := eng.NewIterator(pebble.IterOptions{Stats: stats}) // stats populated as the iterator is used
doSomething(it)
metrics.RecordStats(stats)
// reset and return the stats object to the pool
*stats = pebble.IterStats{}
somepool.Put(stats)
```

I'm not in principle opposed to using a
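For completeness, the pooling the snippet above assumes could look roughly like this; a minimal sketch with a local stand-in stats type, since the exact pebble type and option names may differ:

```go
// Minimal sketch of the pooling pattern above, with a local stand-in for the
// real pebble stats struct (assumption: the actual pebble type and field
// names differ from this illustration).
package evalstats

import "sync"

type iterStats struct {
	BlockBytes int64
	KeySteps   int64
}

var statsPool = sync.Pool{
	New: func() interface{} { return new(iterStats) },
}

// withPooledStats hands a zeroed stats object to the evaluation closure,
// copies the result out, and recycles the object.
func withPooledStats(evaluate func(*iterStats)) iterStats {
	stats := statsPool.Get().(*iterStats)
	evaluate(stats)
	result := *stats     // copy out before recycling
	*stats = iterStats{} // reset so the next Get sees zeroed counters
	statsPool.Put(stats)
	return result
}
```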
I think for the write path
Gave this a quick prototype here: #79031. I agree that all that's needed is some polish and maybe a bit of refactoring, especially for the read metrics. Since the writes are more important, we could do those as a first pass. The one possible complication here is the added overhead of the repeated
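If the concern is per-request bookkeeping cost, one cheap shape for the write-side counters is a fixed array indexed by request method rather than a map lookup per request. A hedged sketch, not what #79031 actually does:

```go
// Hedged sketch: per-method write counters kept in a fixed array indexed by a
// method enum, so recording is a single atomic add with no map lookup or
// allocation per request. Names are illustrative, not taken from #79031.
package evalstats

import "sync/atomic"

type Method int

const (
	Put Method = iota
	AddSSTable
	DeleteRange
	numMethods
)

type WriteMetrics struct {
	bytes [numMethods]int64
}

func (m *WriteMetrics) Record(method Method, n int64) {
	atomic.AddInt64(&m.bytes[method], n)
}

func (m *WriteMetrics) Bytes(method Method) int64 {
	return atomic.LoadInt64(&m.bytes[method])
}
```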
These metrics would also have been helpful for (internal) https://github.com/cockroachlabs/support/issues/1503 to determine why stores were filling up. We would've looked at the influx per range and would likely have noticed a particular TableID prefix receiving a large number of bytes in AddSSTable requests, which could then have been associated with a job active on that table. Even just seeing at first glance that lots of AddSSTables were being ingested by that node could have prompted us to pause all jobs, thereby addressing the immediate issue.

One question is whether we also ought to have below-raft metrics that make the influx of data observable at a per-range level. For example, if n1 is the leaseholder and proposes all of the ingestions, then n2 (a follower) wouldn't register anything on the metrics, since the ingestions reach it below raft. The prototype above touches on this in a TODO, but if we went as far as replicating the metrics, we could apply them on followers as well. This amounts to saying that read metrics would be evaluation-time (as they should be) but write metrics would be apply-time (or at least log-append-time, though implementing log-append metrics on followers is annoying, so we would probably sweep that difference under the rug).

An argument for eval-time is that a long uncommitted raft log can see sustained writes, but if nothing is ever applied, the metrics would give a false impression. The question remains which metrics would then be useful to pinpoint where the influx on a slow follower originates. We need to think about this a little more.
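To make the eval-time vs. apply-time distinction concrete, a hypothetical apply-time hook would look something like the following (not the prototype's code; the point is that every replica, leaseholder or follower, applies the command and therefore updates the counters):

```go
// Illustrative sketch: if write metrics are recorded at apply time, followers
// account for the bytes too, because every replica applies the same command.
package evalstats

import "sync/atomic"

// RangeWriteMetrics is a hypothetical per-range accumulator.
type RangeWriteMetrics struct {
	appliedBytes     int64
	sstIngestedBytes int64
}

// OnApply would be called from the apply loop on every replica, with sizes
// taken from the applied command (e.g. the AddSSTable payload size).
func (m *RangeWriteMetrics) OnApply(writeBytes, sstBytes int64) {
	atomic.AddInt64(&m.appliedBytes, writeBytes)
	atomic.AddInt64(&m.sstIngestedBytes, sstBytes)
}
```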
Also related to #71169, in the sense that it too asks for a more granular breakdown of the work KV does (though that issue cares about latency).
This came up in the context of a support issue, where it was unclear what was causing the write load on a node.
@aayushshah15 @andreimatei @tbg
gz#8437
Jira issue: CRDB-7596