Add batch latency and batch size metrics to Alternator dashboard #2380

tzach · 2024-08-28T06:33:56Z

See
scylladb/scylladb@390e016

The text was updated successfully, but these errors were encountered:

nyh · 2024-09-15T21:38:54Z

I think that by now we have 20 issues open on this feature ;-) But it seems that none of these issues spent more than a couple of words trying to say:

What these metrics might be named (their full "path" in the metric hierarchy) and why.
Where these metrics will be in the Grafana dashboard. @amnonh originally added them together with the other operations, so in addition to GetItem and BatchGetItem we also have a fake-operation BatchGetItemSize on the same graph - but @mykaul didn't like that so Amnon is removing it from the operations dashboard. But where will it move to?

@tzach I think you or @amnonh should have given these "requirements" questions some thought (and document your decisions or ideas in an issue) before going ahead and implementing.

amnonh · 2024-09-16T06:39:24Z

The suffix size refers to batch size.
I don't have a strong opinion about whether that's an op or not. One comment/issue we had was that when talking about batches, counting the number of batches does not reflect the number of operations, and in a sense, the new metrics that count how many items were in each batch are the actual value.

To be clean (as it appears in different issues), whatever the decision will be, it will not be in a new dashboard; it belongs to the Alternator dashboard.

I lean towards showing the number of batches and the ops in a batch one next to the other, but we can split that.

nyh · 2024-09-16T08:35:15Z

The suffix size refers to batch size.

Maybe it's clear in your mind, but to most readers it will be ambiguous - is the batch "size" its size in number of operations, or in number of bytes? This all discussion thread started when @mykaul saw this metric and wasn't sure what it was counting. He made the correct guess that it was the number of items in a batch, not their size, but it was impossible to get an answer (the documentation string of this metric doesn't explain).

I don't have a strong opinion about whether that's an op or not.

You just sent a patch to the monitoring repo to remove it from the "op" dashboard (I presume, but didn't see, you moved it to a different dashboard). This means you decided that it's not an "op". If this is the case, I asking you to fix this classification also in scylla.git - put this metric outside the "operations" tree - and don't just leave it wrong in scylla.git and "correct" it in the monitoring repo.

One comment/issue we had was that when talking about batches, counting the number of batches does not reflect the number of operations, and in a sense, the new metrics that count how many items were in each batch are the actual value.

Maybe, but you need to come up with consistent design for the dashboards and metric names, and good explanations for them. Right now, the op dashboard and "operations" metrics tree are all real DynamoDB API requests - CreateTable, UpdateItem, etc. This is what this dashboard counts. We can additionally count how many items a BatchGetItem read, how many bytes it read, etc., just as we might want to know how many tables a CreateTable really created (it can create one base table and 20 materialized views in one operation) - but these are not separate API operations and it's confusing to mix those up.

Note that the comment that this is confusing didn't come from me - it was @mykaul who noticed the new "BatchGetItemSize" on his dashboard and got confused and opened the issue.

To be clean (as it appears in different issues), whatever the decision will be, it will not be in a new dashboard; it belongs to the Alternator dashboard.

I don't know the terminology. "Alternator dashboard" is a big thing. I don't know how they call a single "graph" in the big "Alternator dashboard". What bothered @mykaul (and me) was that there was a graph labeled "operations" and it had real operations like UpdateItem, GetItem, CreateTable, and BatchGetItem - and also this unfamiliar BatchGetItemSize. I see in your monitoring patch you removed it from this graph. I said that if you removed it from the graph, you should have renamed the metric - it's not "fair" to leave the metric with the wrong name and then artificially exclude it from the graph despite its name saying it should be in that graph.

I lean towards showing the number of batches and the ops in a batch one next to the other, but we can split that.

You can do that graphically, but the "number of items in a batch" should still need a different explanation string - it's not "number of BatchGetItemSize operations" like the rest! Moreover, as I noted, BatchWriteItem actually has two different operations - a write and a delete. If you're already counting sub-operations, wouldn't you want to differentiate the two types?

amnonh · 2024-09-16T13:43:25Z

There are two issues: one is the metrics' names. I don't have a strong opinion about it and will be happy to change it to anything else that you find clearer. I've added the explanation why batch sizes where first labeled as an op, as it's good that when we are changing something, we add the arguments to any future reader. There's no point in arguing with the past, but it's good to explain it.
The second issue is the visual representation, which is what this repository is mostly about.

The suffix size refers to batch size.

Maybe it's clear in your mind, but to most readers it will be ambiguous - is the batch "size" its size in number of operations, or in number of bytes?

Again, any other name is fine with me.

Regardless of what will be the final metrics' names, I think it's better to show the batch-related panels together:
A panel with how many batches, a panel with how many items came from batches (which some would argue is the real graph users care about), and the average batch size (the ratio between the two).

I don't know the terminology. "Alternator dashboard" is a big thing. I don't know how they call a single "graph" in the big "Alternator dashboard".

The term Dashboard is the entire page.
A dashboard has sections, rows, and panels. (you can think of it as a short story, paragraph, sentences, and words)

Moreover, as I noted, BatchWriteItem actually has two different operations - a write and a delete. If you're already counting sub-operations, wouldn't you want to differentiate the two types?

It's a good idea to split the metric by ops types.

tzach added the enhancement New feature or request label Aug 28, 2024

amnonh added this to the Monitoring 4.8.1 milestone Sep 12, 2024

amnonh mentioned this issue Sep 12, 2024

alternator: Add BatchWriteItem,BatchGetItem #2394

Merged

amnonh closed this as completed in #2394 Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add batch latency and batch size metrics to Alternator dashboard #2380

Add batch latency and batch size metrics to Alternator dashboard #2380

tzach commented Aug 28, 2024

nyh commented Sep 15, 2024

amnonh commented Sep 16, 2024

nyh commented Sep 16, 2024

amnonh commented Sep 16, 2024

Add batch latency and batch size metrics to Alternator dashboard #2380

Add batch latency and batch size metrics to Alternator dashboard #2380

Comments

tzach commented Aug 28, 2024

nyh commented Sep 15, 2024

amnonh commented Sep 16, 2024

nyh commented Sep 16, 2024

amnonh commented Sep 16, 2024