Add batch latency and batch size metrics to Alternator dashboard #2380
Comments
I think that by now we have 20 issues open on this feature ;-) But it seems that none of these issues spent more than a couple of words trying to say:
@tzach I think you or @amnonh should have given these "requirements" questions some thought (and documented your decisions or ideas in an issue) before going ahead and implementing.
The suffix "size" refers to batch size. To be clear (as it appears in different issues), whatever the decision will be, it will not be in a new dashboard; it belongs to the Alternator dashboard. I lean towards showing the number of batches and the ops in a batch one next to the other, but we can split that.
Maybe it's clear in your mind, but to most readers it will be ambiguous - is the batch "size" its size in number of operations, or in number of bytes? This whole discussion thread started when @mykaul saw this metric and wasn't sure what it was counting. He made the correct guess that it was the number of items in a batch, not their size, but it was impossible to get an answer (the documentation string of this metric doesn't explain).
You just sent a patch to the monitoring repo to remove it from the "op" dashboard (I presume, but didn't see, you moved it to a different dashboard). This means you decided that it's not an "op". If this is the case, I am asking you to fix this classification in scylla.git as well - put this metric outside the "operations" tree - and don't just leave it wrong in scylla.git and "correct" it in the monitoring repo.
Maybe, but you need to come up with a consistent design for the dashboards and metric names, and good explanations for them. Right now, the op dashboard and the "operations" metrics tree are all real DynamoDB API requests - CreateTable, UpdateItem, etc. This is what this dashboard counts. We can additionally count how many items a BatchGetItem read, how many bytes it read, etc., just as we might want to know how many tables a CreateTable really created (it can create one base table and 20 materialized views in one operation) - but these are not separate API operations and it's confusing to mix them up. Note that the comment that this is confusing didn't come from me - it was @mykaul who noticed the new "BatchGetItemSize" on his dashboard, got confused, and opened the issue.
I don't know the terminology. "Alternator dashboard" is a big thing. I don't know what a single "graph" in the big "Alternator dashboard" is called. What bothered @mykaul (and me) was that there was a graph labeled "operations" and it had real operations like UpdateItem, GetItem, CreateTable, and BatchGetItem - and also this unfamiliar BatchGetItemSize. I see in your monitoring patch you removed it from this graph. I said that if you removed it from the graph, you should have renamed the metric - it's not "fair" to leave the metric with the wrong name and then artificially exclude it from the graph despite its name saying it should be in that graph.
You can do that graphically, but the "number of items in a batch" would still need a different explanation string - it's not "number of BatchGetItemSize operations" like the rest! Moreover, as I noted, BatchWriteItem actually has two different operations - a write and a delete. If you're already counting sub-operations, wouldn't you want to differentiate the two types?
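To make the distinction concrete, here is a minimal sketch of keeping API requests and batch sub-operations in separate counter trees. It is not Scylla's code: the class `AlternatorMetrics` and every metric name in it are hypothetical, chosen only for illustration. The request shapes follow the DynamoDB API (`RequestItems` maps a table name to a list of `PutRequest`/`DeleteRequest` entries for BatchWriteItem, or to an object with `Keys` for BatchGetItem).

```python
from collections import Counter


class AlternatorMetrics:
    """Hypothetical metrics registry; all names here are illustrative."""

    def __init__(self):
        # One increment per real DynamoDB API request (the "operations" tree).
        self.operations = Counter()
        # Sub-operations carried inside batch requests, counted in items.
        self.batch_items = Counter()

    def record_batch_write(self, request_items):
        """Count one BatchWriteItem request and its puts/deletes separately."""
        self.operations["BatchWriteItem"] += 1
        for writes in request_items.values():
            for write in writes:
                if "PutRequest" in write:
                    self.batch_items["batch_write_put_items"] += 1
                elif "DeleteRequest" in write:
                    self.batch_items["batch_write_delete_items"] += 1

    def record_batch_get(self, request_items):
        """Count one BatchGetItem request and the number of keys it asked for."""
        self.operations["BatchGetItem"] += 1
        for spec in request_items.values():
            self.batch_items["batch_get_items"] += len(spec.get("Keys", []))
```

With this split, a graph of `operations` shows only real API requests, while the item counts get their own description strings and can be shown next to it.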
There are two issues: one is the metrics' names. I don't have a strong opinion about it and will be happy to change it to anything else that you find clearer. I've added the explanation of why batch sizes were first labeled as an op, since when we change something it's good to record the arguments for any future reader. There's no point in arguing with the past, but it's good to explain it.
Again, any other name is fine with me. Regardless of what will be the final metrics' names, I think it's better to show the batch-related panels together:
The term "dashboard" refers to the entire page.
It's a good idea to split the metric by op type.
See
scylladb/scylladb@390e016