`synapse_storage_transaction_time_bucket` prometheus metric has too high cardinality due to `desc` label #11081
Comments
For reference, based on …
I'd be pretty happy to drop this metric altogether. It's not used on our default grafana dashboard.
While we're in the area, it might be worth noting that lots of prometheus metrics are duplicated: #11106. It's not quite a problem on the same scale as `synapse_storage_transaction_time_bucket`, though.
Yup, I noticed that too. Talking to @erikjohnston, he was happy for us to stop ingesting this metric altogether, which strongly suggests we can drop it from the code base.
This particular metric has much too high a cardinality because the `desc` label can have (at present) 248 values. This results in over 3k series per Synapse. If you have a Prometheus instance that monitors multiple Synapse instances, this adds a huge number of additional series to ingest. The metric in question is also not used in the Synapse dashboard, and the core team has indicated they're happy to drop this metric entirely.

Fixes #11081

Signed-off-by: Daniele Sluijters <daenney@users.noreply.github.com>
see #11124: turns out that …
I agree there's no reason for this to be a histogram; it can easily be a summary. Histograms really only make sense when you want to perform statistical aggregations over multiple instances of the same service, but I guess nobody really runs multiple Synapse servers.

But is there anything which actually, definitely, consumes this metric? Are there specific operational use cases which need this level of granularity? If those requirements can't be explicitly enumerated, then IMO the metric can safely be eliminated. "Nice to have" isn't by itself sufficient justification for a telemetry signal :)

edit: I understand there's a dashboard which uses this data, but does it really need the level of granularity currently on offer, or can it provide similar utility with a less precise label set?
we have found these graphs on our grafana dashboard. They are even discussed in https://matrix-org.github.io/synapse/latest/usage/administration/understanding_synapse_through_grafana_graphs.html#transaction-count-and-transaction-duration. I would not be in favour of removing them.
Sure, dashboards exist which consume them. But do those dashboards actually provide meaningful operational telemetry? The cardinality of this metric is extraordinarily high. It should deliver extraordinary value to justify that cost. Does it?
yes |
sorry, I just realised I missed a word in my previous comment. It should read: …
@richvdh Sweet! What problems have you been able to diagnose and fix? Which servlets have been optimized? Do you have links to issues? If there's data, I'm happy to withdraw my objections.
For my own edification: for each `desc`, the histogram exposes 15 `_bucket` series plus a `_sum` and a `_count`. So I count 17 time series per named database transaction. If we just want to track the amount of time spent in a given type of transaction (e.g. …), a single counter would do. If we want to track the mean time a single transaction instance takes, we'd need at least two counters (one for tracking total duration, and one for tracking how many txns occur).

Prometheus histograms classify events based on fixed bucket thresholds, whereas Prometheus summaries track the values of a given set of fixed percentiles. (Histogram: "how many requests take < 50ms?" Summary: "how long does it take to serve our quickest 99% of requests?") Or at least, that's how I'm interpreting https://prometheus.io/docs/concepts/metric_types/#histogram. A histogram that doesn't track any buckets and a summary that doesn't track any quantiles look the same to me. Additionally, the python client doesn't support tracking quantiles.
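For concreteness, here is a minimal sketch of that two-counter alternative using the python client, compared against a labelled histogram. The metric names, label values, and helper below are illustrative only, not Synapse's actual code:

```python
from prometheus_client import Counter, Histogram

# A labelled histogram like the current one: with the python client's
# 15 default buckets, every `desc` value produces 15 `_bucket` series
# plus `_sum` and `_count`, i.e. 17 series per transaction name.
txn_time = Histogram(
    "storage_transaction_time_example",
    "Time taken by database transactions (seconds)",
    labelnames=["desc"],
)

# The two-counter alternative: total time spent plus number of
# transactions, i.e. 2 series per `desc` value.
txn_duration_total = Counter(
    "storage_transaction_duration_seconds_example",
    "Total seconds spent in database transactions",
    labelnames=["desc"],
)
txn_count_total = Counter(
    "storage_transaction_count_example",
    "Number of database transactions",
    labelnames=["desc"],
)

def record_transaction(desc: str, seconds: float) -> None:
    # Mean duration per desc can then be derived in PromQL as
    # rate(..._duration_seconds_example_total[5m]) / rate(..._count_example_total[5m]),
    # which still aggregates cleanly across workers.
    txn_duration_total.labels(desc).inc(seconds)
    txn_count_total.labels(desc).inc()
```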
Action for this issue, then: …

If metrics are still eating up space in the future, we could …
Summaries have the large downside that you can't aggregate with them, so I'm not generally keen on us using them unless we have a very specific use case in mind. I think the other two options here are: …
What do you mean by "aggregate" here, @erikjohnston?
cf. https://prometheus.io/docs/practices/histograms/#quantiles

But basically, aiui you can't merge together quantiles from multiple instances, so e.g. you wouldn't be able to see the quantile for a specific …
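As a toy illustration of that point (pure Python, nothing to do with Synapse): the p99 of the merged observations is not the average of per-instance p99s, whereas histogram bucket counts can simply be summed across instances before estimating a quantile.

```python
def quantile(samples, q):
    """Nearest-rank quantile of a list of observed values."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

# Two hypothetical workers with different latency profiles (seconds).
worker_a = [0.01] * 99 + [0.50]   # mostly fast, one slow transaction
worker_b = [0.20] * 100           # uniformly slower

p99_a = quantile(worker_a, 0.99)                # 0.50
p99_b = quantile(worker_b, 0.99)                # 0.20
naive_merge = (p99_a + p99_b) / 2               # 0.35 -- not a real p99
true_p99 = quantile(worker_a + worker_b, 0.99)  # 0.20

print(naive_merge, true_p99)  # the "averaged" quantile is misleading
```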
Right, but …
By my reckoning, this should mean that each txn `desc` produces 2 time series, down from 17. If we don't like this, we could keep the histogram but configure it with coarser buckets. Closes #11081.
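For reference, the coarser-bucket fallback mentioned there could look something like the sketch below using the python client. The metric name and bucket edges are made up for illustration, not what Synapse would actually ship:

```python
from prometheus_client import Histogram

# Five buckets instead of the default 15: each `desc` value then costs
# 5 `_bucket` series plus `_sum` and `_count`, i.e. 7 series instead of 17.
coarse_txn_time = Histogram(
    "storage_transaction_time_coarse_example",
    "Time taken by database transactions (seconds), coarse buckets",
    labelnames=["desc"],
    buckets=(0.01, 0.1, 1.0, 10.0, float("inf")),
)

with coarse_txn_time.labels("example_txn").time():
    pass  # the timed database work would go here
```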
Ugh, right, I guess that is an easy hack to just keep the useful bits. But it'll suck if they ever decide to actually properly implement summaries.
Well yeah, but there's not much point in doing so if you're going to get dodgy answers out.
sorry, if our goal is just to get …
That's fair. I can make it two counters. (I only mention summary based on what E said in #11124 (comment))
Description
The `desc` label on `synapse_storage_transaction_time_bucket` results in a very high cardinality metric. For a single instance, there are 248 variants of `desc`, multiplied by 15 buckets. This results in over 3k series for a single host.

Though this might be acceptable if you're only ingesting metrics for a single instance, for Prometheus instances that might be scraping multiple Synapses this quickly becomes a problem.
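A quick way to gauge how many series this metric contributes is to ask Prometheus directly. The sketch below assumes a Prometheus instance reachable at localhost:9090; adjust the address for your setup:

```python
import json
import urllib.parse
import urllib.request

# Count all series for the offending metric via the Prometheus query API.
query = "count(synapse_storage_transaction_time_bucket)"
url = "http://localhost:9090/api/v1/query?" + urllib.parse.urlencode({"query": query})

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["data"]["result"]

print(result[0]["value"][1] if result else "no series found")
```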
On one of our internal clusters this is resulting in over a million time series, which is about 5x as many series as the next most problematic metric, `synapse_http_server_response_time_seconds_bucket`. Though this is not causing storage issues, it causes unnecessarily high CPU load and ballooning memory usage on the ingesters (so we have to run with much bigger instances), and querying these series becomes problematic.

Steps to reproduce
Version information