This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Drop synapse_storage_transaction_time_bucket #11124

Closed
wants to merge 1 commit into develop from daenney/fix-11081

Conversation

Contributor

@daenney daenney commented Oct 19, 2021

This particular metric has much too high a cardinality: the desc label
can have (at present) 248 values, which results in over 3k series per
Synapse. If you have a Prometheus instance that monitors multiple
Synapse instances, that adds up to a huge number of additional series
to ingest.

The metric in question is also not used in the Synapse dashboard and
the core team has indicated they're happy to drop this metric entirely.

Fixes #11081

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
  • Pull request includes a sign off
  • Code style is correct (run the linters)

@daenney daenney requested a review from a team as a code owner October 19, 2021 13:58
@daenney daenney force-pushed the daenney/fix-11081 branch 2 times, most recently from 686d94f to 1d64f2f on October 19, 2021 14:02
changelog.d/11124.removal (review thread outdated and resolved)
squahtx previously approved these changes Oct 19, 2021
Contributor

@squahtx squahtx left a comment


LGTM! One minor suggestion about the wording of the changelog.

@reivilibre
Contributor

Forgive me, for I am not a Prometheus expert, but are synapse_storage_transaction_time_count and synapse_storage_transaction_time_sum (which are mentioned in the Grafana dashboard JSON) not part of this metric that you wish to remove?

@squahtx squahtx self-requested a review October 19, 2021 15:36
@daenney
Contributor Author

daenney commented Oct 19, 2021

Forgive me, for I am not a Prometheus expert, but are synapse_storage_transaction_time_count and synapse_storage_transaction_time_sum (which are mentioned in the Grafana dashboard JSON) not part of this metric that you wish to remove?

They are 🤦. They're created implicitly by the Histogram type in the Prometheus client library.

I'd still really like to drop this metric because it's problematic, but I guess we'll have to ship a small update to the dashboard to go with it?
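
For anyone following along, here is a minimal sketch of why those series appear, using the Python prometheus_client library that Synapse uses. The metric declaration below is illustrative, not Synapse's actual code:

```python
from prometheus_client import Histogram, generate_latest

# Declaring a Histogram implicitly creates the _bucket, _count and _sum
# series (and, in recent client versions, a _created series) for every
# label combination that gets observed.
transaction_timer = Histogram(
    "synapse_storage_transaction_time",
    "Time taken by database transactions, by description",  # illustrative help text
    ["desc"],
)

transaction_timer.labels(desc="get_users").observe(0.005)

# The exposition output now contains, for this one desc value:
#   synapse_storage_transaction_time_bucket{desc="get_users",le="..."}  (one per bucket)
#   synapse_storage_transaction_time_count{desc="get_users"}
#   synapse_storage_transaction_time_sum{desc="get_users"}
print(generate_latest().decode("utf-8"))
```

With the client's default buckets (14 finite boundaries plus +Inf), that is 15 bucket series per desc value.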

@squahtx
Contributor

squahtx commented Oct 19, 2021

Forgive me, for I am not a Prometheus expert, but are synapse_storage_transaction_time_count and synapse_storage_transaction_time_sum (which are mentioned in the Grafana dashboard JSON) not part of this metric that you wish to remove?

Good spot! docs/metrics-howto.md also mentions the metric, and updating it could do with its own changelog entry.

@daenney
Contributor Author

daenney commented Oct 19, 2021

The alternative would be to find a way to reduce the cardinality of this metric to something more manageable. Right now 248 values multiplied by 15 buckets results in 3720 series per Synapse. Is there some other way we could slice synapse_storage_transaction_time_bucket so that we end up with <20 or so values for the desc label?

@erikjohnston
Member

The alternative would be to find a way to reduce the cardinality of this metric to something more manageable. Right now 248 values multiplied by 15 buckets results in 3720 series per Synapse. Is there some other way we could slice synapse_storage_transaction_time_bucket so that we end up with <20 or so values for the desc label?

I think we might be able to replace it with a Summary so that we only expose the _count and _sum series: https://github.com/prometheus/client_python#summary?
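
A rough sketch of that suggestion with the Python client (illustrative, not a drop-in patch for Synapse's code):

```python
from prometheus_client import Summary

# With no quantiles configured, a Summary exposes only the _count and _sum
# series per label combination, so the 15 per-bucket series disappear:
# 248 desc values go from 248 * 15 = 3720 bucket series to 248 * 2 = 496.
transaction_timer = Summary(
    "synapse_storage_transaction_time",
    "Time taken by database transactions, by description",
    ["desc"],
)

transaction_timer.labels(desc="get_users").observe(0.005)
```

In principle the _count and _sum series keep their names, so the dashboard queries over them would continue to work; whether Prometheus tolerates the metric type changing in place is the open question raised below.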

@erikjohnston
Member

The metric in question is also not used in the Synapse dashboard and the core team has indicated they're happy to drop this metric entirely.

I think we're certainly happy for Prometheus to be configured to drop these; I'm a little less sure about dropping it for everyone (though that's probably still fine). I am vaguely wondering if we want a config option to drop some of the less useful high-cardinality metrics, in the same way you might have info vs debug logs? That's probably more of a thing to think about if we want to drop other metrics that might still be useful for single-server deployments.
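
For operators who want to shed these series today, a sketch of dropping them at scrape time with metric_relabel_configs (the job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: synapse
    static_configs:
      - targets: ["synapse.example.com:9092"]
    metric_relabel_configs:
      # Drop only the per-bucket series; _count and _sum survive, which is
      # all the Synapse dashboard actually queries.
      - source_labels: [__name__]
        regex: synapse_storage_transaction_time_bucket
        action: drop
```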

@daenney
Contributor Author

daenney commented Oct 19, 2021

I think we might be able to replace it with a Summary so that we only expose the _count and _sum series: https://github.com/prometheus/client_python#summary?

That sounds like a good middle ground. Looking at the dashboard, we do indeed only use the _sum and _count and don't reference individual buckets anywhere. I'd need to double check Prometheus isn't going to get upset if we suddenly change the metric type though. If it does we might have to rename the series.

The metric in question is also not used in the Synapse dashboard and the core team has indicated they're happy to drop this metric entirely.

I think we're certainly happy for Prometheus to be configured to drop these; I'm a little less sure about dropping it for everyone (though that's probably still fine). I am vaguely wondering if we want a config option to drop some of the less useful high-cardinality metrics, in the same way you might have info vs debug logs? That's probably more of a thing to think about if we want to drop other metrics that might still be useful for single-server deployments.

It's certainly a possibility, though I've never seen debug levels affect the emitted metrics in other software. At that point it starts to feel more like tracing data to me, which might be a good angle for this if we do want to collect this kind of data at higher granularity.

@squahtx squahtx added the X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged label Nov 1, 2021
@reivilibre reivilibre dismissed squahtx’s stale review November 4, 2021 11:44

metric is not unused after all

@reivilibre reivilibre removed the request for review from squahtx November 4, 2021 11:45
@reivilibre
Contributor

Any thoughts on what's happening with this PR? (is it still wanted? is it going to be a configuration flag? ...)

@daenney
Contributor Author

daenney commented Jan 11, 2022

Yeah, we still very much need this. The timeseries aren't all that useful and they take up a decent amount of space. We're simply dropping them at ingestion time on the EMS side, but it would be nicer to not have to.

Haven't had the time yet to check if I can change it to a Summary type without consequence.

@daenney
Contributor Author

daenney commented Jan 11, 2022

Though I suppose we can close this one, since we'll take a different approach to solving it.

Labels
X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

synapse_storage_transaction_time_bucket prometheus metric has too high cardinality due to desc label
4 participants