This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Drop synapse_storage_transaction_time_bucket #11124

Closed
wants to merge 1 commit into develop from daenney/fix-11081

Conversation

Contributor

@daenney daenney commented Oct 19, 2021

This particular metric has much too high a cardinality: the desc label
can have (at present) 248 values, which results in over 3k series per
Synapse. If you have a Prometheus instance that monitors multiple
Synapse instances, that adds up to a huge number of additional series
to ingest.

The metric in question is also not used in the Synapse dashboard and
the core team has indicated they're happy to drop this metric entirely.

Fixes #11081

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
  • Pull request includes a sign off
  • Code style is correct (run the linters)

@daenney daenney requested a review from a team as a code owner October 19, 2021 13:58
@daenney daenney force-pushed the daenney/fix-11081 branch 2 times, most recently from 686d94f to 1d64f2f on October 19, 2021 14:02
changelog.d/11124.removal (review thread outdated and resolved)
squahtx previously approved these changes Oct 19, 2021
Contributor

@squahtx squahtx left a comment


LGTM! One minor suggestion about the wording of the changelog.

@reivilibre
Contributor

Forgive me, for I am not a Prometheus expert, but are synapse_storage_transaction_time_count and synapse_storage_transaction_time_sum (which are mentioned in the Grafana dashboard JSON) not part of this metric that you wish to remove?

@squahtx squahtx self-requested a review October 19, 2021 15:36
@daenney
Contributor Author

daenney commented Oct 19, 2021

Forgive me, for I am not a Prometheus expert, but are synapse_storage_transaction_time_count and synapse_storage_transaction_time_sum (which are mentioned in the Grafana dashboard JSON) not part of this metric that you wish to remove?

They are 🤦. They're created implicitly by the Histogram type in the Prometheus client library.

I'd still really like to drop this metric because it's problematic, but I guess we'll have to ship a small update to the dashboard to go with it?
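
For anyone following along, here is a minimal sketch of why those series appear, using the Python prometheus_client library that Synapse uses. The metric declaration below is illustrative, not Synapse's actual code:

```python
from prometheus_client import Histogram, generate_latest

# Declaring a Histogram implicitly creates the _bucket, _count and _sum
# series (and, in recent client versions, a _created series) for every
# label combination that gets observed.
transaction_timer = Histogram(
    "synapse_storage_transaction_time",
    "Time taken by database transactions, by description",  # illustrative help text
    ["desc"],
)

transaction_timer.labels(desc="get_users").observe(0.005)

# The exposition output now contains, for this one desc value:
#   synapse_storage_transaction_time_bucket{desc="get_users",le="..."}  (one per bucket)
#   synapse_storage_transaction_time_count{desc="get_users"}
#   synapse_storage_transaction_time_sum{desc="get_users"}
print(generate_latest().decode("utf-8"))
```

With the client's default buckets (14 finite boundaries plus +Inf), that is 15 bucket series per desc value.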

@squahtx
Contributor

squahtx commented Oct 19, 2021

Forgive me, for I am not a Prometheus expert, but are synapse_storage_transaction_time_count and synapse_storage_transaction_time_sum (which are mentioned in the Grafana dashboard JSON) not part of this metric that you wish to remove?

Good spot! docs/metrics-howto.md also mentions the metric, and updating it could do with its own changelog entry.

@daenney
Contributor Author

daenney commented Oct 19, 2021

The alternative would be to find a way to reduce the cardinality of this metric to something more manageable. Right now 248 values multiplied by 15 buckets results in 3720 series per Synapse. Is there some other way we could slice synapse_storage_transaction_time_bucket so that we end up with <20 or so values for the desc label?

@erikjohnston
Member

The alternative would be to find a way to reduce the cardinality of this metric to something more manageable. Right now 248 values multiplied by 15 buckets results in 3720 series per Synapse. Is there some other way we could slice synapse_storage_transaction_time_bucket so that we end up with <20 or so values for the desc label?

I think we might be able to replace it with a Summary so that we only expose the _count and _sum series: https://github.com/prometheus/client_python#summary?
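
A rough sketch of that suggestion with the Python client (illustrative, not a drop-in patch for Synapse's code):

```python
from prometheus_client import Summary

# With no quantiles configured, a Summary exposes only the _count and _sum
# series per label combination, so the 15 per-bucket series disappear:
# 248 desc values go from 248 * 15 = 3720 bucket series to 248 * 2 = 496.
transaction_timer = Summary(
    "synapse_storage_transaction_time",
    "Time taken by database transactions, by description",
    ["desc"],
)

transaction_timer.labels(desc="get_users").observe(0.005)
```

In principle the _count and _sum series keep their names, so the dashboard queries over them would continue to work; whether Prometheus tolerates the metric type changing in place is the open question raised below.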

@erikjohnston
Member

The metric in question is also not used in the Synapse dashboard and the core team has indicated they're happy to drop this metric entirely.

I think we're certainly happy for Prometheus to be configured to drop these; I'm a little less sure about dropping it for everyone (though that's probably still fine). I am vaguely wondering if we want a config option to drop some of the less useful high-cardinality metrics, in the same way you might have info vs debug logs? That's probably more of a thing to think about if we want to drop other metrics that might still be useful for single-server deployments.
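
For operators who want to shed these series today, a sketch of dropping them at scrape time with metric_relabel_configs (the job name and target are placeholders):

```yaml
scrape_configs:
  - job_name: synapse
    static_configs:
      - targets: ["synapse.example.com:9092"]
    metric_relabel_configs:
      # Drop only the per-bucket series; _count and _sum survive, which is
      # all the Synapse dashboard actually queries.
      - source_labels: [__name__]
        regex: synapse_storage_transaction_time_bucket
        action: drop
```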

@daenney
Contributor Author

daenney commented Oct 19, 2021

I think we might be able to replace it with a Summary so that we only expose the _count and _sum series: https://github.com/prometheus/client_python#summary?

That sounds like a good middle ground. Looking at the dashboard, we do indeed only use the _sum and _count and don't reference individual buckets anywhere. I'd need to double check Prometheus isn't going to get upset if we suddenly change the metric type though. If it does we might have to rename the series.

The metric in question is also not used in the Synapse dashboard and the core team has indicated they're happy to drop this metric entirely.

I think we're certainly happy for Prometheus to be configured to drop these; I'm a little less sure about dropping it for everyone (though that's probably still fine). I am vaguely wondering if we want a config option to drop some of the less useful high-cardinality metrics, in the same way you might have info vs debug logs? That's probably more of a thing to think about if we want to drop other metrics that might still be useful for single-server deployments.

It's certainly a possibility, though I've never seen debug levels affect the emitted metrics in other software. At that point it starts to feel more like tracing data to me, which might be a good angle for this if we do want to collect this kind of data at higher granularity.

@squahtx squahtx added the X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged label Nov 1, 2021
@reivilibre reivilibre dismissed squahtx’s stale review November 4, 2021 11:44

metric is not unused after all

@reivilibre reivilibre removed the request for review from squahtx November 4, 2021 11:45
@reivilibre
Contributor

Any thoughts on what's happening with this PR? (is it still wanted? is it going to be a configuration flag? ...)

@daenney
Contributor Author

daenney commented Jan 11, 2022

Yeah, we still very much need this. The timeseries aren't all that useful and they take up a decent amount of space. We're simply dropping them at ingestion time on the EMS side, but it would be nicer to not have to.

Haven't had the time yet to check if I can change it to a Summary type without consequence.

@daenney
Contributor Author

daenney commented Jan 11, 2022

Though I suppose we can close this one, since we'll take a different approach to solving it.

Labels
X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

synapse_storage_transaction_time_bucket prometheus metric has too high cardinality due to desc label
4 participants