Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Allow synapse_http_server_response_time_seconds Grafana histogram quantiles to show values bigger than 10s #13478

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/13478.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix the `synapse_http_server_response_time_seconds` metric not having buckets big enough for requests that take more than 10s.
22 changes: 22 additions & 0 deletions synapse/http/request_metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,28 @@
"synapse_http_server_response_time_seconds",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular insight we're hoping to gain by raising the 10 second cap?

I'm trying to optimize the slow /messages requests, #13356, specifically those that take more than 10s.

In order to track progress there, I'd like the metrics to capture them.

"sec",
["method", "servlet", "tag", "code"],
buckets=(
0.005,
0.01,
0.025,
0.05,
0.075,
0.1,
0.25,
0.5,
0.75,
1.0,
2.5,
5.0,
7.5,
10.0,
Copy link
Contributor Author

@MadLittleMods MadLittleMods Aug 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section matches the default buckets: https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544 (0.005 - 10.0)

I chose the default as a base because that is what it was using before. Do we want to tune these or eliminate any to reduce cardinality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding ~30% more buckets seems like a step in the wrong direction for #11082.

I've just noticed this comment. Perhaps we could drop the 0.075, 0.75 and 7.5 metrics? Then the remaining ones would be separated by roughly factors of two.

We'd still be growing the number of buckets by 2 in that case though. If we wanted to avoid growing the cardinality we'd have to pick 2 more to drop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could drop 200.0 as well since above 180 probably hit the timeout

30.0,
60.0,
120.0,
180.0,
200.0,
"+Inf",
),
Copy link
Contributor Author

@MadLittleMods MadLittleMods Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not jumping in on this sooner, but: the vast majority of our APIs are responding within the range of 0-10s, so losing fidelity there reduces our insights into response time. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms means we can't easily differentiate.

Since this a thing that we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the /messages API. This would also allow us to differentiate between local and remote /messages requests for example.

-- @erikjohnston, #13478 (comment)

I can create a separate PR to add a specific metric for /messages -> #13533

But do we have any interest in adjusting the buckets for the general case? @erikjohnston mentioned if anything maybe wanting even more fidelity in the lower ranges. @richvdh do you have any interest in increasing for another endpoint? Our limiting factor is cardinality since this multiplies out to all of our servlets.

I think @MadLittleMods is right in that the top bucket should be more than 10s given how often some of our endpoints take longer than that

-- @richvdh, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$CLJ5oioD_DO1A_zSGmYtCd-yToSyA6EiOwOsClvfdcs?via=matrix.org&via=element.io&via=beeper.com

https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1660574475300&to=1660596075300&viewPanel=47

Slow servlets shown in grafana

Slow KeyChangesServlets shown in grafana

Slow FederationStateIdsServlet shown in grafana


In terms of reducing cardinality, we could remove code. I think for timing, we really just need the method and servlet name. Response code can be useful but maybe we just need to change it to a successful_response boolean (with a cardinality of 2, [true|false]) since we only ever use it as code=~"2..". Or maybe more useful as error_response: true/false so that success or timeout can still be false while an actual error would be true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richvdh do you have any interest in increasing for another endpoint?

I'm not sure really. I guess? I don't feel like it's a huge priority to me right now.

In terms of reducing cardinality, we could remove code.

Probably, yes. but that's a conversation we should have over at #11082.

)

response_ru_utime = Counter(
Expand Down