Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s (#13478)
Conversation
synapse/http/request_metrics.py (Outdated)
0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
This section matches the default buckets (0.005 - 10.0): https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544

I chose the defaults as a base because that is what it was using before. Do we want to tune these or eliminate any to reduce cardinality?
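For context, here is a minimal sketch of how such a histogram is declared with `prometheus_client`, using the default-style buckets quoted above; the label names are illustrative and may not match Synapse's real `request_metrics.py` exactly:

```python
# A minimal sketch, assuming prometheus_client; label names are illustrative
# and may not match Synapse's actual request_metrics.py.
from prometheus_client import Histogram

response_timer = Histogram(
    "synapse_http_server_response_time_seconds",
    "Time taken to send a response to an HTTP request",
    labelnames=["method", "servlet", "code"],
    # Explicit buckets; prometheus_client appends "+Inf" automatically
    # if the last bound is not infinity.
    buckets=(0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5,
             0.75, 1.0, 2.5, 5.0, 7.5, 10.0),
)

# Every observation increments each cumulative bucket whose bound >= the value.
response_timer.labels(
    method="GET", servlet="RoomMessageListRestServlet", code="200"
).observe(0.42)
```

Each extra bucket multiplies the number of time series by the number of label combinations, which is why the bucket count matters for cardinality.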
Adding ~30% more buckets seems like a step in the wrong direction for #11082.
I've just noticed this comment. Perhaps we could drop the 0.075, 0.75 and 7.5 buckets? Then the remaining ones would be separated by roughly factors of two.
We'd still be growing the number of buckets by 2 in that case though. If we wanted to avoid growing the cardinality we'd have to pick 2 more to drop.
We could drop 200.0 as well, since anything above 180 probably hits the timeout.
`synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s

I'm not sure if we really want this. There have been complaints in the past about the number of buckets this metric exports; adding ~30% more buckets seems like a step in the wrong direction for #11082. Is there a particular insight we're hoping to gain by raising the 10 second cap?
@@ -43,6 +43,28 @@
"synapse_http_server_response_time_seconds",
Is there a particular insight we're hoping to gain by raising the 10 second cap?
I'm trying to optimize the slow `/messages` requests, #13356, specifically those that take more than 10s. In order to track progress there, I'd like the metrics to capture them.
LGTM!
We only add one extra bucket overall, which isn't too bad.
Hang on two secs, I'm a bit concerned by removing the .75 buckets (0.075, 0.75, 7.5) in terms of losing definition in the common cases.
cf. the comment about losing fidelity
120.0, 180.0, "+Inf", ),
Sorry for not jumping in on this sooner, but: the vast majority of our APIs are responding within the range of 0-10s, so losing fidelity there reduces our insights into response time. There is quite a big difference between APIs that return in 500ms and those that return in 1s, and removing the 750ms bucket means we can't easily differentiate.

Since this is a thing that we're adding specifically to measure progress in performance improvements for a particular API, I'm very tempted to suggest that we simply create a separate metric for the `/messages` API. This would also allow us to differentiate between local and remote `/messages` requests, for example.
I can create a separate PR to add a specific metric for `/messages` -> #13533
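Purely as an illustration of that "separate metric" idea, a sketch is below; the metric name, labels, and buckets are hypothetical and not necessarily what #13533 actually does:

```python
# Hypothetical sketch of a /messages-specific histogram; the name, labels and
# buckets are invented for illustration, not taken from #13533.
from prometheus_client import Histogram

messages_response_timer = Histogram(
    "synapse_http_server_messages_response_time_seconds",  # hypothetical name
    "Time taken to respond to /messages requests",
    labelnames=["locality"],  # e.g. "local" vs "remote", per the suggestion above
    buckets=(0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 30.0, 60.0, 120.0, 180.0),
)

# A remote /messages request that took 42 seconds:
messages_response_timer.labels(locality="remote").observe(42.0)
```

Because a metric like this exists for only one servlet, it can afford many more (and higher) buckets without the cardinality multiplying across every endpoint.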
But do we have any interest in adjusting the buckets for the general case? @erikjohnston mentioned, if anything, maybe wanting even more fidelity in the lower ranges. @richvdh, do you have any interest in increasing the cap for another endpoint? Our limiting factor is cardinality, since this multiplies out across all of our servlets.
I think @MadLittleMods is right in that the top bucket should be more than 10s given how often some of our endpoints take longer than that
-- @richvdh, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$CLJ5oioD_DO1A_zSGmYtCd-yToSyA6EiOwOsClvfdcs?via=matrix.org&via=element.io&via=beeper.com
In terms of reducing cardinality, we could remove `code`. I think for timing, we really just need the method and servlet name. Response code can be useful, but maybe we just need to change it to a `successful_response` boolean (with a cardinality of 2, `[true|false]`), since we only ever use it as `code=~"2.."`. Or maybe it would be more useful as `error_response: true/false`, so that success or timeout can still be `false` while an actual error would be `true`.
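As a rough illustration of that idea (the names and the success rule here are assumptions, not an agreed design), replacing the response-code label with a two-valued one might look like this:

```python
# Sketch only: swap the many-valued `code` label for a two-valued
# `error_response` label to cut cardinality; the mapping below is a guess.
from prometheus_client import Histogram

response_timer = Histogram(
    "synapse_http_server_response_time_seconds",
    "Time taken to send a response to an HTTP request",
    labelnames=["method", "servlet", "error_response"],
)

def observe_response(method: str, servlet: str, code: int, duration: float) -> None:
    # 2xx responses count as non-errors, mirroring the existing code=~"2.." usage;
    # a timeout could also be mapped to "false", per the suggestion above.
    error_response = "false" if 200 <= code < 300 else "true"
    response_timer.labels(
        method=method, servlet=servlet, error_response=error_response
    ).observe(duration)

observe_response("GET", "RoomMessageListRestServlet", 200, 0.42)
```

Back-of-the-envelope, with made-up counts: if roughly ten distinct response codes show up per method/servlet pair, collapsing them to two values divides the number of time series per bucket by about five.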
Allow `synapse_http_server_response_time_seconds` Grafana histogram quantiles to show values bigger than 10s. Part of #13356.
Before
Purple line: the >99% percentile has a false max ceiling of 10s because the values don't go above 10.
https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&var-datasource=default&var-bucket_size=%24__auto_interval_bucket_size&var-instance=matrix.org&var-job=synapse_client_reader&var-index=All&from=1660039325520&to=1660060925520&viewPanel=152
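To see why the estimate pins at 10s, here is a rough re-implementation of `histogram_quantile`'s interpolation (a sketch of the documented behaviour, not Prometheus source): when the requested quantile falls into the `+Inf` bucket, the estimate is clamped to the largest finite bucket bound.

```python
# Rough re-implementation of PromQL's histogram_quantile() interpolation to
# show why estimates pin at the largest finite bucket bound (10s here).
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), last bound = +Inf."""
    total = buckets[-1][1]
    rank = q * total
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            if upper == float("inf"):
                # Quantile lands in the +Inf bucket: return the upper bound of
                # the second-highest bucket, so nothing ever reads above it.
                return buckets[i - 1][0]
            lower, prev = buckets[i - 1] if i > 0 else (0.0, 0)
            # Linear interpolation inside the bucket.
            return lower + (upper - lower) * (rank - prev) / (cumulative - prev)

# 100 requests, 7 of which took longer than 10s:
counts = [(0.5, 40), (1.0, 60), (5.0, 85), (10.0, 93), (float("inf"), 100)]
print(histogram_quantile(0.99, counts))  # -> 10.0: the "false ceiling"
```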
After
I don't know if this actually fixes it (haven't tested).
Dev notes
Docs:
https://github.com/prometheus/client_python/blob/5a5261dd45d65914b5e3d8225b94d6e0578882f3/prometheus_client/metrics.py#L544
synapse_http_server_response_time_seconds_bucket
synapse_http_server_response_time_seconds_sum
synapse_http_server_response_time_seconds_count
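For reference, a quick way to see those three series locally (assuming `prometheus_client`; the output is the standard Prometheus exposition format):

```python
# Quick illustration of the three series a Histogram exports; the metric name
# mirrors the ones listed above, registered in a throwaway registry.
from prometheus_client import CollectorRegistry, Histogram, generate_latest

registry = CollectorRegistry()
timer = Histogram(
    "synapse_http_server_response_time_seconds",
    "sec",
    registry=registry,
    buckets=(0.5, 1.0, 10.0),
)
timer.observe(0.7)

# Prints *_bucket{le="..."} cumulative counters plus *_sum and *_count.
print(generate_latest(registry).decode())
```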