Add timeout to endpointset metric collector #7336
Conversation
We have seen deadlocks with endpoint discovery caused by the metric collector hanging and not releasing the store labels lock. This causes the endpoint update to hang, which also makes all endpoint readers hang on acquiring a read lock for the resolved endpoints slice. This commit makes sure the Collect method on the metrics collector has a built-in timeout to guard against cases where an upstream call never reads from the collection channel. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
09d688d to cad8f93 (compare)
@@ -277,7 +279,12 @@ func (c *endpointSetNodeCollector) Collect(ch chan<- prometheus.Metric) {
 				lbls = append(lbls, storeTypeStr)
 			}
 		}
-		ch <- prometheus.MustNewConstMetric(c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...)
+		select {
+		case ch <- prometheus.MustNewConstMetric(c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...):
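The hunk above is cut off before the timeout branch of the select. Below is a minimal, self-contained sketch of the overall pattern. Assumptions: the timeout is implemented with `time.After`, the 1-second duration is taken from the review discussion below, and `exampleCollector`, the metric name, and the label values are placeholders rather than the actual Thanos code.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// exampleCollector is a placeholder for endpointSetNodeCollector; only the
// send-with-timeout pattern mirrors the change in this PR.
type exampleCollector struct {
	connectionsDesc *prometheus.Desc
}

func (c *exampleCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.connectionsDesc
}

// Collect emits the gauge but gives up after a timeout instead of blocking
// forever if the caller never drains ch. The 1s value matches the duration
// discussed in the review; using time.After here is an assumption.
func (c *exampleCollector) Collect(ch chan<- prometheus.Metric) {
	occurrences := 3                  // placeholder count
	lbls := []string{"{}", "sidecar"} // placeholder label values

	select {
	case ch <- prometheus.MustNewConstMetric(
		c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...):
	case <-time.After(1 * time.Second):
		// Drop the sample rather than deadlock while upstream locks are held.
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&exampleCollector{
		connectionsDesc: prometheus.NewDesc(
			"example_endpoint_connections",
			"Illustrates the select-with-timeout send in Collect.",
			[]string{"external_labels", "store_type"}, nil,
		),
	})

	// Gather drains the channel, so the fast path of the select is taken.
	mfs, err := reg.Gather()
	if err != nil {
		log.Fatal(err)
	}
	for _, mf := range mfs {
		fmt.Println(mf.GetName())
	}
}
```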
Why can this take even 1 second? 🤔
I don't yet understand why the send would block forever here; the only explanation is that the caller does not read from the channel. The 1s timeout is arbitrary, though; it's there to make sure we do not end up in a deadlock.
We noticed queriers getting stuck occasionally, and the goroutine profile showed that the mutex in this function was constantly locked.
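For reference, a minimal sketch of pulling such a goroutine dump from a running querier. The `localhost:10902` address is an assumption about the deployment (Thanos serves Go's pprof handlers on its HTTP port); adjust it to your setup.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns full stack traces for every goroutine, which is what
	// reveals goroutines parked on the collector's mutex.
	resp, err := http.Get("http://localhost:10902/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```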
I think something else might be at play here, e.g., this goroutine is starved of CPU resources. Do you have a lot of goroutines running when this happens?
It looks like a bug in the Prometheus client if this is a new thing. The fix is practical though; let's merge and debug what happens later.
I will merge this for now and we can dig into the root cause once we get some time.
We have seen deadlocks with endpoint discovery caused by the metric collector hanging and not releasing the store labels lock. This causes the endpoint update to hang, which also makes all endpoint readers hang on acquiring a read lock for the resolved endpoints slice.
This commit makes sure the Collect method on the metrics collector has a built-in timeout to guard against cases where an upstream call never reads from the collection channel.
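To make the failure chain concrete, here is a generic, runnable illustration (not the actual Thanos code) that collapses the store-labels and resolved-endpoints locks into a single `sync.RWMutex`: a collector goroutine blocks on an unread channel while holding a read lock, the update goroutine then queues on the write lock, and every later reader blocks behind the queued writer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	ch := make(chan int) // nobody ever reads from this channel

	// "Collector": holds the read lock and blocks forever on the send.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		ch <- 1 // blocks: no receiver
	}()

	time.Sleep(100 * time.Millisecond)

	// "Endpoint update": waits for the write lock, which never becomes free.
	go func() {
		mu.Lock()
		defer mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond)

	// "Endpoint reader": also blocks, because Go's RWMutex makes new readers
	// wait once a writer is queued.
	done := make(chan struct{})
	go func() {
		mu.RLock()
		mu.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("reader acquired the lock")
	case <-time.After(time.Second):
		fmt.Println("reader is stuck behind the queued writer: deadlock chain reproduced")
	}
}
```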
Changes
Verification