Add timeout to endpointset metric collector #7336
Conversation
We have seen deadlocks with endpoint discovery caused by the metric collector hanging and not releasing the store labels lock. This causes the endpoint update to hang, which also makes all endpoint readers hang on acquiring a read lock for the resolved endpoints slice. This commit makes sure the Collect method on the metrics collector has a built-in timeout to guard against cases where an upstream call never reads from the collection channel. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
09d688d to cad8f93 (compare)
@@ -277,7 +279,12 @@ func (c *endpointSetNodeCollector) Collect(ch chan<- prometheus.Metric) {
 				lbls = append(lbls, storeTypeStr)
 			}
 		}
-		ch <- prometheus.MustNewConstMetric(c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...)
+		select {
+		case ch <- prometheus.MustNewConstMetric(c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...):
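The hunk above is cut off before the timeout branch of the select. Below is a minimal, self-contained sketch of the overall pattern. Assumptions: the timeout is implemented with `time.After`, the 1-second duration is taken from the review discussion below, and `exampleCollector`, the metric name, and the label values are placeholders rather than the actual Thanos code.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// exampleCollector is a placeholder for endpointSetNodeCollector; only the
// send-with-timeout pattern mirrors the change in this PR.
type exampleCollector struct {
	connectionsDesc *prometheus.Desc
}

func (c *exampleCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.connectionsDesc
}

// Collect emits the gauge but gives up after a timeout instead of blocking
// forever if the caller never drains ch. The 1s value matches the duration
// discussed in the review; using time.After here is an assumption.
func (c *exampleCollector) Collect(ch chan<- prometheus.Metric) {
	occurrences := 3                  // placeholder count
	lbls := []string{"{}", "sidecar"} // placeholder label values

	select {
	case ch <- prometheus.MustNewConstMetric(
		c.connectionsDesc, prometheus.GaugeValue, float64(occurrences), lbls...):
	case <-time.After(1 * time.Second):
		// Drop the sample rather than deadlock while upstream locks are held.
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&exampleCollector{
		connectionsDesc: prometheus.NewDesc(
			"example_endpoint_connections",
			"Illustrates the select-with-timeout send in Collect.",
			[]string{"external_labels", "store_type"}, nil,
		),
	})

	// Gather drains the channel, so the fast path of the select is taken.
	mfs, err := reg.Gather()
	if err != nil {
		log.Fatal(err)
	}
	for _, mf := range mfs {
		fmt.Println(mf.GetName())
	}
}
```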
Why can this take even 1 second? 🤔
I don't yet understand why the send would block forever here; the only explanation is that the caller does not read from the channel. The 1s timeout is arbitrary, though; it's there to make sure we do not end up in a deadlock.
We noticed queriers getting stuck occasionally, and the goroutine profile showed that the mutex in this function was constantly locked.
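For reference, a minimal sketch of pulling such a goroutine dump from a running querier. The `localhost:10902` address is an assumption about the deployment (Thanos serves Go's pprof handlers on its HTTP port); adjust it to your setup.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns full stack traces for every goroutine, which is what
	// reveals goroutines parked on the collector's mutex.
	resp, err := http.Get("http://localhost:10902/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```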
I think something else might be at play here, e.g., this goroutine is starved of CPU resources. Do you have a lot of goroutines running when this happens?
It looks like a bug in the Prometheus client if this is a new thing. The fix is practical though; let's merge and debug what happens later.
I will merge this for now and we can dig into the root cause once we get some time.
We have seen deadlocks with endpoint discovery caused by the metric collector hanging and not releasing the store labels lock. This causes the endpoint update to hang, which also makes all endpoint readers hang on acquiring a read lock for the resolved endpoints slice.
This commit makes sure the Collect method on the metrics collector has a built-in timeout to guard against cases where an upstream call never reads from the collection channel.
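To make the failure chain concrete, here is a generic, runnable illustration (not the actual Thanos code) that collapses the store-labels and resolved-endpoints locks into a single `sync.RWMutex`: a collector goroutine blocks on an unread channel while holding a read lock, the update goroutine then queues on the write lock, and every later reader blocks behind the queued writer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	ch := make(chan int) // nobody ever reads from this channel

	// "Collector": holds the read lock and blocks forever on the send.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		ch <- 1 // blocks: no receiver
	}()

	time.Sleep(100 * time.Millisecond)

	// "Endpoint update": waits for the write lock, which never becomes free.
	go func() {
		mu.Lock()
		defer mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond)

	// "Endpoint reader": also blocks, because Go's RWMutex makes new readers
	// wait once a writer is queued.
	done := make(chan struct{})
	go func() {
		mu.RLock()
		mu.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("reader acquired the lock")
	case <-time.After(time.Second):
		fmt.Println("reader is stuck behind the queued writer: deadlock chain reproduced")
	}
}
```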
Changes
Verification