Ready state handling of sidecar #1677

Closed
jabbrwcky opened this issue Oct 23, 2019 · 7 comments

@jabbrwcky
Contributor

jabbrwcky commented Oct 23, 2019

We experienced some bumps upgrading to Thanos 0.8.1 (Docker image).

Looking into the sidecar, I noticed that its ready state is only set during the initial loading of the Prometheus external labels:

err := runutil.Retry(2*time.Second, ctx.Done(), func() error {
    if err := m.UpdateLabels(ctx, logger); err != nil {
        level.Warn(logger).Log(
            "msg", "failed to fetch initial external labels. Is Prometheus running? Retrying",
            "err", err,
        )
        promUp.Set(0)
        statusProber.SetNotReady(err)
        return err
    }
    level.Info(logger).Log(
        "msg", "successfully loaded prometheus external labels",
        "external_labels", m.Labels().String(),
    )
    promUp.Set(1)
    statusProber.SetReady()
    lastHeartbeat.Set(float64(time.Now().UnixNano()) / 1e9)
    return nil
})

The recurring check updates the 'prometheus_up' metric, but not the sidecar's ready state:

return runutil.Repeat(30*time.Second, ctx.Done(), func() error {
    iterCtx, iterCancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer iterCancel()
    if err := m.UpdateLabels(iterCtx, logger); err != nil {
        level.Warn(logger).Log("msg", "heartbeat failed", "err", err)
        promUp.Set(0)
    } else {
        promUp.Set(1)
        lastHeartbeat.Set(float64(time.Now().UnixNano()) / 1e9)
    }
    return nil
})

Is this intentional?

I assume that when Prometheus is considered unhealthy or not ready, the sidecar should report the same.

Please correct me if I am wrong.
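
For illustration, a minimal sketch of what I would have expected the heartbeat loop to do, assuming the same statusProber used during startup is in scope here:

return runutil.Repeat(30*time.Second, ctx.Done(), func() error {
    iterCtx, iterCancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer iterCancel()
    if err := m.UpdateLabels(iterCtx, logger); err != nil {
        level.Warn(logger).Log("msg", "heartbeat failed", "err", err)
        promUp.Set(0)
        // Also flip the readiness probe, not just the metric, so the
        // querier stops treating this sidecar as a usable source.
        statusProber.SetNotReady(err)
    } else {
        promUp.Set(1)
        // Recover readiness once Prometheus responds again.
        statusProber.SetReady()
        lastHeartbeat.Set(float64(time.Now().UnixNano()) / 1e9)
    }
    return nil
})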

@kakkoyun
Member

Hey @jabbrwcky, I have worked on related issues; it looks like I missed this part. I'll have a look at it.

@povilasv
Member

I think this was intentional, see #1395 (comment); we had a discussion about it on Slack: https://cloud-native.slack.com/archives/CL25937SP/p1565945595078500

@bwplotka
Member

Yes, readiness is only set once, during the initial startup. Blips afterwards do not change the probes, as there is not much the sidecar can do about them; how would a process/container restart help in this case? (That is what failing the healthiness probe would trigger.)

Also, I don't see a difference in this logic vs. pre-0.8.1. What exactly would you expect here? (:
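
For context, this readiness/healthiness split maps onto the two probes a deployment wires up. A sketch for a sidecar container in Kubernetes (the port, timings, and /-/healthy and /-/ready paths are the usual defaults, shown here as assumptions):

# livenessProbe failures restart the container; readinessProbe failures
# only remove the pod from Service endpoints while it is not ready.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /-/ready
    port: 10902
  periodSeconds: 5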

@jabbrwcky
Contributor Author

We experienced complete failures of queries on the querier because one Prometheus instance stopped responding to HTTP requests (i/o timeouts) and the querier still included that Prometheus/sidecar as a source.

I would expect that the querier would return a partial result in such a case. Do I understand this correctly?

@bwplotka
Member

That's correct. This behaviour can be controlled by query.partial-response. Do you maybe have it set to false right now?
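
For reference, enabling it on the querier is just the flag; a sketch of an invocation (the store address here is a placeholder, and the Kingpin-style --no-query.partial-response form disables it):

thanos query \
    --http-address=0.0.0.0:10902 \
    --store=prometheus-sidecar:10901 \
    --query.partial-response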

@jabbrwcky
Contributor Author

It should be true. I cannot reproduce it right now because the Prometheus instance in question is currently behaving.

So if query.partial-response should take care of this, I'll verify that we have it set to true in our configs.

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this as completed Jan 18, 2020