-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
store: Start metric and status probe HTTP server as earlier as possible #1656
Conversation
2611d22
to
816c2fc
Compare
Looking at the code changes, it seems to me that the only thing that has changed is the lexicographic ordering of the function calls but that the server is still started at exactly the same time. Note that the HTTP servers are assigned to the run group and that they only actually start serving requests when the run group itself is started. |
@squat Thanks for pointing it. I somehow missed that. I'll have another hard look at it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestions (:
Wonder if we can also add more logs to the startup as well, since you are touching this part as per: #1655 |
@bwplotka Of course, I can I'll check the related conversation. |
4fc298f
to
5bb1f5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand really why we move around all of this scheduleHTTP servers but happy with that if there is a reason (:
2 comments, otherwise LGTM! Thanks.
I think we really need to advertise now to NOT add readiness probe on metric. Users do that and now they will have false impression store that store is ready, but it actually still just loading the blocks...
Wonder how to communicate that ):
I will add more logging to initial sync in separare PR (:
5bb1f5f
to
0c784c0
Compare
@bwplotka @squat I removed unnecessary changes, fixed emphasized issues, updated CHANGELOG and store documentation. Documentation definitely needs some help, I'm open to suggestions.
I couldn't get to this one yet, however it would probably better since you have more context on the component, thanks a lot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 more nits and LGTM!
c731615
to
7a1453f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 great, thanks for doing this. I remember discussing the speedup of HTTP server startup with store but it eventually did not happen. Not sure why exactly.
Agree with @bwplotka on those two comments otherwise LGTM!
b32a709
to
e88291a
Compare
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Co-Authored-By: Martin Chodur <m.chodur@seznam.cz> Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
e88291a
to
7004a30
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🌮
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Finally we have metrics on Thanos startup!
Let's make sure we have proper logging and metrics on sync - happy to attack this on Monday.
Good work @kakkoyun thanks! ❤️ |
…le (#1656) * Start metric and status probe server as soon as possible Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Update changelog Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Schedule a separate goroutine to start server Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Add InitSync to the rungroup Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Fix linter pointed issues Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Move InitSync to alreay existed run.Group Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Remove unnecessary changes and update CHANGELOG Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Add simple explanation for probes Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Make requested changes Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> * Update CHANGELOG.md Co-Authored-By: Martin Chodur <m.chodur@seznam.cz> Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
As a result of a couple of issues that we had by running Thanos Store with liveness probes on Kubernetes, this PR attempts to fix the liveness probe issues by moving them earlier in the start-up sequence. Now, metrics and status probe HTTP server starts as earlier as possible and returns success from
/-/healthy
endpoint and sets its status ready when everything else properly bootstrapped.A good read on the subject: https://srcco.de/posts/kubernetes-liveness-probes-are-dangerous.html
Changes
Verification
make local-test
MINIO_ENABLED=1 ./scripts/quickstart.sh
while debug log enabled, and observed start sequence.Thanos Store start sequence logs: