Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add initial delay seconds to brig liveness and readiness probes #2878

Merged
merged 3 commits into from
Dec 13, 2022

Conversation

amitsagtani97
Copy link
Contributor

Ticket - https://wearezeta.atlassian.net/browse/SQPIT-491

Brig pods continuously restarts in case of liveness and readiness probes failing on the startup in the low compute resource environment.

To reproduce -

  • reduce the memory and cpu resource limits in charts/brig/values.yaml and deploy the chart.
  • Components of brig will take time to start which is greater than the wait time for before running first liveness probe, hence the pod is marked as failed and restarted again and again.

Added a initial delay before running first liveness and readiness probe, which allows all the brig containers to start on a low resource environment before running the probes.

Checklist

  • Add a new entry in an appropriate subdirectory of changelog.d
  • Read and follow the PR guidelines

@amitsagtani97 amitsagtani97 temporarily deployed to cachix November 29, 2022 08:18 Inactive
@amitsagtani97 amitsagtani97 temporarily deployed to cachix November 29, 2022 08:18 Inactive
@zebot zebot added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Nov 29, 2022
charts/brig/templates/deployment.yaml Outdated Show resolved Hide resolved
@@ -138,11 +138,13 @@ spec:
scheme: HTTP
path: /i/status
port: {{ .Values.service.internalPort }}
initialDelaySeconds: 30
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your problem would be better solved with a startup probe, see also https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

startupProbe:
  httpGet:
   ...
   failureThreshold: 6
   periodSeconds: 5

That would wait for 30 seconds before moving over to the liveness probe which restarts brig.

Overall, if /i/status fails, however, then this begs the question if the installation works correctly otherwise? If brig doesn't have enough resources likely users will not have adequate latencies either.

Also, could it be that something else regarding networking is not working as it should, rather than this just being a resource problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, yes the startupProbe make more sense here, added it.

Yeah, there can be issues due to high network latencies as well, there is a argument "timeoutSeconds", which has default value of 1 second. Maybe, we can increase that as well for these probes.

@amitsagtani97 amitsagtani97 changed the title add initial delay secondes to brig liveness and readiness probes add initial delay seconds to brig liveness and readiness probes Dec 7, 2022
@amitsagtani97 amitsagtani97 temporarily deployed to cachix December 7, 2022 08:26 — with GitHub Actions Inactive
@amitsagtani97 amitsagtani97 temporarily deployed to cachix December 7, 2022 08:26 — with GitHub Actions Inactive
@jschaul
Copy link
Member

jschaul commented Dec 13, 2022

@amitsagtani97 PR looks good now; could you add one line to a changelog file (maybe under internal) then this PR is good to be (squash-) merged.

@amitsagtani97 amitsagtani97 temporarily deployed to cachix December 13, 2022 11:39 — with GitHub Actions Inactive
@amitsagtani97 amitsagtani97 temporarily deployed to cachix December 13, 2022 11:39 — with GitHub Actions Inactive
@amitsagtani97 amitsagtani97 merged commit b025d13 into develop Dec 13, 2022
@amitsagtani97 amitsagtani97 deleted the update_brig_helm_chart branch December 13, 2022 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants