
Readiness probe failed on version 3.3.2 #8441

Closed
omarkalloush opened this issue Apr 21, 2022 · 4 comments · Fixed by #8454
Labels
area/controller · area/manifests · solution/workaround · type/bug

Comments

@omarkalloush

Summary

What happened/what you expected to happen?

Upgraded Argo Workflows from v3.3.1 to v3.3.2. I have 3 workflow-controller pods in the ReplicaSet, and 2 of them stay stuck in an unhealthy state with the following error: Readiness probe failed: Get "http://10.57.3.31:9090/metrics": dial tcp 10.57.3.31:9090: connect: connection refused

What version are you running?

v3.3.2
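
In case it helps to reproduce, this is roughly how the failure shows up (a sketch; the argo namespace and the label selector are assumptions based on the default install):

```bash
# The non-leader controller replicas show 0/1 READY
kubectl -n argo get pods -l app=workflow-controller

# The Events section of a not-ready pod shows the probe failure, e.g.
# Readiness probe failed: Get "http://<pod-ip>:9090/metrics": ... connection refused
kubectl -n argo describe pod <workflow-controller-pod-name>
```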

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Contributor

alexec commented Apr 21, 2022

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

This changed in v3.3. Services only route traffic to ready pods. Metrics requests were being routed from the metrics service to the non-leader controller pods, but those pods cannot field them - their metrics are worse than useless. So this is semantically correct - the non-leader pods are not ready to accept traffic.

These pods will pass their liveness probe.

Could you explain in more detail what you mean by "unhealthy"?

If you need a work-around, and you do not use the metrics service, then you should remove these readiness probes.
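
For example, a minimal sketch of that workaround (assuming the default argo namespace and that the readiness probe sits on the first container of the workflow-controller Deployment):

```bash
# Remove the readiness probe from the workflow-controller Deployment.
# Namespace and container index are assumptions - adjust for your install.
kubectl -n argo patch deployment workflow-controller --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
```

Only do this if you are not scraping the metrics Service, since the probe is what keeps that Service pointed at the leader pod.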

@alexec alexec added the solution/workaround label Apr 21, 2022
@acj

acj commented Apr 22, 2022

I ran into this when upgrading from 3.2.x to 3.3.2 and wanted to share a few notes. My previous deployment used a rolling deploy strategy, which then caused the following error when I tried to apply the upgrade:

error: Deployment.apps "workflow-controller" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy type is 'Recreate'

My initial plan was to continue using the rolling deploy (instead of Recreate), which is when I ran into this issue with the readiness probe. The failing probe blocked the rolling deploy indefinitely because the first new pod never became ready, which might be what OP means by "unhealthy". The discussion in #8283 helped me understand why the new strategy is needed. (Not scraping the non-leader pods fixed several confusing quirks in our metrics, so thanks for the improvement!)

To perform the upgrade, I manually edited the old deployment to use strategy type Recreate, gave it a moment to cycle, and then applied the 3.3.2 manifest. Then the new controller pods started up correctly.
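
A rough sketch of those steps (the argo namespace and the release-manifest URL are assumptions - substitute whatever matches your install):

```bash
# Switch the existing Deployment to the Recreate strategy and clear the
# rollingUpdate block, which may not be set when the type is 'Recreate'.
kubectl -n argo patch deployment workflow-controller \
  -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'

# After the old pods have cycled, apply the v3.3.2 manifest.
kubectl -n argo apply -f https://github.com/argoproj/argo-workflows/releases/download/v3.3.2/install.yaml
```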

@alexec alexec removed the triage label Apr 22, 2022
@alexec
Contributor

alexec commented Apr 22, 2022

We've now seen this issue ourselves. Argo CD treats un-ready pods as un-healthy. While I think that is semantically wrong, it is correct in practice.

alexec added a commit to alexec/argo-workflows that referenced this issue Apr 22, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>
@alexec
Contributor

alexec commented Apr 22, 2022

I'm reverting that fix.

alexec added a commit that referenced this issue Apr 22, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>
alexec added a commit that referenced this issue Apr 25, 2022
Signed-off-by: Alex Collins <alex_collins@intuit.com>