
Controller freezes and kube fails to restart based on liveness. #167

Closed
alewitt2 opened this issue Mar 11, 2021 · 3 comments · Fixed by #236
Labels
bug Something isn't working

Comments

alewitt2 (Member) commented Mar 11, 2021

Describe the bug
There are rare instances where our controller stops doing any work and produces no logs for several days, yet kube has not restarted the pod. We are not sure why the liveness check does not cause the pod to be restarted, but one theory is that we are still touching the liveness file while not actually receiving events from kube, for some unknown reason.

ref:

To Reproduce
It is intermittent, rare, and hard to reproduce.

Expected behavior
Kube should restart the pod based on the liveness probe, or the controller should restart itself.

Possible Solution
We know that our watch gets recreated on an interval defined by timeoutSeconds, so we should track when we start watching in watchman.js. If we have not received any data and the connection has not closed within (timeoutSeconds + a few buffer minutes), we should either restart the watch within the code or just exit the process and get restarted in a new container.
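The watchdog described above could be sketched roughly as follows. This is only a sketch under stated assumptions: `timeoutSeconds`, `noteActivity`, `isHung`, and the interval values are illustrative names and numbers, not the actual watchman.js API.

```javascript
// Sketch of a watch watchdog; all names and values are illustrative
// assumptions, not the real razee watchman.js implementation.
const timeoutSeconds = 300;        // assumed watch-recreation interval
const bufferMs = 2 * 60 * 1000;    // extra grace period before declaring a hang

let lastActivity = Date.now();

// Call this whenever the watch emits data or the connection closes.
function noteActivity() {
  lastActivity = Date.now();
}

// Pure check, kept separate so the hang decision is easy to reason about.
function isHung(lastActivityMs, nowMs) {
  return nowMs - lastActivityMs > timeoutSeconds * 1000 + bufferMs;
}

// Periodically verify the watch is still delivering events; if it is not,
// exit so kube reschedules the pod in a fresh container.
function startWatchdog() {
  return setInterval(() => {
    if (isHung(lastActivity, Date.now())) {
      console.error('no watch activity within timeout; exiting');
      process.exit(1);
    }
  }, 30 * 1000);
}
```

Exiting the process (rather than restarting the watch in-process) is the simpler of the two options from the issue, since it lets kube's normal restart machinery take over.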

@alewitt2 alewitt2 added the bug Something isn't working label Mar 11, 2021
alewitt2 (Member, Author) commented:

razee-io/Razee#135

charlesthomas commented:

We've encountered this too, and noticed that when it happens, the logs have not been updated. What if you changed sh/liveness.sh to watch a log file instead of a separate file?
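A log-mtime liveness check along those lines might look like the sketch below. The file paths, the threshold, and the use of GNU/busybox `stat -c %Y` are assumptions; the real sh/liveness.sh may differ.

```shell
#!/bin/sh
# Sketch: pass the liveness probe only if the log file has been written
# to recently. Paths and threshold are illustrative assumptions.

# check_liveness LOG_FILE [MAX_AGE_SECONDS] -> exit status 0 if fresh
check_liveness() {
  log_file="$1"
  max_age="${2:-600}"

  # A missing log file counts as not alive.
  [ -f "$log_file" ] || return 1

  now=$(date +%s)
  mtime=$(stat -c %Y "$log_file")   # GNU/busybox stat; BSD stat uses -f %m
  age=$((now - mtime))

  # Fail if the log has been silent longer than the threshold.
  [ "$age" -le "$max_age" ]
}
```

Wired into the probe, this replaces touching a separate liveness file, so a controller that freezes and stops logging also stops passing the probe.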

alewitt2 (Member, Author) commented Aug 4, 2021

That's a pretty good idea. We don't currently have a log file set up, but I imagine it wouldn't be too hard, and it should be an appropriate way to catch this error path. If the controller is still freezing after that change, we will know something else is wrong with kube that is hanging us and stopping kube from checking our liveness.


2 participants