Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NGF Pod cannot recover if NGINX master process fails without cleaning up #1108

Closed
bjee19 opened this issue Oct 2, 2023 · 8 comments · Fixed by #2131
Closed

NGF Pod cannot recover if NGINX master process fails without cleaning up #1108

bjee19 opened this issue Oct 2, 2023 · 8 comments · Fixed by #2131
Assignees
Labels
bug Something isn't working
Milestone

Comments

@bjee19
Copy link
Contributor

bjee19 commented Oct 2, 2023

Describe the bug
When the NGINX master process fails without cleaning up (kill -9 <nginx-master-pid>), the NGF Pod cannot recover because the new NGINX container cannot start.

To Reproduce
Steps to reproduce the behavior:

  1. Change runAsNonRoot from true to false in deploy/manifests/nginx-gateway.yaml
  2. Deploy and expose NGF
  3. Insert an ephemeral container into the NGF Pod using this command: kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway
  4. Run kill -9 <nginx-master-PID> in the ephemeral container
  5. Check the logs of the nginx container in a different terminal by running: kubectl logs -f -n nginx-gateway <NGF_POD> -c nginx

Expected behavior
The NGINX container should restart and the NGF Pod should recover.

Your environment

  • Version of the NGINX Gateway Fabric - "version":"edge","commit":"72b6c6ef8915c697626eeab88fdb6a3ce15b8da0"
  • Version of Kubernetes - 1.27
  • Kubernetes platform (e.g. Mini-kube or GCP) - GKE
  • Details on how you expose the NGINX Gateway Fabric Pod - Loadbalancer

Additional context
Log file of nginx container showing error:
image

@mpstefan mpstefan added this to the v1.2.0 milestone Oct 4, 2023
@mpstefan mpstefan added the bug Something isn't working label Oct 4, 2023
@bjee19
Copy link
Contributor Author

bjee19 commented Oct 5, 2023

Another way to reproduce a similar error (majority of the time) is to

  1. Have NGF deployed on a Kind cluster
  2. Run docker restart kind-control-plane or manually restart the container in the Docker dashboard.

docker restart will send a SIGTERM signal to the processes in the container, but will forcibly shutdown after a short time. I believe that this forced restart is the same as the NGINX master process fails without cleaning up.

@AlexEndris
Copy link

Is this being worked on? Would it help if you'd take the fix from #1532 into the helm chart temporarily, with a flag in the values.yaml, until it's resolved in a better way?

@mpstefan
Copy link
Collaborator

mpstefan commented Mar 6, 2024

Hey @AlexEndris, we actually haven't prioritized this because the only way we could cause it to happen is to kill the nginx process in the pod, which would be a highly unusual case.

I assume you are running into this issue yourself? Can you describe under what circumstances this occurs for you?

@AlexEndris
Copy link

Yes. But I realise it might be a sort of edge case scenario that might not even happen. Essentially, we package a small k8s cluster using k3s. We shut it down and ship it. Upon restarting, nginx doesn't recover and we would need to manually kill the pod/restart the deployment to get it running again. The issue is, I don't have access to that, when it's shipped, and they want an out of the box experience.

@kate-osborn
Copy link
Contributor

Yes. But I realise it might be a sort of edge case scenario that might not even happen. Essentially, we package a small k8s cluster using k3s. We shut it down and ship it. Upon restarting, nginx doesn't recover and we would need to manually kill the pod/restart the deployment to get it running again. The issue is, I don't have access to that, when it's shipped, and they want an out of the box experience.

Thanks for the details @AlexEndris. I added this issue to our community triage meeting agenda scheduled for Monday. We will discuss it then. If you'd like to join, the meeting info is here.

@mpstefan
Copy link
Collaborator

@AlexEndris We discussed this during our community meeting and we think we can take a look at it in our next release. We'd like to first look at the fix in #1532 to see if we can solve the problem in the code, which shouldn't be too bad.

Once we do fix it, you can pull the edge release so you can get the fix before we do another full release if you're looking for something soon.

Thanks for letting us know!

@AlexEndris
Copy link

Thank you very much! It's highly appreciated!

@bjee19
Copy link
Contributor Author

bjee19 commented May 1, 2024

When completed, should remove the Skip() of It("recovers when nginx container is restarted"... added in #1832

@mpstefan mpstefan modified the milestones: v1.3.0, v1.4.0 May 1, 2024
@ciarams87 ciarams87 self-assigned this Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants