NGF Pod cannot recover if NGINX master process fails without cleaning up #1108

bjee19 · 2023-10-02T22:15:00Z

Describe the bug
When the NGINX master process fails without cleaning up (kill -9 <nginx-master-pid>), the NGF Pod cannot recover because the new NGINX container cannot start.

To Reproduce
Steps to reproduce the behavior:

1. Change runAsNonRoot from true to false in deploy/manifests/nginx-gateway.yaml.
2. Deploy and expose NGF.
3. Insert an ephemeral container into the NGF Pod using this command: kubectl debug -it -n nginx-gateway <NGF_POD> --image=busybox:1.28 --target=nginx-gateway
4. Run kill -9 <nginx-master-PID> in the ephemeral container.
5. In a different terminal, check the logs of the nginx container by running: kubectl logs -f -n nginx-gateway <NGF_POD> -c nginx
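
The kill step needs the PID of the NGINX master. A minimal sketch for looking it up from inside the ephemeral container follows. It assumes that `ps` in the debug container can see the NGINX processes, that the PID is the first column (as in busybox `ps`), and that the master uses the usual `nginx: master process` title. `nginx_master_pid` is a hypothetical helper name, not part of NGF.

```shell
# Hypothetical helper: read `ps` output on stdin and print the PID of the
# first process whose title contains "nginx: master process".
# The [n] bracket keeps awk's own command line from matching the pattern.
nginx_master_pid() {
    awk '/[n]ginx: master process/ { print $1; exit }'
}

# Usage inside the ephemeral container (assumption: PID is the first column):
#   kill -9 "$(ps | nginx_master_pid)"
```

If the helper prints nothing, the NGINX processes are probably not visible from the debug container, and the PID has to be found another way.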

Expected behavior
The NGINX container should restart and the NGF Pod should recover.

Your environment

Version of the NGINX Gateway Fabric - "version":"edge","commit":"72b6c6ef8915c697626eeab88fdb6a3ce15b8da0"
Version of Kubernetes - 1.27
Kubernetes platform (e.g. Mini-kube or GCP) - GKE
Details on how you expose the NGINX Gateway Fabric Pod - Loadbalancer

Additional context
Log file of nginx container showing error:


bjee19 · 2023-10-05T20:51:27Z

Another way to reproduce a similar error (most of the time):

1. Deploy NGF on a Kind cluster.
2. Run docker restart kind-control-plane, or manually restart the container in the Docker dashboard.

docker restart sends a SIGTERM signal to the processes in the container, but forcibly shuts the container down if they have not exited after a short time. I believe this forced restart has the same effect as the NGINX master process failing without cleaning up.

AlexEndris · 2024-03-06T14:49:51Z

Is this being worked on? Would it help if you'd take the fix from #1532 into the helm chart temporarily, with a flag in the values.yaml, until it's resolved in a better way?

mpstefan · 2024-03-06T20:29:12Z

Hey @AlexEndris, we actually haven't prioritized this because the only way we could cause it to happen is to kill the nginx process in the pod, which would be a highly unusual case.

I assume you are running into this issue yourself? Can you describe under what circumstances this occurs for you?

AlexEndris · 2024-03-06T20:55:03Z

Yes. But I realise it might be a sort of edge case scenario that might not even happen. Essentially, we package a small k8s cluster using k3s. We shut it down and ship it. Upon restarting, nginx doesn't recover and we would need to manually kill the pod/restart the deployment to get it running again. The issue is, I don't have access to that, when it's shipped, and they want an out of the box experience.

kate-osborn · 2024-03-06T22:19:56Z

Thanks for the details @AlexEndris. I added this issue to our community triage meeting agenda scheduled for Monday. We will discuss it then. If you'd like to join, the meeting info is here.

mpstefan · 2024-03-11T16:15:15Z

@AlexEndris We discussed this during our community meeting and we think we can take a look at it in our next release. We'd like to first look at the fix in #1532 to see if we can solve the problem in the code, which shouldn't be too bad.

Once we do fix it, you can pull the edge release so you can get the fix before we do another full release if you're looking for something soon.

Thanks for letting us know!

AlexEndris · 2024-03-12T12:54:24Z

Thank you very much! It's highly appreciated!

bjee19 · 2024-05-01T15:56:33Z

When this is completed, we should remove the Skip() on the It("recovers when nginx container is restarted"... test added in #1832.

mpstefan added this to the v1.2.0 milestone Oct 4, 2023

mpstefan added the bug Something isn't working label Oct 4, 2023

pleshakov mentioned this issue Jan 17, 2024

NGINX Plus: dynamic upstream reloads support #1469

Merged

6 tasks

pleshakov mentioned this issue Feb 2, 2024

bind() to unix:/var/run/nginx/nginx-status.sock failed (98: Address in use) #1532

Closed

mpstefan modified the milestones: v1.2.0, v2.0.0 Feb 2, 2024

sjberman mentioned this issue Apr 26, 2024

Automate Graceful Recovery NFR #1832

Merged

6 tasks

mpstefan modified the milestones: v1.3.0, v1.4.0 May 1, 2024

ciarams87 self-assigned this Jun 13, 2024

ciarams87 mentioned this issue Jun 13, 2024

Remove sock files on nginx startup #2131

Merged

6 tasks

sjberman closed this as completed in #2131 Jun 24, 2024
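
The title of #1532 points at a Unix socket file (/var/run/nginx/nginx-status.sock) that stays behind when the master is killed, so the next NGINX start fails on bind(). #2131 is titled "Remove sock files on nginx startup". The snippet below is only a rough sketch of that idea, not the actual change in #2131. The helper name `remove_stale_socks` and the choice to match files by the `*.sock` suffix are assumptions.

```shell
# Hypothetical helper: delete leftover *.sock files in the given directory so
# that a freshly started NGINX master can bind() them again.
remove_stale_socks() {
    for f in "$1"/*.sock; do
        # When nothing matches, the glob stays unexpanded; skip that case.
        [ -e "$f" ] || [ -L "$f" ] || continue
        rm -f -- "$f"
    done
}

# Possible use in a container entrypoint, before starting NGINX
# (the directory is taken from the error message in #1532):
#   remove_stale_socks /var/run/nginx
```

Cleaning up at startup rather than at shutdown matters here because a process killed with kill -9 gets no chance to run any shutdown handling.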
