Graceful shutdown is not working as expected with default setup. #4002

davem-git · 2024-08-05T18:13:41Z

Description:
I'm working on implementing envoy-gateway as a replacement for our nginx controller. I have some basic tests: a pod that returns a JSON block when an endpoint is hit. Using k6 as the test suite, I set up the following test.

import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 }, // ramp-up to 50 users
    { duration: '6m', target: 50 }, // stay at 50 users
    { duration: '2m', target: 0 },  // ramp-down to 0 users
  ],
};

export default function () {
  http.get('<url>/'); // <url> is a redacted placeholder for the gateway endpoint
  sleep(1);
}

When I run this test and start a rollout restart of the envoy pods, I get the following errors:

WARN[0070] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49547-><valid public address>:443: read: connection reset by peer"
WARN[0070] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49570-><valid public address>:443: read: connection reset by peer"
WARN[0071] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49587-><valid public address>:443: read: connection reset by peer"
WARN[0080] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49573-><valid public address>:443: read: connection reset by peer"
WARN[0080] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49601-><valid public address>:443: read: connection reset by peer"
WARN[0080] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49594-><valid public address>:443: read: connection reset by peer"
WARN[0080] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49555-><valid public address>:443: read: connection reset by peer"
WARN[0080] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49595-><valid public address>:443: read: connection reset by peer"
WARN[0080] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49589-><valid public address>:443: read: connection reset by peer"
WARN[0082] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49572-><valid public address>:443: read: connection reset by peer"
WARN[0082] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49563-><valid public address>:443: read: connection reset by peer"
WARN[0083] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49566-><valid public address>:443: read: connection reset by peer"
WARN[0488] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49619-><valid public address>:443: read: connection reset by peer"
WARN[0489] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49636-><valid public address>:443: read: connection reset by peer"
WARN[0489] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49623-><valid public address>:443: read: connection reset by peer"
WARN[0489] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49667-><valid public address>:443: read: connection reset by peer"
WARN[0489] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49611-><valid public address>:443: read: connection reset by peer"
WARN[0489] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49604-><valid public address>:443: read: connection reset by peer"
WARN[0489] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49637-><valid public address>:443: read: connection reset by peer"
WARN[0489] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49649-><valid public address>:443: read: connection reset by peer"
WARN[0498] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49642-><valid public address>:443: read: connection reset by peer"
WARN[0498] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49624-><valid public address>:443: read: connection reset by peer"
WARN[0498] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49598-><valid public address>:443: read: connection reset by peer"
WARN[0498] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49830-><valid public address>:443: read: connection reset by peer"
WARN[0498] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49634-><valid public address>:443: read: connection reset by peer"
WARN[0498] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49617-><valid public address>:443: read: connection reset by peer"
WARN[0502] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49643-><valid public address>:443: read: connection reset by peer"
WARN[0502] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49651-><valid public address>:443: read: connection reset by peer"
WARN[0503] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49630-><valid public address>:443: read: connection reset by peer"
WARN[0503] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49660-><valid public address>:443: read: connection reset by peer"
WARN[0503] Request Failed                                error="Get \"<url>": read tcp 192.168.1.99:49613-><valid public address>:443: read: connection reset by peer"
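
The failures in the log above fall into two windows, roughly 70 to 83 seconds and 488 to 503 seconds into the run. As a minimal sketch, assuming the k6 output is saved as text, the timestamps can be grouped into bursts to compare against the times the pods were restarted. The function name `groupFailureBursts` and the `maxGapSeconds` threshold are hypothetical, and the sample is abbreviated from the log.

```javascript
// Group k6 "WARN[ssss]" timestamps into bursts separated by quiet gaps.
// maxGapSeconds is an assumed threshold; tune it for your own test run.
function groupFailureBursts(logText, maxGapSeconds = 60) {
  const seconds = [...logText.matchAll(/^WARN\[(\d+)\]/gm)]
    .map((m) => Number(m[1]))
    .sort((a, b) => a - b);

  const bursts = [];
  for (const t of seconds) {
    const last = bursts[bursts.length - 1];
    if (last && t - last.end <= maxGapSeconds) {
      // Close enough to the previous failure: extend the current burst.
      last.end = t;
      last.count += 1;
    } else {
      // Quiet gap exceeded: start a new burst.
      bursts.push({ start: t, end: t, count: 1 });
    }
  }
  return bursts;
}

// Abbreviated sample in the same format as the log above.
const sample = [
  'WARN[0070] Request Failed error="... connection reset by peer"',
  'WARN[0071] Request Failed error="... connection reset by peer"',
  'WARN[0083] Request Failed error="... connection reset by peer"',
  'WARN[0488] Request Failed error="... connection reset by peer"',
  'WARN[0503] Request Failed error="... connection reset by peer"',
].join('\n');

console.log(groupFailureBursts(sample));
```

If each burst starts shortly after a rollout restart, that would suggest the resets come from connections still being sent to pods that are shutting down.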


When I do this on nginx I do not get these errors.

I added these to my custom proxy config and it seemed to fix the issue
```yaml
shutdown:
  drainTimeout: 600s
  minDrainDuration: 60s
```

However, there's no documentation on this. I happened to find it with `kubectl explain`.

I'm on v1.0.1

arkodg · 2024-08-05T18:29:53Z

@davem-git which L3/L4 load balancer are you using? Have you set up health checks?

davem-git · 2024-08-05T18:34:57Z

I'm hosted in Azure and GCP. The tests ended with similar results on both clouds.

arkodg · 2024-08-05T18:41:06Z

You'll need to set up health checks on the load balancer so it can stop routing to the Envoy that is shutting down (draining). The first step of shutdown is failing health checks, so the LB can route new connections to a newer Envoy pod.

Either use the global one in Envoy (/ready on port 19001)
or a listener-specific one (only available on v1.1): https://gateway.envoyproxy.io/docs/api/extension_types/#healthchecksettings
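
To illustrate what such a health check does, here is a rough sketch of the load balancer's side of the decision: probe each proxy's readiness endpoint and route only to the ones that answer as ready. It assumes the endpoint returns HTTP 200 while ready and a non-200 status once draining. The function names, the example addresses, and the stubbed `fakeFetch` are hypothetical; a real load balancer performs this check itself.

```javascript
// Decide whether an Envoy pod should receive traffic, based on its readiness endpoint.
// Assumption: HTTP 200 means ready, any other status or a network error means not ready.
async function isReady(baseUrl, fetchFn = fetch) {
  try {
    const res = await fetchFn(`${baseUrl}/ready`);
    return res.status === 200;
  } catch {
    // Connection refused or reset: treat the pod as not ready.
    return false;
  }
}

// Keep only the backends that currently pass the readiness check.
async function routableBackends(baseUrls, fetchFn = fetch) {
  const results = await Promise.all(baseUrls.map((u) => isReady(u, fetchFn)));
  return baseUrls.filter((_, i) => results[i]);
}

// Stand-in for real HTTP calls so the sketch runs without a cluster:
// the pod at 10.0.0.2 pretends to be draining.
const fakeFetch = async (url) => ({ status: url.includes('10.0.0.2') ? 503 : 200 });

routableBackends(['http://10.0.0.1:19001', 'http://10.0.0.2:19001'], fakeFetch)
  .then((urls) => console.log(urls));
```

The point is the ordering: the draining pod drops out of rotation as soon as its check fails, while connections it already holds can finish during the drain period.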

davem-git · 2024-08-05T19:25:11Z

How are these set?

davem-git · 2024-08-06T14:30:35Z

Is there any documentation? Envoy Gateway stands up these load balancers. I see I can set annotations, but I don't see any annotations for health checks on GCP.

arkodg · 2024-08-08T23:06:28Z

I do see some health check info for

arkodg · 2024-08-09T22:15:28Z

hey @davem-git, we just merged #4021, which should help you with graceful shutdown on GKE with the default settings (no settings). You can try it out by using the v0.0.0-latest tag of the helm chart.

arkodg · 2024-08-15T01:35:19Z

hey @davem-git did you get a chance to try this out ?

davem-git · 2024-08-15T14:35:22Z

I've looked around and haven't found those settings for all of our cloud providers. I will test it out on Azure today. So far, though, adding the settings I listed above seems to have helped in my testing.

arkodg · 2024-08-15T22:48:57Z

@davem-git with #4021 (which is now available with v0.0.0-latest) you may not need explicit health checks because we've reduced the time to detect failure for the envoy endpoints
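
As a rough worked example of why these values matter: Kubernetes marks a pod unready after `failureThreshold` consecutive probe failures, with one probe every `periodSeconds`, so detection takes about their product (ignoring probe timeouts and the time it takes the change to reach the load balancer). The sketch below compares that estimate with a drain duration. The "tightened" values are hypothetical and are not taken from #4021; the function names are also illustrative.

```javascript
// Rough worst-case seconds before Kubernetes marks a pod unready
// once its readiness probe starts failing.
function readinessDetectionSeconds({ periodSeconds, failureThreshold }) {
  return periodSeconds * failureThreshold;
}

// True when the proxy keeps draining at least as long as detection can take.
function drainCoversDetection(minDrainSeconds, probe) {
  return minDrainSeconds >= readinessDetectionSeconds(probe);
}

// Kubernetes defaults for a probe that sets neither field.
const kubernetesDefaults = { periodSeconds: 10, failureThreshold: 3 };
// Hypothetical tightened values, for illustration only.
const tightened = { periodSeconds: 5, failureThreshold: 1 };

console.log(readinessDetectionSeconds(kubernetesDefaults)); // 30
console.log(readinessDetectionSeconds(tightened)); // 5
// 60 is the minDrainDuration (in seconds) reported earlier in this issue.
console.log(drainCoversDetection(60, kubernetesDefaults)); // true
```

A shorter detection time means a shorter drain can still cover it, which is consistent with both the longer `minDrainDuration` workaround and the probe change reducing errors.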

ncsham · 2024-08-27T06:59:10Z

hey @arkodg, just a basic question

Is there no need to add pod readiness gates on the namespace where the envoy proxies are running (proxies receiving traffic via NLB or ALB)? For example, on EKS with aws-load-balancer-controller running, ref: https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/pod_readiness_gate/

davem-git · 2024-08-27T15:28:58Z

I tried testing this. The health checks failed to work and Envoy-proxy didn't function, so I reverted.

arkodg · 2024-08-27T18:36:08Z

> hey @arkodg, just a basic question
>
> Is there no need to add pod readiness gates on the namespace where the envoy proxies are running (proxies receiving traffic via NLB or ALB)? For example, on EKS with aws-load-balancer-controller running, ref: https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/pod_readiness_gate/

@ncsham afaik the controller reads the endpoint slices of the service from the API server, detects any endpoints that are down (whose readinessProbe has failed), and shouldn't route to them.

github-actions · 2024-09-26T20:02:22Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

davem-git added the triage label Aug 5, 2024

davem-git mentioned this issue Aug 5, 2024

docs: Graceful shutdown for Envoy proxy #2686

Open

arkodg mentioned this issue Aug 9, 2024

reduce readinessProbe failureThreshold and periodSeconds #4021

Merged

github-actions bot added the stale label Sep 26, 2024
