Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful shutdown with injected envoy-sidecar #536

Closed
svobol13 opened this issue Jun 16, 2021 · 22 comments · Fixed by #2233
Closed

Graceful shutdown with injected envoy-sidecar #536

svobol13 opened this issue Jun 16, 2021 · 22 comments · Fixed by #2233
Labels
area/connect Related to Connect service mesh, e.g. injection type/enhancement New feature or request

Comments

@svobol13
Copy link

svobol13 commented Jun 16, 2021

Envoy proxy sidecar receives SIGTERM at the exact same moment as my main container. In oppossite to my main container (which shuts down in like 15-30 seconds) envoy sidecar shuts down immediately (0.5 - 3s). This means that my main lost upstream connections and cannot gracefully shutdown - even rolling update means lost requests/data.

There should be some kind of mechanism so the upstream listeners exits as last. My main container is Consul Connect enabled and communicate with upstreams through Connect but the service itself is not being accessed through Connect but through Consul DNS instead.

Is there a workaround/hack (some kind of prestop sleep) or do I have get rid of Consul Connect?

Similar issues:

@svobol13 svobol13 added the type/question Question about product, ideally should be pointed to discuss.hashicorp.com label Jun 16, 2021
@pedrohdz
Copy link

Have you tried adding terminationGracePeriodSeconds to your pod? You might end up with #540 instead then.

@ishustava ishustava added area/connect Related to Connect service mesh, e.g. injection type/enhancement New feature or request and removed type/question Question about product, ideally should be pointed to discuss.hashicorp.com labels Jul 22, 2021
@ishustava
Copy link
Contributor

Thanks for the issue @svobol13! This is the problem that is a larger issue in Kubernetes itself in that there is no lifecycle hooks that can help us control container shutdown. We'd need to investigate how to work around this until there's a proper solution in k8s.

@kevin-lindsay-1
Copy link

Posting a reference to a comment that I made after going down this rabbit hole, to hopefully save others some time:
istio/istio#18333 (comment)

@Joxit
Copy link

Joxit commented Dec 30, 2021

@pedrohdz I did some tests and terminationGracePeriodSeconds is not enough. As you can see in this image, the grace period starts after the SIGTERM signal, at this moment it's already too late.
image

The problem really comes from Envoy which should not stop without having completed the last received requests on SIGTERM.

My workaround was add a preStop: sleep 30s on my container + terminationGracePeriodSeconds and to overload the envoy docker image to ignore SIGTERM.

This work fine because consul remove the service from its catalog when the SIGTERM is triggered, so I have 30 second to finish the work.

Unfortunately ingress/terminating gateways suffer from the same problem and we cannot use the same workarounds... 😕 (preStop already set)

@lkysow
Copy link
Member

lkysow commented Jan 14, 2022

I think we need to do something similar to ECS: https://www.consul.io/docs/ecs/architecture#task-shutdown where we deregister the service immediately but keep Envoy running until the application container shuts down.

@dschaaff
Copy link

As a user I really need to be able to control the shutdown. In my case I have cli applications that are only using envoy for outbound connections. Some of these take 1-2 minutes to gracefully stop their current work after receiving the sigterm. During that period envoy needs to stay up and available. What happens now is we get errors because envoy shuts down very quickly. Being able to add a simple prestop hook with a sleep to envoy would make it simple for me to do this.

@dschaaff
Copy link

dschaaff commented Apr 13, 2022

Any updates on this? We just an annotation as suggested in https://github.com/hashicorp/consul-k8s/pull/911/files.

Here is how linkerd handles it https://linkerd.io/2.11/tasks/graceful-shutdown/.

I think this issue should be given really high priority as it is actually impossible to deployment in kubernetes with connection errors. The problem also arises any an HPA scales down pods.

@dschaaff
Copy link

I've resorted to running a custom built binary with a patch containing the changes here #911. It is the only way to run Consul Connect in production at present without getting 5xx errors during deployments and scale downs. The product definitely needs to look at this issue and make it a priority as this is a basic piece of it being production ready.

@david-yu
Copy link
Contributor

Hi @dschaaff thanks for the feedback. We are monitoring this issue as well, aside from other items we have targeted for our next releases tied with Consul Core that are more architecturally related. I can't definitely say when we will address this but I do want to support a native solution in Consul K8s.

@narendrapatel
Copy link

We are also facing this issue wherein our app has some draining configured. But as the pod receives SIGTERM, envoy immediately shuts down while the app is still draining. Graceful termination would be very important and helpful to us especially since its a high traffic app and any connection issues get quickly noticed and reported. As @dschaaff rightly mentioned this is important for release to production.

@dschaaff
Copy link

dschaaff commented Aug 5, 2022

I continue to build a forked image of the control plane binary for each release in order to add a prestop hook to the envoy sidecar. It's quite disappointing that this feature hasn't been added. This issue has been open for over a year and this remains a blocker to production use of consul connect.

@david-yu
Copy link
Contributor

david-yu commented Aug 6, 2022

Hi @dschaaff and @narendrapatel thanks for the feedback. I don't disagree that it is important to address and a blocker to getting to production. Right we are at a point of competing priorities due to large architecture changes within Consul that we are actively working on.

@narendrapatel
Copy link

narendrapatel commented Aug 11, 2022

Hi @dschaaff, If possible, can you please share how are you building the image. Here is what I tried:

I tested it in my local setup and confirm it is working as expected but not sure of the build process.

@dschaaff
Copy link

I use this docker file to build

FROM public.ecr.aws/docker/library/golang:1.18.4-alpine3.15 as build
ARG TARGETOS
ARG TARGETARCH

COPY . /go

RUN cd /go/control-plane && \
	set -x; go build -o pkg/bin/consul-k8s-control-plane

# final image
# we are simply copying our custom built binary over the standard binary in the image
FROM hashicorp/consul-k8s-control-plane:0.46.1

ARG TARGETOS
ARG TARGETARCH

COPY --from=build /go/control-plane/pkg/bin/ /bin

@alt-dima
Copy link

Does anyone use Consul Mesh/Connect in production?
I can't understand how it can be used as is (without patch like this) and avoid errors in application, that needs time to finish jobs?
maybe there is something new/fixed in 1.13.1?

@alt-dima
Copy link

@dschaaff Thank you for the Dockerfile!
I use it with a combination of custom-image and "dynamic entrypoint" (#1397 (comment))

@dschaaff
Copy link

dschaaff commented Dec 6, 2022

The big 1.0 rewrite has been released for a bit. Can anyone from HashiCorp comment on the timeline for fixing this issue? Due to the delay on this and other bugs we are facing we are considering dropping the Consul service mesh.

@nrichu-hcp
Copy link

@dschaaff Were taking this issue very seriously and have a solid idea of potential fixes to alleviate the problems you're having. The timeline is a little gray but with a medium, to strong probability, you will see a fix in the 2023 calendar year.

@coconut30
Copy link

Any news on this issue ?

@oliver-buckley-salmon-db
Copy link

oliver-buckley-salmon-db commented Apr 13, 2023

Hi @nrichu-hcp is there any update? We have multiple teams looking to go live in the next quarter with Consul Service Mesh, which we have spent 2 years arguing for vs Istio / ASM. This issue could easily force us to have to abandon Consul and migrate everyone to Istio/ASM.
If we could just have a quarter date, that would be great.

@dschaaff
Copy link

dschaaff commented Apr 13, 2023

We abandoned the Consul mesh after 2 years in production. We had to run a forked build of the controller to enable adding the pre stop sleep that entire time. In the end we switched to linkerd.

@oliver-buckley-salmon-db

Hi, @nrichu-hcp it looks like this PR fixes the issue, any idea when it will be merged and what release it will be in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connect Related to Connect service mesh, e.g. injection type/enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.