Consul webhook injector is not able to register services if one (or many) of worker nodes are down. #779

TomasKohout · 2021-10-12T09:06:53Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

We upgraded to the latest Consul (1.10.2) and the latest Helm chart, then ran a disaster scenario in which we took a whole DC down. Afterwards, Consul Connect pods were unable to start because their services were not registered in Consul.

The main issue is that the webhook injector contains a function that tries to deregister services on all Consul agents, but some of those agents are not reachable because their pods are stuck in the Terminating state. After we force deleted those agent pods, the webhook started to behave as expected.

A possible remedy is to filter out agent pods whose container is not ready.
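The proposed filter can be sketched as follows. This is only an illustration in Python, not the code of the injector itself: it assumes agent pods are available as dictionaries shaped like the Kubernetes Pod JSON (for example from `kubectl get pods -o json`), and the helper names `is_ready`, `ready_agent_pods`, and `make_pod` are hypothetical.

```python
from typing import Any, Dict, List

Pod = Dict[str, Any]


def is_ready(pod: Pod) -> bool:
    """Return True if the pod is not being deleted and reports Ready=True."""
    # kubectl shows a pod as Terminating once metadata.deletionTimestamp is set.
    if pod.get("metadata", {}).get("deletionTimestamp"):
        return False
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def ready_agent_pods(pods: List[Pod]) -> List[Pod]:
    """Keep only the agent pods that are worth contacting."""
    return [p for p in pods if is_ready(p)]


def make_pod(name: str, ready: bool, terminating: bool = False) -> Pod:
    """Build a minimal pod dictionary for the example (hypothetical helper)."""
    metadata: Dict[str, Any] = {"name": name}
    if terminating:
        metadata["deletionTimestamp"] = "2021-10-12T08:30:00Z"
    status = {"conditions": [{"type": "Ready", "status": "True" if ready else "False"}]}
    return {"metadata": metadata, "status": status}


if __name__ == "__main__":
    agents = [
        make_pod("consul-agent-a", ready=True),
        make_pod("consul-agent-b", ready=False),
        make_pod("consul-agent-c", ready=True, terminating=True),
    ]
    for pod in ready_agent_pods(agents):
        print(pod["metadata"]["name"])
```

With a filter like this, deregistration would only be attempted against agents that can plausibly answer, so an agent on a dead node would no longer block the others.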

Reproduction Steps

Kill a worker node ungracefully.
Consul Connect pods will hang in their init container.

Quick fix

Force delete the Consul agent pod that is stuck in the Terminating phase: kubectl delete pod consul-agent-example --force --wait=false

Logs

2021-10-12T08:38:47.996Z	ERROR	controller.endpoints	failed to deregister endpoints on all agents	{"name": "prometheus-node-exporter", "ns": "system-monitoring", "error": "Get \"http://10.121.0.107:8500/v1/agent/services?filter=Meta%5B%22k8s-service-name%22%5D+%3D%3D+%22prometheus-node-exporter%22+and+Meta%5B%22k8s-namespace%22%5D+%3D%3D+%22system-monitoring%22+and+Meta%5B%22managed-by%22%5D+%3D%3D+%22consul-k8s-endpoints-controller%22\": dial tcp 10.121.0.107:8500: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/kohy/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:298
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/kohy/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/kohy/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:214

Expected behavior

Environment details

k8s version: 1.20.11
bare metal
Calico


kschoche · 2021-10-14T18:12:17Z

Hi @TomasKohout ! Thanks for filing this issue.
Would you be able to provide more information on reproducing this? I'm a little confused on the approach because you mentioned deleting a Pod but that Pod would not exist if you'd power-cycled the node it was on.
Could you clarify which pods/nodes you restarted and what their configuration was?
Thanks!

TomasKohout · 2021-10-19T09:47:20Z

@kschoche sorry for the late reply. 🙂

If a node is removed ungracefully, the pods on that node still appear as Running for a short period of time (node lease plus the not-ready toleration) and then switch to Terminating.

The problem is that etcd still contains those pods even when they are in the Terminating phase, so the webhook tries to register and deregister on all Consul agent pods. The agent on the killed node is no longer reachable, and so the injector gets stuck.

When you force delete the dead Consul agent pod, the webhook injector starts to work again.
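As a rough illustration of that workaround, the sketch below picks out agent pods that are stuck in Terminating and builds the force-delete command from the quick fix for each one. It is an assumption-laden example, not part of consul-k8s: pods are plain dictionaries shaped like the Kubernetes Pod JSON, the helper names `is_terminating` and `force_delete_commands` are hypothetical, and the `--namespace` argument and the `consul` namespace are added here for illustration.

```python
from typing import Any, Dict, List

Pod = Dict[str, Any]


def is_terminating(pod: Pod) -> bool:
    """A pod is reported as Terminating once metadata.deletionTimestamp is set."""
    return bool(pod.get("metadata", {}).get("deletionTimestamp"))


def force_delete_commands(pods: List[Pod]) -> List[List[str]]:
    """Build one kubectl force-delete command per Terminating pod."""
    commands = []
    for pod in pods:
        if not is_terminating(pod):
            continue
        meta = pod["metadata"]
        commands.append([
            "kubectl", "delete", "pod", meta["name"],
            "--namespace", meta.get("namespace", "default"),
            "--force", "--wait=false",
        ])
    return commands


if __name__ == "__main__":
    agents = [
        {"metadata": {"name": "consul-agent-a", "namespace": "consul"}},
        {"metadata": {"name": "consul-agent-b", "namespace": "consul",
                      "deletionTimestamp": "2021-10-12T08:30:00Z"}},
    ]
    # Print the commands instead of running them so they can be reviewed first.
    for cmd in force_delete_commands(agents):
        print(" ".join(cmd))
```

The commands are only printed because a pod can also be Terminating during a normal rollout on a healthy node, so the list should be checked against the nodes that are actually down before anything is force deleted.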

I think I mixed up the reproduction steps and the quick fix steps. Sorry for that. I've updated the reproduction steps.

TomasKohout added the type/bug (Something isn't working) label Oct 12, 2021

TomasKohout mentioned this issue Oct 19, 2021

ISSUE-779 - check for agent status before deregistration #795

Closed

TomasKohout changed the title ~~Consul webhook injector is not able to register services if one (or many) of worker nodes is down.~~ Consul webhook injector is not able to register services if one (or many) of worker nodes are down. Nov 3, 2021

ishustava mentioned this issue Jan 25, 2022

connect: Avoid making unnecessary calls to Consul in the endpoints controller #991

Merged

ishustava closed this as completed in #991 Jan 26, 2022

dschaaff mentioned this issue Jan 31, 2022

Endpoints Controller queuing up service registrations/deregistrations when request to agent on a terminated pod does not time out #714

Closed
