mesh-gateway fails to start with peered connections in k8s when replicas > 1 #17557
Comments
@christophermichaeljohnston, thanks for reporting. We will try to reproduce and investigate the issue.
@christophermichaeljohnston, there are some bug fixes in consul-k8s 1.1.2. Could you try the new release in your testbed and see if the issue still exists? Thanks.
@huikang confirmed that consul-k8s 1.1.2 resolves the issue. Thanks!
@christophermichaeljohnston Would it be safe to close this issue?
Yes. I'll close it. Thanks again!
@huikang it turns out this was not resolved with consul-k8s 1.1.2. Depending on the order in which things start, the mesh gateway replicas will all start on both sides of the peer. But if one of the mesh gateway replicas on the peering dialer side then restarts, it fails with the logs in the original comment. I haven't figured out how to dump the configs out of envoy to confirm, but I wonder if the type should be STATIC instead of LOGICAL_DNS when replicas > 1. Some details here: envoyproxy/envoy#14848. I did try to set the following, based on https://developer.hashicorp.com/consul/docs/connect/proxies/envoy#envoy_dns_discovery_type, to see if it would change behavior, but got the same result:
I wonder if the issue is somewhere around here: https://github.com/hashicorp/consul/blob/67a239a8210e200c6bff5697d511830d606333b7/agent/xds/clusters.go#L1645C21-L1645C21, using the IP address instead of the hostname. It might be possible to replicate this by adding another endpoint to consul/agent/xds/testdata/clusters/mesh-gateway-with-peer-through-mesh-gateway-enabled.latest.golden (line 39 in 67a239a).
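For anyone wanting to check the cluster discovery type discussed above, here is a minimal sketch in Go that reads Envoy's admin `/config_dump` output and lists each cluster with its `type`. It assumes the mesh gateway's Envoy admin interface is reachable (for example through a port-forward to the pod) and that it listens on `localhost:19000`; that address is an assumption and may differ in your deployment. The helper name `clusterTypes` is made up for illustration, and only the fields needed here are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"sort"
)

// clusterEntry mirrors only the parts of an Envoy clusters config dump
// entry that this sketch needs: the cluster name and its discovery type.
type clusterEntry struct {
	Cluster struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"cluster"`
}

// configDump mirrors the top-level shape of /config_dump. Sections that
// carry no cluster lists simply decode to empty slices.
type configDump struct {
	Configs []struct {
		StaticClusters        []clusterEntry `json:"static_clusters"`
		DynamicActiveClusters []clusterEntry `json:"dynamic_active_clusters"`
	} `json:"configs"`
}

// clusterTypes (hypothetical helper) maps cluster name to the discovery
// type string found in the dump, e.g. "LOGICAL_DNS" or "STATIC".
func clusterTypes(raw []byte) (map[string]string, error) {
	var dump configDump
	if err := json.Unmarshal(raw, &dump); err != nil {
		return nil, fmt.Errorf("decoding config dump: %w", err)
	}
	types := make(map[string]string)
	record := func(entries []clusterEntry) {
		for _, e := range entries {
			t := e.Cluster.Type
			if t == "" {
				t = "(not set)" // the dump did not include a type field
			}
			types[e.Cluster.Name] = t
		}
	}
	for _, c := range dump.Configs {
		record(c.StaticClusters)
		record(c.DynamicActiveClusters)
	}
	return types, nil
}

// run fetches the dump from the given admin URL and prints one line per cluster.
func run(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	types, err := clusterTypes(raw)
	if err != nil {
		return err
	}
	names := make([]string, 0, len(types))
	for name := range types {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s\t%s\n", name, types[name])
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		// The default admin address below is an assumption.
		fmt.Println("usage: go run . http://localhost:19000/config_dump")
		return
	}
	if err := run(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Running this against a working replica and a failing one, and comparing the type reported for the peered cluster, would be one way to confirm or rule out the LOGICAL_DNS theory.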
Created here instead: hashicorp/consul-k8s#2509
Overview of the Issue
consul: 1.15.2
consul-k8s-control-plane: 1.1.1
When running the mesh-gateway in k8s with replicas > 1, the replicas fail to restart (i.e. after a pod failure) on the establishing side of an active peered connection. This appears to be caused by envoy rejecting the additional endpoints, which causes the mesh-gateway to eventually terminate and k8s to attempt to restart it (a restart loop). The only way to restore service is to reduce the number of mesh-gateway replicas to 1 on both sides of the peered connection.
Logs from mesh-gateway:
Logs from consul server:
Reproduction Steps