
mesh-gateway fails to start with peered connections in k8s when replicas > 1 #17557

Closed
christophermichaeljohnston opened this issue Jun 2, 2023 · 7 comments · Fixed by #19268

Comments

@christophermichaeljohnston

Overview of the Issue

consul: 1.15.2
consul-k8s-control-plane: 1.1.1

When running the mesh-gateway in k8s with replicas > 1, the gateways on the establishing side of an active peered connection fail to restart (e.g. after a pod failure). This appears to be caused by Envoy rejecting the additional endpoints, which causes the mesh-gateway to eventually terminate, with k8s then attempting to restart it (restart loop). The only way to restore service is to reduce the number of mesh-gateway replicas to 1 on both sides of the peered connection.

Logs from mesh-gateway:

2023-06-02T13:56:36.666Z+00:00 [info] envoy.upstream(14) cds: add 2 cluster(s), remove 0 cluster(s)
2023-06-02T13:56:36.697Z+00:00 [info] envoy.upstream(14) cds: added/updated 0 cluster(s), skipped 1 unmodified cluster(s)
2023-06-02T13:56:36.697Z+00:00 [warning] envoy.config(14) delta config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
2023-06-02T13:56:36.697Z+00:00 [warning] envoy.config(14) gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
2023-06-02T13:56:51.664Z+00:00 [warning] envoy.config(14) gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
2023-06-02T13:56:51.664Z+00:00 [info] envoy.config(14) all dependencies initialized. starting workers
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane.metrics: stopping the merged  server
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane.server-connection-manager: stopping
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane: context done stopping xds server
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane.metrics: stopping consul dp promtheus server
2023-06-02T13:57:26.034Z [INFO]  consul-dataplane: envoy process exited: error="signal: killed"
2023-06-02T13:57:26.036Z [INFO]  consul-dataplane.server-connection-manager: ACL auth method logout succeeded

Logs from consul server:

2023-06-02T13:57:16.602Z [ERROR] agent.envoy.xds.mesh_gateway: got error response from envoy proxy: service_id=consul-mesh-gateway-6bfb4b5b45-2qnl5 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster xdsVersion=v3 nonce=00000001 error="rpc error: code = Internal desc = Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint"
2023-06-02T13:57:16.637Z [ERROR] agent.envoy.xds.mesh_gateway: got error response from envoy proxy: service_id=consul-mesh-gateway-6bfb4b5b45-2qnl5 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster xdsVersion=v3 nonce=00000003 error="rpc error: code = Internal desc = Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint"
2023-06-02T13:58:15.982Z [ERROR] agent.envoy: Error receiving new DeltaDiscoveryRequest; closing request channel: error="rpc error: code = Canceled desc = context canceled"
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 key=mesh topic=MeshConfig
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 key=consul topic=ServiceHealth
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 key=consul topic=ServiceHealthConnect
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 topic=ServiceResolver wildcard_subject=true
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 topic=ServiceList wildcard_subject=true

Reproduction Steps

  1. Stand up 2 clusters in k8s with mesh-gateway replicas > 1.
  2. Peer the clusters.
  3. Terminate a mesh-gateway on the establishing side; it never successfully starts back up.
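For step 1, a minimal Helm values sketch for the consul-k8s chart might look like the following. Field names are assumptions based on the chart's documented layout; verify them against the chart version you are running.

```yaml
# Hypothetical values.yaml fragment for the consul-k8s Helm chart
# (field names assumed; check your chart version).
global:
  name: consul
  peering:
    enabled: true
  tls:
    enabled: true
meshGateway:
  enabled: true
  replicas: 2   # replicas > 1 is what triggers the failure described above
```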
@huikang (Contributor) commented Jun 2, 2023

@christophermichaeljohnston, thanks for reporting. We will try to reproduce and investigate the issue.

@huikang (Contributor) commented Jun 5, 2023

@christophermichaeljohnston, there are some bug fixes in consul-k8s 1.1.2. I am wondering if you can try the new release in your testbed and see whether the issue still exists. Thanks.

@christophermichaeljohnston (Author)

@huikang confirmed that consul-k8s 1.1.2 resolves the issue. Thanks!

@david-yu (Contributor) commented Jun 6, 2023

@christophermichaeljohnston Would it be safe to close this issue?

@christophermichaeljohnston (Author)

Yes. I'll close it. Thanks again!

@christophermichaeljohnston (Author) commented Jun 28, 2023

@huikang It turns out this was not resolved by consul-k8s 1.1.2.

Depending on the order in which things start, the mesh gateway replicas will all come up on both sides of the peering. But if one of the mesh gateway replicas on the dialing side then restarts, it fails with the logs in the original comment. I haven't figured out how to dump the config out of Envoy to confirm, but I wonder if the cluster type should be STATIC instead of LOGICAL_DNS when replicas > 1. Some details here: envoyproxy/envoy#14848
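For context on the error message: Envoy requires that a LOGICAL_DNS cluster carry exactly one endpoint (it re-resolves the single DNS name and connects to whichever address comes back), whereas STRICT_DNS and STATIC clusters accept multiple endpoints. (To inspect what Envoy actually received, its admin interface exposes GET /config_dump; with consul-dataplane the admin endpoint typically listens on localhost inside the pod.) An illustrative sketch of the difference follows; the cluster names and addresses are made up, not what Consul actually generates.

```yaml
# Illustrative Envoy cluster configs (names and addresses are hypothetical).
# LOGICAL_DNS: valid only with a single lb_endpoint -- a second endpoint
# produces exactly the rejection seen in the logs above.
- name: peer-mesh-gateway-logical-dns
  type: LOGICAL_DNS
  load_assignment:
    cluster_name: peer-mesh-gateway-logical-dns
    endpoints:
      - lb_endpoints:
          - endpoint:
              address:
                socket_address: { address: gw.example.com, port_value: 8443 }
# STRICT_DNS (or STATIC with literal IPs): multiple endpoints are allowed,
# so every mesh-gateway replica can be listed.
- name: peer-mesh-gateway-strict-dns
  type: STRICT_DNS
  load_assignment:
    cluster_name: peer-mesh-gateway-strict-dns
    endpoints:
      - lb_endpoints:
          - endpoint:
              address:
                socket_address: { address: 10.0.0.1, port_value: 8443 }
          - endpoint:
              address:
                socket_address: { address: 10.0.0.2, port_value: 8443 }
```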

I did try setting the following, based on https://developer.hashicorp.com/consul/docs/connect/proxies/envoy#envoy_dns_discovery_type, to see if it would change the behavior, but got the same result:

apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  config:
    envoy_dns_discovery_type: STRICT_DNS
  meshGateway:
    mode: local

I wonder if the issue is somewhere around here, https://github.com/hashicorp/consul/blob/67a239a8210e200c6bff5697d511830d606333b7/agent/xds/clusters.go#L1645, using the IP address instead of the hostname.

Might be able to replicate this by adding another endpoint into:

@christophermichaeljohnston (Author) commented Jul 5, 2023

Created here instead: hashicorp/consul-k8s#2509

3 participants