Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of Use strict DNS for mesh gateways with hostnames into release/1.16.x #19395

Conversation

hc-github-team-consul-core
Copy link
Contributor

Backport

This PR is auto-generated from #19268 to be assessed for backporting due to the inclusion of the label backport/1.16.

🚨

Warning automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@andrewstucki
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/consul/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.


Description

This fixes #17557. In an attempt to support mesh gateways fronted by AWS load balancers, a code path for peered mesh gateways was introduced in #14917 that leverages envoy clusters backed by the LOGICAL_DNS cluster discovery type. Problematically, when replicas of mesh gateways exist in a peered connection, the dialing peer will hit this code path and attempt to add multiple endpoints for the targeted mesh gateways. Envoy, however, doesn't support multiple endpoints using LOGICAL_DNS and will start spitting out errors applying the xDS it receives from Consul.

In Kubernetes, when a mesh gateway restarts then, it will never finish initializing and get marked as healthy, so its pod will continually restart and the gateway becomes unusable.

Because this requires using hostnames rather than IP addresses for the WAN addresses registered for mesh gateways, it likely impacts mostly Consul users on AWS, where hostnames are used for LoadBalancer services and thus registered for LoadBalancer type mesh gateways. This will also affect users who manually (or with annotations) register mesh gateways with mutiple FQDNs.

Note that this appears to only affect the dialing cluster in a peered connection, the accepting clusters use a different code path that only ever uses a single mesh gateway target and doesn't attempt to load-balance between multiple mesh gateways.

Testing & Reproduction steps

I was able to recreate this pretty easily outside of AWS by pinning the FQDN of the mesh gateways in the accepting cluster via something like:

meshGateway:
  enabled: true
  replicas: 2
  wanAddress:
    source: "Static"
    static: "gateway.nanosleep.cloud"

which gives this for my dialing cluster:

curl https://${DC2_CONSUL}/v1/peerings ... | jq
...
"PeerServerAddresses": [
  "gateway.nanosleep.cloud:443",
  "gateway.nanosleep.cloud:443"
],
...

And dialing cluster Consul logs then show:

2023-10-17T22:37:29.016Z [ERROR] agent.envoy.xds.mesh_gateway: got error response from envoy proxy: service_id=default/default/consul-consul-mesh-gateway-6ff745887b-5c5s2 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster xdsVersion=v3 nonce=00000006 error="rpc error: code = Internal desc = Error adding/updating cluster(s) server.dc1.peering.303380e1-f1a6-fb04-4ca6-c562e4951539.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint"

and dialing cluster mesh gateway:

2023-10-17T22:36:20.838Z+00:00 [warning] envoy.config(14) gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.dc1.peering.303380e1-f1a6-fb04-4ca6-c562e4951539.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint

Swapping to STRICT_DNS allows the mesh gateway to finish configuration and boot properly.

Links

Strict DNS in envoy.

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

Overview of commits

@hashicorp-cla
Copy link

hashicorp-cla commented Oct 26, 2023

CLA assistant check
All committers have signed the CLA.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto approved Consul Bot automated PR

@vercel vercel bot temporarily deployed to Preview – consul October 26, 2023 20:12 Inactive
@andrewstucki andrewstucki force-pushed the backport/net-4786/mesh-strict-dns/friendly-witty-dodo branch from 3d0d19b to b8a02fe Compare October 27, 2023 15:08
@andrewstucki andrewstucki marked this pull request as ready for review October 27, 2023 15:08
@andrewstucki andrewstucki merged commit 0fd8cdb into release/1.16.x Oct 27, 2023
83 checks passed
@andrewstucki andrewstucki deleted the backport/net-4786/mesh-strict-dns/friendly-witty-dodo branch October 27, 2023 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants