This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

There doesn't appear to be a way to create an API Gateway, or Gateway per cluster in a federated WAN #300

Closed
codex70 opened this issue Aug 8, 2022 · 38 comments
Labels
type/enhancement New feature or request

Comments

@codex70

codex70 commented Aug 8, 2022

Overview of the Issue

I don't seem to be able to set up API Gateway in such a way that I can either access all mesh services from a single API Gateway or use an API Gateway per cluster.

Reproduction Steps

  1. Set up an initial cluster using Helm charts and create an API Gateway (this all works as expected)
  2. Set up a second federated cluster following the instructions here: https://www.consul.io/docs/k8s/installation/multi-cluster/kubernetes
  3. Services in the second datacenter are not accessible to the API Gateway created in the first datacenter cluster.
  4. Using the federated setup, creating a new API Gateway to access services in the second datacenter fails with SSL connection issues.

Logs

Error when trying to add mesh service from second cluster to API Gateway in first cluster

k get httproute/test-service-route -n test -o jsonpath='{.status}' | jq
{
  "parents": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-08-08T07:38:16Z",
          "message": "1 error occurred:\n\t* route is in an invalid state and cannot bind\n\n",
          "observedGeneration": 2,
          "reason": "BindError",
          "status": "False",
          "type": "Accepted"
        },
        {
          "lastTransitionTime": "2022-08-08T07:38:16Z",
          "message": "k8s: service test/test-service not found",
          "observedGeneration": 2,
          "reason": "ServiceNotFound",
          "status": "False",
          "type": "ResolvedRefs"
        }
      ],
      "controllerName": "hashicorp.com/consul-api-gateway-controller",
      "parentRef": {
        "group": "gateway.networking.k8s.io",
        "kind": "Gateway",
        "name": "api-gateway",
        "namespace": "consul"
      }
    }
  ]
}

Error when trying to connect to a second API Gateway in the second datacenter cluster.

curl -vvi -k --header "Host: test-service.api.gateway" "https://${API}:8443/"
* TCP_NODELAY set
* Connected to X.X.X.X (X.X.X.X) port 8443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to X.X.X.X:8445
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to X.X.X.X:8445

Expected behavior

There is a documented solution for setting up API Gateways across federated clusters.

Environment details

Additional Context

I suspect this is a simple case of me not seeing the specific documentation required to set this up correctly, but I'm having a lot of problems getting the API Gateway up and running across multiple clusters.

@mikemorris
Contributor

mikemorris commented Aug 8, 2022

  1. First, are you using the kind: MeshService backend? The ResolvedRefs status condition you're seeing seems to indicate a failure to resolve a Kubernetes Service named test-service in the test namespace, rather than a Consul service - the standard kind: Service Route backend will only find Kubernetes Services in the same Kubernetes cluster, not Consul services outside the Kubernetes cluster to which Consul API Gateway is deployed.

  2. While this doesn't seem to be documented, I believe the functionality of forwarding traffic to Consul services in other datacenters is not yet supported. Consul service resolution from MeshService uses findCatalogService and doesn't specify a Datacenter parameter for api.QueryOptions, which I believe would limit results to Consul services registered in the same datacenter as the Consul agent serving the API request. If you're trying to reach a service from a different Kubernetes cluster registered in the same Consul datacenter though, this may work, but I haven't tested to confirm.

    services, _, err := r.consul.Catalog().Service(consulName, "", &api.QueryOptions{
        Namespace: consulNamespace,
    })

    If using Consul Enterprise, the Consul namespace will be inferred from the connectInject.consulNamespaces configuration; for Consul OSS deployments it will be the default namespace.

  3. I'm not quite sure what would be causing the TLS error when attempting to deploy an API Gateway in a secondary datacenter, but I believe that functionality is likewise not yet supported.

@mikemorris mikemorris added the type/enhancement New feature or request label Aug 8, 2022
@codex70
Author

codex70 commented Aug 8, 2022

Thanks for getting back to me about this, it definitely helps explain what's going on. I did try MeshService, but it complained about the type (will check the error message, but I suspect I need to apply the following: https://github.com/hashicorp/consul-api-gateway/blob/main/config/crd/bases/api-gateway.consul.hashicorp.com_meshservices.yaml)

I will investigate this in more detail tomorrow and let you know how I get on. I have two options: one is Single Consul Datacenter in Multiple Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/deployment-configurations/single-dc-multi-k8s) and the other is Federation Between Kubernetes Clusters (https://www.consul.io/docs/k8s/installation/multi-cluster/kubernetes). I have managed to get both options working, with varying degrees of success, for cross-cluster and service mesh communication.

Anyway, I will do more testing and update the thread tomorrow.

@mikemorris
Contributor

Missing CRD would definitely explain not being able to use MeshService, make sure you're installing the CRDs as described at https://www.consul.io/docs/api-gateway/consul-api-gateway-install#installation to get Consul API Gateway's custom CRDs (such as MeshService) in addition to the upstream Gateway API CRDs.

Definitely let us know about anything you manage to get working, and we'll consider proper support for federated services as a feature for our roadmap.

@codex70
Author

codex70 commented Aug 10, 2022

@mikemorris, I was hoping to have a look at this, but realised that, whatever configuration changes I have made, the cross-cluster service mesh connection through the mesh gateway is now broken for Kafka. I was running Kafka inside the service mesh and it was working. I've tried to roll back my changes but can't get it working again, and I'm finding the issue difficult to debug. Is it worth mentioning here, should I open another ticket, or is there a better place to seek support for the mesh gateway?

@codex70
Author

codex70 commented Aug 10, 2022

By the way, I checked the CRDs: I had installed them, but for a previous version, so updating them may fix some of the issues. As for the Kafka problem, I've opened a separate issue as it's something very different:
hashicorp/consul#14125
I will get back to you about this as soon as the kafka issue is fixed.

@mikemorris
Contributor

mikemorris commented Aug 16, 2022

Looks like hashicorp/consul-k8s#1344 is tracking the issue currently preventing creation of a Gateway in secondary datacenters in a WAN-federated Consul deployment.

@codex70
Author

codex70 commented Aug 16, 2022

Thanks @mikemorris, as you can see I've added my comment there as well. I've also fixed the issue I had with implementing kafka which now frees me up to do some more testing on the API gateway

@codex70
Author

codex70 commented Aug 17, 2022

@mikemorris I've now been able to do some more testing. If I add kind: MeshService, I get the following error when looking at the route's status:

  "parents": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-08-17T10:33:01Z",
          "message": "1 error occurred:\n\t* route is in an invalid state and cannot bind\n\n",
          "observedGeneration": 2,
          "reason": "BindError",
          "status": "False",
          "type": "Accepted"
        },
        {
          "lastTransitionTime": "2022-08-17T10:33:01Z",
          "message": "unsupported reference type",
          "observedGeneration": 2,
          "reason": "Errors",
          "status": "False",
          "type": "ResolvedRefs"
        }
      ],
      "controllerName": "hashicorp.com/consul-api-gateway-controller",
      "parentRef": {
        "group": "gateway.networking.k8s.io",
        "kind": "Gateway",
        "name": "api-gateway",
        "namespace": "consul"
      }
    }

@codex70
Author

codex70 commented Aug 17, 2022

More importantly though, is there a way of debugging an HttpRoute? I've currently only got one route that's working, the second route looks like everything is correct, but when I try to curl the endpoint, it returns a 404 error. I can't see anything in any of the logs to tell me where the error is.

@mikemorris
Contributor

mikemorris commented Aug 23, 2022

More importantly though, is there a way of debugging an HttpRoute?

How you've been doing it so far is correct - first checking the route status field, then controller logs - if something isn't implemented correctly it may be helpful to dump the actual applied Envoy config, but this should be enough to debug most cases (and when it's not, we could likely benefit from contributions improving status messages, logs, or docs).

A route is only "applied/in effect" when its type: Accepted condition has status: True (hence the 404 for no match), and would only successfully route to a backend when type: ResolvedRefs also has status: True.
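
The rule above can be sketched in Go as a small status check. This is a minimal illustration rather than anything from the controller: describeRoute and the struct names are hypothetical, and the input is assumed to be the JSON printed by kubectl get httproute ... -o jsonpath='{.status}'.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the fields of a Gateway API status condition used here.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// parentStatus is one entry of .status.parents on an HTTPRoute.
type parentStatus struct {
	Conditions []condition `json:"conditions"`
}

type routeStatus struct {
	Parents []parentStatus `json:"parents"`
}

// isTrue reports whether the named condition is present with status "True".
func isTrue(conds []condition, name string) bool {
	for _, c := range conds {
		if c.Type == name {
			return c.Status == "True"
		}
	}
	return false
}

// describeRoute (hypothetical helper) summarizes each parent in a route status.
func describeRoute(statusJSON string) ([]string, error) {
	var st routeStatus
	if err := json.Unmarshal([]byte(statusJSON), &st); err != nil {
		return nil, err
	}
	var out []string
	for _, p := range st.Parents {
		switch {
		case !isTrue(p.Conditions, "Accepted"):
			out = append(out, "not applied: expect 404 (no matching route)")
		case !isTrue(p.Conditions, "ResolvedRefs"):
			out = append(out, "applied, but backend unresolved")
		default:
			out = append(out, "applied and backend resolved")
		}
	}
	return out, nil
}

func main() {
	// Trimmed-down version of the status reported earlier in this thread.
	status := `{"parents":[{"conditions":[
		{"type":"Accepted","status":"False","reason":"BindError"},
		{"type":"ResolvedRefs","status":"False","reason":"ServiceNotFound"}]}]}`
	summary, err := describeRoute(status)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(summary)
}
```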

if I add in kind: MeshService I get the following error when looking at the route's status:

"message": "unsupported reference type",
"status": "False",
"type": "ResolvedRefs"

In addition to specifying kind: MeshService, it would also be necessary to set group: api-gateway.consul.hashicorp.com in that BackendRef, as Group will default to the core API group of kind: Service if unspecified (the mismatch is causing the unsupported reference type error message - it's looking for a MeshService kind in the core API group, where it doesn't exist - if the CRD was installed, it should exist in our implementation-specific group).

This is documented in the Routes configuration docs, but should probably be mentioned in MeshService too.
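
To make the group/kind mismatch concrete, here is a simplified model of the lookup in Go. It is a sketch under stated assumptions, not the controller's real code: backendRef, resolve and errUnsupported are hypothetical names, and only the two group/kind pairs discussed in this thread are modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// backendRef holds the two fields of an HTTPRoute backendRef that matter here.
type backendRef struct {
	Group string
	Kind  string
}

var errUnsupported = errors.New("unsupported reference type")

// resolve is a simplified, hypothetical model of backend lookup.
func resolve(ref backendRef) (string, error) {
	// In the Gateway API an unset kind defaults to Service,
	// and an unset group means the core API group.
	if ref.Kind == "" {
		ref.Kind = "Service"
	}
	switch {
	case ref.Group == "" && ref.Kind == "Service":
		return "kubernetes service", nil
	case ref.Group == "api-gateway.consul.hashicorp.com" && ref.Kind == "MeshService":
		return "consul mesh service", nil
	default:
		// For example, kind MeshService looked up in the core group.
		return "", errUnsupported
	}
}

func main() {
	refs := []backendRef{
		{Kind: "MeshService"}, // group omitted
		{Group: "api-gateway.consul.hashicorp.com", Kind: "MeshService"},
	}
	for _, ref := range refs {
		target, err := resolve(ref)
		fmt.Printf("%+v -> %q, err=%v\n", ref, target, err)
	}
}
```

The first reference fails because the kind is looked up in the core group; the second names the implementation-specific group and resolves.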

@nathancoleman
Member

nathancoleman commented Sep 29, 2022

@codex70 @manobi I recorded a demo yesterday pulling together the 3 related PRs that will be included across the upcoming consul-k8s v0.49.0 and consul-api-gateway v0.5.0 releases to support Gateway per cluster in a federated setup:

Note: This adds support for a Gateway in the secondary datacenter routing to services within the same datacenter. This does not add support for routing from a Gateway in one datacenter to services in another datacenter. This is now reflected in our docs, which will be updated again when the releases referenced above are completed.

CAPIGW.in.Secondary.Datacenter.720p.mp4

@manobi

manobi commented Sep 29, 2022

@nathancoleman I'll try this soon, thank you for sharing.

@manobi

manobi commented Oct 2, 2022

@nathancoleman I've tried with consul-k8s (0.49.0) and hashicorppreview/consul-api-gateway:0.5-dev but still:

2022-10-02T00:09:03.658Z [ERROR] consul/certmanager.go:257: consul-api-gateway-server.cert-manager: error grabbing leaf certificate: error="Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID 'REDACTED' lacks permission 'service:write' on \"consul-api-gateway-controller\")"

This is what it looks like in the Consul UI on "DC2" (AccessorIDs and datacenter names have been redacted):
Screen Shot 2022-10-01 at 22 43 44
Screen Shot 2022-10-01 at 22 40 16
Screen Shot 2022-10-01 at 22 40 24

PS: my DC1 is still running consul-k8s v0.48.0 and has many federated datacenters connected (31), each on a different version.

@nathancoleman
Member

nathancoleman commented Oct 3, 2022

Hi @manobi 👋
I was able to get everything working w/ fresh clusters/datacenters using 0.48.0 for the primary dc and 0.49.0 for the secondary dc. I do notice though that the role for the controller in my case has a policy attached where yours does not. I'm looking into how this could have come to be in your case. Does an analogous policy (api-gateway-controller-policy-<dc_name>) exist in your UI and just isn't attached to the role, or does the policy not exist at all?

PS: any chance you could share your values.yaml files? Also curious if you did an upgrade with the Gateway already existing in your K8s cluster from when you had consul-k8s 0.48.0 installed, or did you recreate it after installing 0.49.0?

image

@manobi

manobi commented Oct 3, 2022

Hi @nathancoleman
The policy does exist, and when the secondary datacenter was created there was already a registered Gateway in the primary dc (v0.48.0).

Screen Shot 2022-10-03 at 16 34 57

apiGateway:
  enabled: true
  image: hashicorppreview/consul-api-gateway:0.5-dev
  managedGatewayClass:
    copyAnnotations:
      service:
        annotations: |
          - service.beta.kubernetes.io/aws-load-balancer-backend-protocol
          - service.beta.kubernetes.io/aws-load-balancer-name
          - service.beta.kubernetes.io/aws-load-balancer-nlb-target-type
          - service.beta.kubernetes.io/aws-load-balancer-scheme
          - service.beta.kubernetes.io/aws-load-balancer-type
          - service.beta.kubernetes.io/aws-load-balancer-ssl-cert
client:
  extraConfig: |
    {
      "leave_on_terminate": true,
      "advertise_reconnect_timeout": "60s",
      "limits": {
        "http_max_conns_per_client": 65535
      }
    }
  priorityClassName: heaviest
  resources:
    limits:
      cpu: 100m
      memory: 350Mi
    requests:
      cpu: 20m
      memory: 200Mi
connectInject:
  default: false
  enabled: true
  metrics:
    defaultEnableMerging: false
    defaultEnabled: false
  resources:
    limits:
      cpu: 50m
      memory: 180Mi
    requests:
      cpu: 50m
      memory: 180Mi
  sidecarProxy:
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 13m
        memory: 81Mi
controller:
  enabled: true
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 50Mi
global:
  acls:
    createReplicationToken: false
    manageSystemACLs: true
    replicationToken:
      secretKey: replicationToken
      secretName: consul-consul-federation
  consulAPITimeout: 5m
  datacenter: qa-ecommerce
  enableGatewayMetrics: true
  federation:
    enabled: true
    k8sAuthMethodHost: <REDACTED>
    primaryDatacenter: dc1
  metrics:
    agentMetricsRetentionTime: 1m
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enableGatewayMetrics: true
    enabled: true
  tls:
    caCert:
      secretKey: caCert
      secretName: consul-consul-federation
    caKey:
      secretKey: caKey
      secretName: consul-consul-federation
    enabled: true
ingressGateways:
  defaults:
    service:
      annotations: |
        "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-ingress-gate"
        "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
        "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
        "service.beta.kubernetes.io/aws-load-balancer-ssl-cert": ""
        "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
      ports:
      - nodePort: null
        port: 443
      type: LoadBalancer
  enabled: false
  gateways:
  - name: ingress-gateway
  resources:
    limits:
      cpu: 400m
      memory: 150Mi
    requests:
      cpu: 160m
      memory: 100Mi
meshGateway:
  enabled: true
  replicas: 1
  resources:
    limits:
      cpu: 300m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi
  service:
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-backend-protocol": "ssl"
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      "service.beta.kubernetes.io/aws-load-balancer-name": "qa-ecommerce-consul-mesh-gateway"
      "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "ip"
      "service.beta.kubernetes.io/aws-load-balancer-scheme": "internal"
      "service.beta.kubernetes.io/aws-load-balancer-type": "nlb-ip"
server:
  extraConfig: |
    {
      "ui_config": {
        "enabled": true,
        "metrics_provider": "prometheus",
        "metrics_proxy": {
          "base_url": "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
        },
        "dashboard_url_templates": {
          "service": "<redacted>"
        }
      }
    }
  extraVolumes:
  - items:
    - key: serverConfigJSON
      path: config.json
    load: true
    name: consul-consul-federation
    type: secret
  nodeSelector: ""
  priorityClassName: heavy
  resources:
    limits:
      cpu: 500m
      memory: 700Mi
    requests:
      cpu: 250m
      memory: 400Mi
ui:
  metrics:
    baseURL: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
    enabled: true
    provider: prometheus

@nathancoleman
Member

@manobi if you apply that policy to the role analogous to the one I screenshotted, does everything work for you standing up a Gateway in the secondary dc?

@manobi

manobi commented Oct 3, 2022

@nathancoleman From the UI it's not working: the browser crashes while loading the policy options. Maybe there are too many roles/policies, and the same error happens during token bootstrap?

consul acl policy list -token=<redacted> | grep ID | wc -l
252
consul acl role update -id=16382188-2b3f-a628-a434-af342bf2f97e -policy-id=d1acd2a4-bffc-7ddf-63b5-14af3f338417 -token=<redacted>

After that the consul-api-gateway-controller seems to be running, but how can I make sure it will work the next time I upgrade?

@nathancoleman
Member

nathancoleman commented Oct 3, 2022

@manobi I'm hoping to understand why it failed in this case. Any chance you have the logs from the consul-api-gateway-controller pod's api-gateway-controller-acl-init container when this failed? It seems like the logic to bind the policy to the role here failed

@manobi

manobi commented Oct 3, 2022

Even after the manual attachment, the api-gateway-controller-acl-init container failed twice before it started running, with the following logs:

2022-10-03T20:14:33.393Z [INFO] Consul login complete
2022-10-03T20:14:33.393Z [INFO] Checking that the ACL token exists when reading it in the stale consistency mode
2022-10-03T20:14:33.394Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.497Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.598Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.701Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.803Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:33.905Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.008Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.110Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.214Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.316Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.418Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.520Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.623Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.725Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2022-10-03T20:14:34.827Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"

I've noticed a similar behaviour with mesh-gateway and controller components as well.
After your direction and the UI crashing I'm starting to believe it is skipping the binding rules list somehow, when there are many items to process.

It might not be related to api-gateway but rather be a consul-k8s bug.

@nathancoleman
Member

@manobi that would make sense as the possible cause. That scale is the main difference between my temporary setups and your own. I'll be traveling most of this week but will see if I can find out anything once I'm back.

@mikemorris
Contributor

mikemorris commented Oct 4, 2022

The 403 (ACL not found) errors look like they could be a manifestation of hashicorp/consul-k8s#887

@nathancoleman could we maybe implement the same workaround as consul-ecs did in hashicorp/consul-ecs#79 until Consul adds "read your writes" support for an improved consul login UX (without the performance overhead of switching to consistent reads)?

@manobi

manobi commented Oct 4, 2022

@mikemorris
Given that my api-gateway-controller is running and I have deployed the Gateway resource, when I apply the ReferenceGrant and HTTPRoute in my secondary dc the routing does not seem to be working.

Is there a way to debug whether the routing has actually been registered? Unlike for Gateways in the primary dc, the Consul UI does not show connections between the gateway and the target service.

With log-level=trace enabled I saw the following status:

"conditions": [
  {
    "type": "Ready",
    "status": "True",
    "observedGeneration": 1,
    "lastTransitionTime": "2022-10-04T22:52:16Z",
    "reason": "Ready",
    "message": "Ready"
  },
  {
    "type": "Scheduled",
    "status": "True",
    "observedGeneration": 1,
    "lastTransitionTime": "2022-10-04T22:52:16Z",
    "reason": "Scheduled",
    "message": "Scheduled"
  },
  {
    "type": "InSync",
    "status": "False",
    "observedGeneration": 1,
    "lastTransitionTime": "2022-10-04T22:52:16Z",
    "reason": "SyncError",
    "message": "error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n"
  }
],

The HTTPRoute resource status seems to be OK, but routing is not working:

status:
  parents:
    - conditions:
        - lastTransitionTime: '2022-10-04T23:04:20Z'
          message: Route accepted.
          observedGeneration: 1
          reason: Accepted
          status: 'True'
          type: Accepted
        - lastTransitionTime: '2022-10-04T23:04:20Z'
          message: ResolvedRefs
          observedGeneration: 1
          reason: ResolvedRefs
          status: 'True'
          type: ResolvedRefs

Upstreams in secondary DC (0):
Screen Shot 2022-10-04 at 20 15 12

Upstreams in primary DC (1):
Screen Shot 2022-10-04 at 20 15 42


consul-k8s proxy read <gateway-pod-name> -context=dc2:

==> Clusters (3)
==> Endpoints (3)
==> Listeners (1)
==> Routes (1)
==> Secrets (2)

consul-k8s proxy read <gateway-pod-name> -context=dc1:

==> Clusters (6)
==> Endpoints (6)
==> Listeners (2)
==> Routes (1)
==> Secrets (2)

@nathancoleman
Member

Hi @manobi , were you able to get this working? Just to clarify, your Gateway, HTTPRoute, ReferenceGrant and backend Service that the route is targeting are all in the secondary datacenter, correct?

@manobi

manobi commented Oct 12, 2022

Hi @manobi , were you able to get this working? Just to clarify, your Gateway, HTTPRoute, ReferenceGrant and backend Service that the route is targeting are all in the secondary datacenter, correct?

Yes, they are all running in the secondary datacenter, but I have not been able to get this working. I'm still seeing the following in api-gateway-controller:

error adding ingress config entry: 1 error occurred:\n\t* Unexpected response code: 403 (rpc error making call: rpc error making call: rpc error making call: Permission denied: token with AccessorID '0323cd06-e494-1d61-2cc9-3f8570954046' lacks permission 'mesh:write')\n\n

How can I force this "mesh:write" permission ?

@manobi

manobi commented Oct 13, 2022

if err := a.setConfigEntries(ctx, addedDefaults...); err != nil {

The gateway deployment is running in the secondary datacenter, but there is no service-defaults or ingress-gateway config entry registered.
What policy should api-gateway-controller use to be able to register those configs?

@nathancoleman
Member

nathancoleman commented Oct 13, 2022

@manobi I'd expect it to be using api-gateway-controller-policy-<datacenter> which has the higher-level operator = "write" permission. You can see what I'm expecting in the screenshot a ways up #300 (comment).

It makes sense that the config entries aren't registered because the controller isn't able to create them in your setup. I'm not yet sure why this is, and I haven't been able to reproduce it myself.

Just to be certain, to replicate your setup, I need consul-k8s v0.48.0 in my primary datacenter and consul-k8s v0.49.0 in my secondary datacenter. Is that accurate? Are you using consul-api-gateway v0.5-dev in both datacenters?

@manobi

manobi commented Oct 13, 2022

@nathancoleman The only way I've managed to make it work was by attaching the controller-policy to the api-gateway-controller token.

My current setup is the following one:

Primary datacenter:

  • consul-k8s: v0.48.0
  • hashicorp/consul-api-gateway:0.4.0
  • hashicorp/consul:1.13.2

Secondary datacenter:

  • consul-k8s:v0.49.0
  • hashicorppreview/consul-api-gateway:0.5-dev-b98d845e31176332d7c65884f08d1e95ff2897c6
  • hashicorp/consul:1.13.2

@nathancoleman
Member

nathancoleman commented Oct 13, 2022

@manobi here's a writeup of the whole process I went through to replicate the issue, but I'm still seeing everything work. I figure at least this will show what the Kubernetes Deployment and Consul roles+policies for the consul-api-gateway-controller should look like. Can you take a look and let me know if anything I'm doing doesn't match your setup or if you can identify the diff between my resulting config and yours? Feel free to comment right on the gist if you like.

https://gist.github.com/nathancoleman/076343780c3e0b4c03fb91f9d4f84616

@manobi

manobi commented Oct 14, 2022

@nathancoleman thank you, I'll try to reproduce your steps.
The manual changes I made allowed me to test other things. Do you think something changed in 0.5 that would break URLRewrite?

The service router is not reading the filters with URLRewrite:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: my-service
  namespace: consul
spec:
  parentRefs:
  - name: digital-api-qa
  rules:
    - matches:
      - path:
          type: PathPrefix
          value: "/my-service/v1"
      backendRefs:
        - kind: Service
          name: my-service
          namespace: my-service
          port: 80
          weight: 100
      filters:
      - type: URLRewrite
        urlRewrite:
          path:
            type: ReplacePrefixMatch
            replacePrefixMatch: "/api/v1"

Becomes:

{
    "Kind": "service-router",
    "Name": "digital-api-qa-735653bb",
    "Routes": [
        {
            "Match": {
                "HTTP": {
                    "PathPrefix": "/my-service/v1"
                }
            },
            "Destination": {
                "Service": "my-service",
                "RequestHeaders": {}
            }
        }
    ],
    "Meta": {
        "consul-api-gateway/k8s/Gateway.Name": "digital-api-qa",
        "consul-api-gateway/k8s/Gateway.Namespace": "consul",
        "external-source": "consul-api-gateway"
    },
    "CreateIndex": 242705,
    "ModifyIndex": 242705
}

@nathancoleman
Member

@manobi thanks for calling that out. Fixed in #414

@nathancoleman
Member

@manobi I'm asking around to see if anyone has encountered issues like the role bindings failing to apply at a scale of hundreds of roles/policies.

My understanding is that the missing role bindings are the only issue you're seeing at this point (given the fix in #414) and that everything works as expected when you manually apply those bindings. Is that accurate?

@manobi

manobi commented Oct 17, 2022

@nathancoleman Accurate. The ACL not found error is not restricted to API gateway, I can see it in other components that eventually reconcile.

It might be the problem mentioned by @mikemorris, if I have to rolebind manually it's not a huge problem.

I was more worried when I had no idea what was going on.
Thank you.

@manobi

manobi commented Oct 17, 2022

@nathancoleman Will the #414 fix be automatically published to Docker Hub or is it a manual action?
I'm looking forward to getting my hands on it, and I may create another issue in consul-k8s to investigate the race condition in ACLs, as it looks like there are no problems in consul-api-gateway itself.

Seems unfair to hold the v0.5 release if there are no other issues.

@nathancoleman
Member

nathancoleman commented Oct 17, 2022

@manobi you'll see it published to Docker Hub in a few minutes after I merge #416. The merge of #414 itself didn't publish because our tooling identified the CVE referenced in #416.

Edit: You can now see an updated set of tags out on https://hub.docker.com/r/hashicorppreview/consul-api-gateway/tags

@manobi

manobi commented Oct 17, 2022

Just to confirm that I've got URLRewrite working again with: hashicorppreview/consul-api-gateway:0.5-dev-55da4a56cda79d0e97a7f2d40f503923ff57ba62

Thank you @nathancoleman

@nathancoleman
Member

nathancoleman commented Oct 27, 2022

@codex70 @manobi I believe this particular issue can be closed now but wanted to run it by you first. Thoughts?

The upcoming v0.5.0 release of Consul API Gateway will allow you to run the API gateway controller and create Gateways that route to services within the same datacenter whether that datacenter is a primary or secondary datacenter.

@manobi

manobi commented Oct 27, 2022

We should close it.
Thanks

@codex70
Author

codex70 commented Nov 2, 2022

Just to confirm I have been able to test this and it is now working following on from the fix for: hashicorp/consul-k8s#1344

@codex70 codex70 closed this as completed Nov 2, 2022

4 participants