Istio Authservice - Envoy filters/denies traffic to Keycloak token endpoint - AWS EKS environment #1111

Open
brcourt opened this issue Dec 12, 2024 · 18 comments

Comments

@brcourt

brcourt commented Dec 12, 2024

So far, I should be simply enabling Authservice for a UDS package I currently have running in my cluster. I've been struggling to understand why the connection to the token endpoint is dropped, but even with debug level logs enabled for Authservice, istio-sidecar, and Keycloak, there isn't any indication of why the connection is failing.

Using Authservice for the first time with UDS, and the OIDC flow works correctly, getting a session token and hitting the auth endpoint, however the last step of hitting the token endpoint fails and the connection is dropped. Below are the logs I've been able to find for this request.

Authservice / Istio-Sidecar debug logs

time="2024/12/11 18:01:03" level=info msg="performing request to retrieve new tokens" x-request-id="93985b6a-a7d5-419b-977c-dd7ef0a21351" scope="authz" type="oidc" session-id="KfwJkLzbrXFsBNPAvijpswrJlyNUpVDRRlfIVXTcsirBb7QXPZ5kl6P7XfeOlhhO"
time="2024/12/11 18:01:03" level=debug msg="request" scope="idp" data="POST /realms/uds/protocol/openid-connect/token HTTP/1.1\r\nHost: sso.uds.my-domain.com\r\nauthorization: Basic dWRzLXBhY2thZ2UtYXJyYWtpczpFTWtERmJuRkozVVA3UFU3WUM3S3VKTnRzQklEUEp1SQ==\r\ncontent-type: application/x-www-form-urlencoded\r\n\r\ncode=ea9991df-fb82-40bc-842c-b18dbcbf727d.fea0b5e3-1095-4ff3-b69b-7a9bff03d2db.9fa2ae93-4034-4da0-ab12-cdf6f3badca4&code_verifier=dq_rPQsuE3RjUGMm7lo-MUgV_L5aAX1raQNiOy3gzdA&grant_type=authorization_code&redirect_uri=https%3A%2F%2Fmy-app.uds.my-domain.com%2Foauth2%2Fcallback"
time="2024/12/11 18:01:13" level=error msg="error performing tokens request to OIDC" x-request-id="93985b6a-a7d5-419b-977c-dd7ef0a21351" scope="authz" type="oidc" session-id="KfwJkLzbrXFsBNPAvijpswrJlyNUpVDRRlfIVXTcsirBb7QXPZ5kl6P7XfeOlhhO" error="Post "https://sso.uds.my-domain.com/realms/uds/protocol/openid-connect/token": read tcp 10.4.136.219:36048->15.200.41.154:443: read: connection reset by peer"
time="2024/12/11 18:01:13" level=debug msg="process result" x-request-id="93985b6a-a7d5-419b-977c-dd7ef0a21351" scope="authz" type="oidc" session-id="KfwJkLzbrXFsBNPAvijpswrJlyNUpVDRRlfIVXTcsirBb7QXPZ5kl6P7XfeOlhhO" allow=false status="Internal"
time="2024/12/11 18:01:13" level=debug msg="filter result" x-request-id="93985b6a-a7d5-419b-977c-dd7ef0a21351" scope="authz" chain="placeholder" chain="uds-package-my-app" index=0 allow=false error=<nil>
time="2024/12/11 18:01:13" level=debug msg="response" x-request-id="93985b6a-a7d5-419b-977c-dd7ef0a21351" scope="requests" method="/envoy.service.auth.v3.Authorization/Check" data="{\"status\":{\"code\":13}, \"deniedResponse\":{\"headers\":[{\"header\":{\"key\":\"cache-control\", \"value\":\"no-cache\"}}, {\"header\":{\"key\":\"pragma\", \"value\":\"no-cache\"}}]}}" error=<nil>

Keycloak logs

time="2024/12/11 14:59:51" level=info msg="session id cookie is missing" x-request-id="3c3f22bf-7a3c-455e-98fe-7595295f3e93" scope="authz" type="oidc" cookie-name="__Host-uds-package-my-app-authservice-session-id-cookie"
time="2024/12/11 14:59:51" level=info msg="no session cookie detected. Generating new session and sending user to re-authenticate" x-request-id="3c3f22bf-7a3c-455e-98fe-7595295f3e93" scope="authz" type="oidc"
time="2024/12/11 15:01:22" level=info msg="performing request to retrieve new tokens" x-request-id="c08ccc30-15b8-4fad-818d-826bf8f4420f" scope="authz" type="oidc" session-id="HRTwuVw4Uq3xcKfVQ3Xi9UWLGnFbyW1nSVj75eXo7XdAd5eVq7ld1KJDUOFB9XEA"
time="2024/12/11 15:01:32" level=error msg="error performing tokens request to OIDC" x-request-id="c08ccc30-15b8-4fad-818d-826bf8f4420f" scope="authz" type="oidc" session-id="HRTwuVw4Uq3xcKfVQ3Xi9UWLGnFbyW1nSVj75eXo7XdAd5eVq7ld1KJDUOFB9XEA" error="Post "https://sso.uds.my-domain.com/realms/uds/protocol/openid-connect/token": net/http: TLS handshake timeout"

I should be simply enabling Authservice for a service that does run, although obviously with error since I am not yet providing it with an id_token in the Authorization header. My package is configured like so:

sso:
    - name: MyApp Login
      clientId: uds-package-my-app
      redirectUris:
        - "https://my-app.{{ .Values.domain }}/oauth2/callback"
      enableAuthserviceSelector:
        component: apiServer # my custom label

and as you can see, I've been enabling ALL the things:

allow:
      - direction: Ingress
        remoteGenerated: Anywhere

      - direction: Egress
        remoteGenerated: Anywhere

      - direction: Ingress
        remoteGenerated: IntraNamespace

      - direction: Egress
        remoteGenerated: IntraNamespace

      # Egress allowed to Keycloak
      - direction: Egress
        selector:
          component: apiServer
        remoteNamespace: keycloak
        remoteSelector:
          app.kubernetes.io/name: keycloak
        description: "SSO Provider"
      # SSO
      - direction: Egress
        remoteNamespace: keycloak
        remoteSelector:
          app.kubernetes.io/name: keycloak
        selector:
          component: apiServer
        port: 8080
        description: "SSO Internal"

      - direction: Ingress
        remoteNamespace: authservice
        remoteSelector:
          app.kubernetes.io/name: authservice
        selector:
          component: apiServer
        description: "Auth service"

      - direction: Egress
        remoteNamespace: istio-tenant-gateway
        remoteSelector:
          app: tenant-ingressgateway
        selector:
          component: apiServer
        port: 443
        description: "SSO External"

I have the AuthorizationPolicies and RequestPolicies that I am expecting, and the authservice-uds secret contains all the correct information I would expect for my client.

Other perhaps important information

  • The cluster is running in EKS, in a GovCloud environment
  • The nodes are running the latest Bottlerocket AMIs with the CIS Level 2 hardening bootstrap script
  • The tenant and admin ingress gateway loadbalancers have all traffic allowed from the entire VPC
  • I have updated the coredns Corefile with the rewrite rules from the dns.sh shell script I found in your "Narwhal" swf repo

I am at a loss for what I can do next to troubleshoot the issue. As far as I can tell, I am simply turning on Authservice and pointing it at my running service. I have made no changes to the Keycloak clients that were created or the Authorization/Request Policies or exemptions. I've allowed all the traffic across the cluster...

I'm having a really frustrating experience trying to find out what is going on without any revealing information from the logs. I assume there is something that I am missing, or some caveat perhaps when running in EKS or maybe bottlerocket, but I didn't see any references in your documentation or any issues/PRs in this repo, hence I would like to see if there is something wrong, or perhaps an opportunity to improve the UDS-Core documentation.

@brcourt
Author

brcourt commented Dec 12, 2024

On a related note, since getting the UDS cluster running, I have had issues authenticating with Neuvector. I sometimes am able to login via SSO, other times, I get an authentication error. When I am able to login to Neuvector, I only ever see 2 nodes being scanned (out of 3). The enforcers show all 3 pods running, and I can see in the Neuvector dashboard the node names the enforcers are running on, so it definitely should not be showing only 2 nodes.

I haven't dug into why that is happening yet, but it just occurred to me that they may be related.

@brcourt brcourt changed the title Istio Authservice - Envoy filters/denies traffic to Keycloak token endpoint Istio Authservice - Envoy filters/denies traffic to Keycloak token endpoint - AWS EKS environment Dec 12, 2024
@brcourt
Author

brcourt commented Dec 12, 2024

I was able to replace the JWKS and token URLs in the authservice-uds secret with local cluster endpoints - http://keycloak-http.keycloak.svc:8080/..., and that allowed the token to get pulled.

I am now, however, getting an RBAC: access denied error in the browser after authenticating. This is coming from Istio, not my application. Nothing in the Istio debug logs shows why this is happening. I don't like my workaround, and I would still like to find out what I am doing wrong and update the docs to make this process less painful.

Any help would be appreciated, thanks!

@mjnagel
Contributor

mjnagel commented Dec 12, 2024

The RBAC: access denied message you're receiving after login I believe is expected based on the modifications you made. The token that Authservice retrieved likely is from the wrong issuer causing that failure. I think it may be more useful to chase down the original error since using those external addresses should work for token/jwks.

I know we currently have environments running in EKS in GovCloud so I don't anticipate any major hurdles there. In terms of your configuration are you setting a valid domain + cert, with DNS setup for those? I noticed the mention of coredns and the DNS script, I'm not overly familiar with the referenced script and am wondering if that could be causing some of the issues here. Would you be able to shell into one of your pods in cluster and perform a curl request to sso.uds.my-domain.com (making sure that it resolves to what you expect and don't encounter any errors)?

The two errors I'm seeing in the logs you provided are read: connection reset by peer and net/http: TLS handshake timeout. Do you happen to see anything in the istio-proxy logs for Keycloak when those requests come through?

@brcourt
Author

brcourt commented Dec 13, 2024

In terms of your configuration are you setting a valid domain + cert, with DNS setup for those?

Yes, I have a valid cert and DNS for the tenant and admin endpoints

I'm not overly familiar with the referenced script and am wondering if that could be causing some of the issues here.

This issue occurs with or without the updated CoreDNS config. Link to the script below.
https://github.com/defenseunicorns/narwhal-delivery-iac-swf-reference-deployment/blob/main/hack/dns.sh

Would you be able to shell into one of your pods in cluster and perform a curl request to sso.uds.my-domain.com (making sure that it resolves to what you expect and don't encounter any errors)?

Pods can NOT reach the sso.uds.my-domain.com domain or the default NLB domain.

Do you happen to see anything in the istio-proxy logs for Keycloak when those requests come through?

No logs from any Keycloak pods when the timeout happens

@brcourt
Author

brcourt commented Dec 13, 2024

Enabling Istio DNS proxying on cm/istio seems to resolve the issue, or at least gets me to the RBAC: access denied error.

proxyMetadata:
    # Enable basic DNS proxying
    ISTIO_META_DNS_CAPTURE: "true"
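
To confirm that the setting is actually present in the mesh config, a small shell sketch like the one below may help. It assumes the ConfigMap is named istio, lives in the istio-system namespace, and stores the mesh config under the mesh key; both function names are hypothetical.

```shell
# Print the mesh config stored in the istio ConfigMap.
# Assumption: the namespace defaults to istio-system and the data key is "mesh".
print_mesh_config() {
  kubectl get configmap istio -n "${1:-istio-system}" -o jsonpath='{.data.mesh}'
}

# Read mesh config text on stdin; succeed only if an uncommented line
# sets ISTIO_META_DNS_CAPTURE to true.
dns_capture_enabled() {
  grep -Eq '^[[:space:]]*ISTIO_META_DNS_CAPTURE:[[:space:]]*"?true"?[[:space:]]*$'
}
```

Usage would look like: print_mesh_config | dns_capture_enabled && echo "DNS capture is set". This only checks the mesh config text, not what a running sidecar has loaded.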

@mjnagel
Contributor

mjnagel commented Dec 13, 2024

Interesting, that's part of why I asked about DNS - if the DNS address for your SSO provider resolves to localhost you would hit an issue similar to this one. Not sure if that's part of what you're encountering?

With the RBAC: access denied you should be seeing a log entry on your application's sidecar (the one protected by Authservice) that hopefully has more information. Is this after reverting the changes to the authservice-uds secret as well?

@brcourt

This comment was marked as resolved.

@brcourt
Author

brcourt commented Dec 13, 2024

Hmmm I added a coredns rewrite rule so that requests to SSO domain are routed to the tenant-ingressgateway virtual service. I am still seeing the same error.

time="2024/12/13 19:51:47" level=error msg="error performing tokens request to OIDC" x-request-id="6b35827f-a854-408e-b8be-feac602312b4" scope="authz" type="oidc" session-id="NQmgJGeB5dJhfXJWp80fOfDHp57fGLbQNbELy7SSb7zYSyIj3KC8tzKKnrfEWeNZ" error="Post "https://sso.uds.my-domain.com/realms/uds/protocol/openid-connect/token": read tcp 10.4.136.38:34440->172.20.44.78:443: read: connection reset by peer"

The 172.20.44.78:443 address is correct, that is the address of the ingress gateway. Why would envoy be dropping traffic that is being routed locally?

istio proxy-config endpoint for authservice:
endpoint/172.20.44.78:443 HEALTHY outbound|443||sso.uds.beta.dev.dreadnought.teambespin.us
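
For anyone reproducing the endpoint listing above, a sketch along these lines should work. It relies on istioctl proxy-config endpoints and filters the output by host with grep; the function name and the pod, namespace, and host arguments are placeholders, not values from this cluster.

```shell
# List the Envoy endpoints that a pod's sidecar knows for a given host.
# Usage: endpoints_for_host <pod> <namespace> <host>
endpoints_for_host() {
  istioctl proxy-config endpoints "$1" -n "$2" | grep -F "$3"
}
```

An empty result would suggest the sidecar has no outbound cluster for that host.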

@brcourt
Author

brcourt commented Dec 13, 2024

Interesting, that's part of why I asked about DNS - if the DNS address for your SSO provider resolves to localhost you would hit an issue similar to this one. Not sure if that's part of what you're encountering?

With the RBAC: access denied you should be seeing a log entry on your application's sidecar (the one protected by Authservice) that hopefully has more information. Is this after reverting the changes to the authservice-uds secret as well?

The DNS address for the UDS SSO provider URL resolves to a publicly available, external AWS Network Load Balancer.

Also, in the last 3 days, I have not been able to get a single request to hit my application or sidecar after obtaining a token. I can't even test my application until that happens, since it relies on Authservice to provide a token.

And yes, I reverted my changes to authservice-uds. I created a brand new cluster and the issue persists. It seems likely to be some infrastructure/EKS/AWS configuration that is incompatible with Authservice in UDS, although I'm not sure what that could be...

@mjnagel
Contributor

mjnagel commented Dec 13, 2024

Hmmm I added a coredns rewrite rule so that requests to SSO domain are routed to the tenant-ingressgateway virtual service. I am still seeing the same error.

The DNS address for the UDS SSO provider URL resolves to a publicly available, external AWS Network Load Balancer.

I know you mentioned that the coredns change didn't seem to affect it one way or the other, but just to explain it - that should only be necessary if your SSO URL does not have a valid DNS entry setup for it. As long as DNS resolves for the URL (to something other than localhost) you shouldn't need the coredns script. That's primarily used as a local dev workaround to allow using a domain that resolves everything to localhost.
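
The rule described here (the host must resolve, and not to localhost) can be sketched as a small shell check. The function names are hypothetical, and the address is expected to come from a lookup performed inside a pod.

```shell
# Succeed if the given address is a loopback address (IPv4 127.0.0.0/8 or IPv6 ::1).
is_loopback_addr() {
  case "$1" in
    127.*|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify a resolved address as unresolved, loopback, or non-loopback.
describe_resolution() {
  if [ -z "$1" ]; then
    echo "unresolved"
  elif is_loopback_addr "$1"; then
    echo "loopback"
  else
    echo "non-loopback"
  fi
}
```

Inside a pod where getent is available (an assumption about the image), something like describe_resolution "$(getent hosts sso.uds.my-domain.com | awk '{print $1; exit}')" would report which case applies. Only the "unresolved" and "loopback" cases are the ones the coredns workaround is meant for.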

Nothing in particular stands out in the cluster/cloud setup from the details I have - we definitely have validated deployments on EKS in GovCloud in the past. Do you encounter issues with other services, or just authservice? grafana.admin.<domain> might be a good one to test a login to. I'm wondering if something might be misconfigured with security groups between cluster nodes/the load balancer that is causing an issue here?

I haven't seen any errors to indicate this is the issue but if your cert happens to be self-signed or signed by a Private PKI root you would need to add additional configuration for Authservice to properly trust it.

@mikeherrera

@mjnagel, I'm working this issue with Brandon. Any connection to Istio Proxy's DNS_CAPTURE not being enabled by default in a UDS Core deployment? Wouldn't this prevent the Keycloak's package ServiceEntry from being utilized since all DNS lookups would go to kube dns?

@brcourt
Author

brcourt commented Dec 13, 2024

Grafana works perfectly fine without any issues. We are not using a private PKI.

This whole thing is weird because the gateways are publicly available, I can route to my domain properly from outside the cluster. Routing inside the cluster correctly proxies to the NLB public IP addresses so it should work, however the traffic just seemingly gets dropped by envoy with no explanation or logs. And yes, the security groups on the NLB allow ingress from the entire VPC (I wish it were that simple).

@mjnagel
Contributor

mjnagel commented Dec 13, 2024

@mikeherrera so from what we've seen the service entries actually do get used, with the caveat of ONLY when kubedns resolves an address to something outside of the pod (i.e. must resolve and not be localhost). Normally (outside of dev scenarios) DNS for those addresses should meet that criteria, so it should be functional. Once DNS resolves outside of the pod, the istio proxy handles actually directing the traffic to the service entry destination (rather than the original IP it was resolved to).

There are two values noted in this issue that could be used to switch to fully using Istio for DNS. We haven't done a ton of testing on that and hadn't had time to dig into the effects/why it isn't the default upstream which is why we haven't made that the default in uds-core yet. If this does end up being the problem here we can definitely evaluate it to see, but from what we experienced it was only necessary in the scenario described where kubedns either doesn't resolve an address or resolves it to localhost.

If this is a dev cluster it might be worth seeing if network policies are causing any of the issues here and just deleting them across keycloak, authservice, and <your-app> namespaces. We do rely on those service entries to behave as expected in order for our network policies to be correct (example) so if something is going wrong the network policies could be causing an issue.
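
A cautious way to try the network policy suggestion on a dev cluster might look like the sketch below. The function name and the APPLY switch are hypothetical; it defaults to kubectl's client-side dry run so nothing is deleted until explicitly requested.

```shell
# Delete all NetworkPolicies in each given namespace.
# Defaults to a client-side dry run; set APPLY=1 to actually delete.
delete_netpols() {
  for ns in "$@"; do
    if [ "${APPLY:-0}" = "1" ]; then
      kubectl delete networkpolicy --all -n "$ns"
    else
      kubectl delete networkpolicy --all -n "$ns" --dry-run=client
    fi
  done
}
```

For example, delete_netpols keycloak authservice my-app previews the deletions, where my-app stands in for the application namespace. This is only meant for throwaway dev clusters, as the comment above says.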

@mikeherrera

We'll dig into the latter suggestions more, and get back to you.

In regards to the linked issue, that's exactly what I found and continue to experiment with. However, even with DNS_CAPTURE enabled, I'm not seeing a listener being added for the sidecar's proxy DNS.

I understand what you're saying about your intention for the Service Entry, and the caveat, but unless I'm misunderstanding what I'm reading in the docs here, that intention seems contradictory to the functionality intended by Istio?

My understanding - the Istio Proxy should intercept the DNS request so that instead of resolution to the external, AWS NLB name, it would resolve to the matching internal, cluster service name, as defined by the corresponding Service Entry. It makes no sense for the traffic to leave the network just to come back. No matter what we do, we see the sso authservice name resolve to the external NLB DNS with kube DNS versus the Service Entry's value.

I'm going to build a fresh cluster to remove some "variables" and test further.

@mjnagel
Contributor

mjnagel commented Dec 14, 2024

In regards to the linked issue, that's exactly what I found and continue to experiment with. However, even with DNS_CAPTURE enabled, I'm not seeing a listener being added for the sidecar's proxy DNS.

You may have already done this but just to make sure - for any changes to take effect you would need to cycle istiod, followed by the workload itself, I believe, in order to make sure you have the latest proxy config for that workload.
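
That restart order could be scripted roughly as follows. This is a sketch that assumes istiod runs as a Deployment named istiod in the istio-system namespace and that the workload is also a Deployment; the function name is hypothetical.

```shell
# Restart istiod first and wait for it, then restart the workload so its
# sidecar starts with the latest proxy config.
# Usage: cycle_proxy_config <namespace> <deployment>
cycle_proxy_config() {
  kubectl rollout restart deployment/istiod -n istio-system &&
  kubectl rollout status deployment/istiod -n istio-system &&
  kubectl rollout restart "deployment/$2" -n "$1" &&
  kubectl rollout status "deployment/$2" -n "$1"
}
```

Chaining with && stops at the first failing step, so the workload is not restarted against an istiod that did not come back up.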

With regards to the service entry and the DNS docs you linked, DNS proxying is absolutely required for service entries that aren't normally resolvable by DNS. I think a key distinction and clarification is that service entries can still have a place and utility even if not using Istio's DNS proxying. Reading the DNS proxying docs:

While Kubernetes provides DNS resolution for Kubernetes Services out of the box, any custom ServiceEntrys will not be recognized. With this feature, ServiceEntry addresses can be resolved without requiring custom configuration of a DNS server.

The key part there is the last sentence and custom configuration of a DNS server - if a given host in a service entry is DNS resolvable then the DNS proxying is not required to make the ServiceEntry resolvable. Within uds-core we aren't relying on the service entry for resolution necessarily, but we are relying on it to ensure the proxy redirects traffic destined for an (already resolved) host to a specific endpoint (in this case to our cluster service for the gateway). I'll try and identify if there's any clear docs that lay this functionality out in addition to looking at listeners, etc to validate this setup.

@mjnagel
Contributor

mjnagel commented Dec 14, 2024

@mikeherrera I did some validation on the behavior described above, testing outside of uds-core to keep the behavior isolated and simple. This gist contains a walkthrough with two different tests of service entries using k3d and istioctl. The second walkthrough is very similar to our setup in uds-core, creating a service entry for each virtual service in our cluster. Note that this was all done without DNS proxying behavior enabled. Hope that helps to display the behavior a bit better - I haven't found any great documentation on this as most use cases of service entries are either using external hosts (that are truly external to the cluster, without endpoints defined) or using internal hosts.

@brcourt
Author

brcourt commented Dec 16, 2024

Assuming DNS is working as expected, I'm still stuck at getting traffic to reach the sso.uds.dev endpoint. From inside a container in any namespace, that traffic is dropped.

All other traffic works as expected. I can access all other services behind my load balancer. From within my k8s network, I can reach external services. On a brand new cluster with UDS-Core deployed, I deployed the following Zarf package:

apiVersion: v1
kind: Namespace
metadata:
  name: authservice-test-app
---
apiVersion: uds.dev/v1alpha1
kind: Package
metadata:
  name: httpbin-other
  namespace: default
spec:
  sso:
    - name: Demo SSO
      clientId: uds-core-httpbin
      redirectUris:
        - "https://protected.uds.dev/login"
      enableAuthserviceSelector:
        app: httpbin
      # groups:
      #   anyOf:
      #     - "/UDS Core/Admin"
  network:
    expose:
      - service: httpbin
        selector:
          app: httpbin
        gateway: tenant
        host: protected
        port: 8000
        targetPort: 8000
    allow:
      - direction: Ingress
        remoteGenerated: Anywhere

      - direction: Egress
        remoteGenerated: Anywhere
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: httpbin
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  name: httpbin
  namespace: default
  labels:
    app: httpbin
spec:
  ports:
    - name: http
      port: 8000
      targetPort: 8000
  selector:
    app: httpbin
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: httpbin
  template:
    metadata:
      labels:
        app: httpbin
    spec:
      serviceAccountName: httpbin
      containers:
        - image: docker.io/kong/httpbin:0.2.1
          imagePullPolicy: IfNotPresent
          name: httpbin
          command:
            [
              "pipenv",
              "run",
              "gunicorn",
              "-b",
              "0.0.0.0:8000",
              "httpbin:app",
              "-k",
              "gevent",
            ]
          resources:
            limits:
              cpu: 50m
              memory: 64Mi
            requests:
              cpu: 50m
              memory: 64Mi
          ports:
            - containerPort: 8000

So this should be stock UDS and a basic, simple authservice flow. The only thing that is different is the cluster running on EKS.

@mjnagel
Contributor

mjnagel commented Dec 17, 2024

Assuming DNS is working as expected, I'm still stuck at getting traffic to reach the sso.uds.dev endpoint. From inside a container in any namespace, that traffic is dropped.

@brcourt to isolate where the issues might be popping up could you test a few requests in cluster?

  1. This pod will run in the default namespace without istio injection and without any network policies coming from uds-core. If this fails then there is likely a different networking issue at play.
kubectl run curl-pod --rm -it --restart=Never --image=curlimages/curl -- curl -v https://sso.uds.dev
  2. This pod will run in the default namespace with istio injection but no network policies from uds-core. If this fails then there is probably something going wrong at the istio layer.
kubectl run curl-pod --rm -it \
  --restart=Never \
  --image=curlimages/curl \
  --overrides='{
    "metadata": {
      "labels": {
        "sidecar.istio.io/inject": "true",
        "batch.kubernetes.io/job-name": "job"
      }
    }
  }' -- curl -v https://sso.uds.dev

If both of those succeed then we've probably isolated the issue to network policies - were you able to test hitting that endpoint from a pod after deleting all network policies in the namespace?
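
The decision logic behind these tests can be written down as a small helper. This is only a sketch of the reasoning above; the function name and the "ok"/"fail" argument convention are made up for illustration.

```shell
# Map the outcomes of the two curl tests to the layer most likely at fault.
# Arguments: "ok" or "fail" for the plain pod, then for the istio-injected pod.
diagnose_layer() {
  if [ "$1" != "ok" ]; then
    echo "general networking"
  elif [ "$2" != "ok" ]; then
    echo "istio layer"
  else
    echo "network policies"
  fi
}
```

For example, diagnose_layer ok fail points at the istio layer, while diagnose_layer ok ok leaves the network policies as the remaining suspect.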
