This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

Sometimes port 80 goes to the tcp-router healthcheck #1199

Closed
jandubois opened this issue Aug 7, 2020 · 8 comments · Fixed by #1526
Labels: Priority: Medium · Status: Validation (Need to brainstorm before starting) · suse-cap · Type: Bug (Something isn't working)
Comments

@jandubois
Member

Describe the bug

Sometimes I set up kubecf on minikube, log in, and push an app. Everything works, but accessing the app over http returns a 503:

$ curl -I http://12factor.192.168.99.219.omg.howdoi.website
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

Accessing the app via https works:

$ curl -k -I https://12factor.192.168.99.219.omg.howdoi.website
HTTP/1.1 200 OK
Content-Length: 6986
Content-Type: text/html;charset=utf-8
Server: thin
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Vcap-Request-Id: c77c7e13-88f4-4dff-56c2-c94a987bebd5
X-Xss-Protection: 1; mode=block
Date: Thu, 06 Aug 2020 22:56:52 GMT

@mook-as provided some feedback:

I'm going to guess that something got confused again, and port 80 is hitting tcp-routing healthcheck instead
Hand edit the tcp-router-public service to drop port 80 for now (since you don't need it)

So I ran kubectl edit svc -n kubecf tcp-router-public and removed this section:

  - name: healthcheck
    nodePort: 32388
    port: 80
    protocol: TCP
    targetPort: 80

Afterwards the app worked normally on port 80 as well.
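
For reference, the same edit can be done non-interactively with a JSON patch; this is only a sketch, and it assumes the healthcheck entry is the first element of the ports array, so check the index first:

# list the port names to find the index of the healthcheck entry
kubectl -n kubecf get svc tcp-router-public -o jsonpath='{.spec.ports[*].name}'

# drop the healthcheck port by index (index 0 assumed here; adjust as needed)
kubectl -n kubecf patch svc tcp-router-public --type=json \
  -p='[{"op": "remove", "path": "/spec/ports/0"}]'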

I've experienced this issue 4 or 5 times over the last couple of days, so maybe 30-40% of the time.

@jandubois jandubois changed the title Sometimes port 80goes to the tcp-router healthcheck Sometimes port 80 goes to the tcp-router healthcheck Aug 7, 2020
@jandubois jandubois added the Type: Bug Something isn't working label Aug 11, 2020
@fargozhu fargozhu added Priority: Medium Status: Accepted This issue will be implemented in a near future labels Sep 25, 2020
@fargozhu fargozhu added this to the 2.6.0 milestone Oct 11, 2020
@andreas-kupries andreas-kupries self-assigned this Oct 16, 2020
@andreas-kupries
Contributor

A first attempt at reproduction: minikube, diego, SA, using go-env for the app.

In this attempt access via http works:

work@tagetarl:~/SUSE/dev/kubecf-1/_work/go-env> curl -I http://genf.192.168.39.20.xip.io
HTTP/1.1 200 OK
Content-Length: 1635
Content-Type: text/plain; charset=utf-8
Date: Fri, 16 Oct 2020 21:16:32 GMT
X-Vcap-Request-Id: d8cae3ff-a925-460f-6fb7-637153ba259c

Given the non-deterministic nature of the issue, this does not yet reproduce the failure.

That said, I decided to look at the tcp-router-public service, and note that it still listens on port 80 even in this working setup. So I suspect that having the tcp-router-public service listening on 80 is not truly the issue, and more that the go-router dispatches wrongly in the bad case for some reason.

Pulling logs and grepping, I see:

work@tagetarl:~/SUSE/dev/kubecf-1> grep -rn 'genf.192.168.39.20.xip.io' ~/klog/kubecf/
/home/work/klog/kubecf/diego-cell-0/job/route-emitter-route-emitter/kube.log:262:{"timestamp":"2020-10-16T21:02:13.009379733Z","level":"info","source":"route-emitter","message":"route-emitter.watcher.handling-event.set-routes","data":{"after":{"domain":"cf-apps","instances":1,"process-guid":"4d0853db-cdfc-4955-abbc-4755ea5560ec-4136167d-ba2d-4e9a-8c1b-acb4ef95398f","routes":{"cf-router":[{"hostnames":["genf.192.168.39.20.xip.io"],"port":8080,"route_service_url":null,"isolation_segment":null}],"internal-router":[],"tcp-router":[]}},"before":{},"session":"8.177"}}

The app routes are announced via route-emitter, for the go-router to see.

First thought: is there no proper announcement of the route in the bad case?
But then why would removing port 80 from the tcp-router-public service fix things?
So maybe it is more that the go-router gets confused by the multiple destinations for port 80?
And removing the port from the service leaves it (the go-router) with only the app route, forcing the proper dispatch?


@jandubois When you say

I've experienced this issues 4 or 5 times over the last couple of days, so maybe 30-40% of the time.

Is that for different kubecf deployments, or for different app deployments on the same kubecf?
IOW, to repro, do I have to delete and re-deploy kubecf until it fails, or re-deploy apps on the same kubecf until one has the problem?

@fargozhu fargozhu added Status: Validation Need to brainstorm before starting and removed Status: Accepted This issue will be implemented in a near future labels Oct 19, 2020
@mook-as
Contributor

mook-as commented Oct 19, 2020

That said, I decided to look at the tcp-router-public service, and note that it still listens on port 80 even in this working setup. So I suspect that having the tcp-router-public service listening on 80 is not truly the issue, and more that the go-router dispatches wrongly in the bad case for some reason.

I believe the problem is not gorouter; it's that we expose the TCP router health check service at the Kubernetes level on port 80, so which service gets used when you contact host:80 is non-deterministic.

- name: healthcheck
protocol: TCP
port: 80
targetPort: 80

I believe you'll only hit this when not using ingress.
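
To check which kubecf services are competing for port 80, something along these lines should work (a sketch; assumes jq is available):

kubectl -n kubecf get svc -o json \
  | jq -r '.items[] | select(any(.spec.ports[]?; .port == 80)) | "\(.metadata.name)\t\(.spec.type)"'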

@andreas-kupries
Contributor

@mook-as Non-deterministic at what level? Just tried in my current setup (minikube, no ingress, go-env app) ...

A series of 40 curls to the app all returned 200 OK. Whatever non-determinism is involved seems to be something which happens early and then stays fixed for the app.
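
Something like this loop does the check (a sketch, using the app URL from this deployment; any 503 in the tally would indicate the bad binding):

for i in $(seq 1 40); do
  curl -s -o /dev/null -w '%{http_code}\n' http://genf.192.168.39.20.xip.io
done | sort | uniq -c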

Next, re-deployed the app 10 times, and checked if that was enough to see the issue ... It wasn't.

It seems whatever happens is locked in as part of kubecf deployment.

I.e. will now have to re-deploy kubecf several times as part of checking.

@mook-as
Contributor

mook-as commented Oct 19, 2020

@andreas-kupries I believe that once one service has bound, it'll stay bound (so you will consistently see one or the other, until you re-deploy KubeCF).

@andreas-kupries
Contributor

Hm. A race condition then: which service is up first, or at least, is seen and bound first by kube.

@andreas-kupries
Contributor

Definitely a race condition.

@mook-as proposed

I think you may be able to reproduce this more easily if you temporarily delete the router-public service (so that the tcp-router binds to port 80) and then recreate it? Not sure.

With the cluster I had up and working, i.e. http access OK, I then did the following (rough commands sketched after the list):

  • Saved the service definitions
    • k get svc :tcp-router-public -o yaml > TP
    • k get svc :^router-public -o yaml > RP
  • Deleted router-public service ➡️ http access to app becomes 503 Service Unavailable.
  • Recreated the router-public service. http access to app stays 503.
  • Deleted tcp-router-public service ➡️ http access to app becomes 200 OK.
  • Recreated the tcp-router-public service. http access to app stays 200.
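
In plain kubectl commands the sequence was roughly as follows (a sketch only; the exported YAML may need fields such as resourceVersion and status cleaned up before re-applying):

kubectl -n kubecf get svc router-public -o yaml > RP.yaml
kubectl -n kubecf get svc tcp-router-public -o yaml > TP.yaml

kubectl -n kubecf delete svc router-public       # http access to the app becomes 503
kubectl -n kubecf apply -f RP.yaml               # http access stays 503

kubectl -n kubecf delete svc tcp-router-public   # http access to the app becomes 200
kubectl -n kubecf apply -f TP.yaml               # http access stays 200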

So, whichever of the services happens to be bound by kube first on startup is where access to port 80 is dispatched.
Most of the time this is the desired service, router-public.
However it can happen that the tcp-router-public gets bound instead.

With the above we currently have two workarounds to fix the issue when it happens:

  • Edit tcp-router-public via kubectl edit svc -n kubecf tcp-router-public and remove the healthcheck section.
  • Save the tcp-router-public definition, then delete and recreate it. In the window of time where the service is gone kubecf switches the binding of port 80 over to the desired router-public.

A proper fix would be to not declare the healthcheck port from the beginning.

However, IIRC, we need the port declared for some platforms to work correctly, i.e. to not break the entire tcp routing.

IIRC it is AWS which needs the healthcheck port, so that the AWS load balancer can detect availability of the service and thus open it to the public. IOW, without it, AWS would not open/start the load balancer, leaving tcp routing offline.
I think that it is only AWS which needs this (memory is hazy).

Is there a way for a helm chart to detect the kind of kube platform it is deployed to?
If yes, then it might be sensible to detect AWS and expose the healthcheck port only for this platform.
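
If there is no reliable detection, a values flag is probably the pragmatic route. A minimal sketch, assuming a hypothetical .Values.features.tcp_router_healthcheck flag (name invented here) gating the port in the tcp-router-public service template:

ports:
{{- if .Values.features.tcp_router_healthcheck }}
- name: healthcheck
  protocol: TCP
  port: 80
  targetPort: 80
{{- end }}
# ... the regular tcp route ports follow unchanged ...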

ping @jandubois @mook-as

@mook-as
Contributor

mook-as commented Oct 20, 2020

I believe that we need a healthcheck when using a load balancer, at least on EKS / AKS / whatever? But not when we're using ClusterIP.

@andreas-kupries
Contributor

Hm.

kubecf  router             ClusterIP      None             <none>           80/TCP,443/TCP                                 
kubecf  router-0           ClusterIP      10.105.145.219   <none>           80/TCP,443/TCP                                 
kubecf  router-public      LoadBalancer   10.103.75.127    192.168.39.198   80:31314/TCP,443:32037/TCP                     
kubecf  tcp-router         ClusterIP      None             <none>           80/TCP, ...
kubecf  tcp-router-0       ClusterIP      10.99.189.224    <none>           80/TCP, ...
kubecf  tcp-router-public  LoadBalancer   10.107.193.90    192.168.39.198   80:30966/TCP, ...

The two exposed services are LoadBalancer, and per the values.yaml that is the default (see the services: hierarchy).
Note, this looks to be the case for all platforms.
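
For a dev/minikube setup where tcp routing does not have to be reachable from outside, overriding the service type might sidestep the collision entirely. A rough sketch, with field names guessed from the services: hierarchy (verify against the chart's values.yaml):

# dev-values.yaml (field names assumed; check against the chart's values.yaml)
services:
  tcp-router:
    type: ClusterIP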

Another thing to consider: we have seen this only for minikube deployments, right?
IOW, if this is not happening for the public platforms, only for dev, then this issue may not even be medium priority.

ping @jandubois @mook-as
