This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

Sometimes port 80 goes to the tcp-router healthcheck #1199

Closed
jandubois opened this issue Aug 7, 2020 · 8 comments · Fixed by #1526
Labels: Priority: Medium · Status: Validation (Need to brainstorm before starting) · suse-cap · Type: Bug (Something isn't working)
Comments

@jandubois
Member

Describe the bug

Sometimes I set up kubecf on minikube, log in, and push an app. Everything works, but accessing the app over http returns a 503:

$ curl -I http://12factor.192.168.99.219.omg.howdoi.website
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

Accessing the app via https works:

$ curl -k -I https://12factor.192.168.99.219.omg.howdoi.website
HTTP/1.1 200 OK
Content-Length: 6986
Content-Type: text/html;charset=utf-8
Server: thin
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Vcap-Request-Id: c77c7e13-88f4-4dff-56c2-c94a987bebd5
X-Xss-Protection: 1; mode=block
Date: Thu, 06 Aug 2020 22:56:52 GMT

@mook-as provided some feedback:

I'm going to guess that something got confused again, and port 80 is hitting tcp-routing healthcheck instead
Hand edit the tcp-router-public service to drop port 80 for now (since you don't need it)

So I ran kubectl edit svc -n kubecf tcp-router-public and removed this section:

  - name: healthcheck
    nodePort: 32388
    port: 80
    protocol: TCP
    targetPort: 80

Afterwards the app worked normally on port 80 as well.
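
For reference, the same edit can be done non-interactively with a JSON patch; this is only a sketch, and it assumes the healthcheck entry is the first element of the ports array, so check the index first:

# list the port names to find the index of the healthcheck entry
kubectl -n kubecf get svc tcp-router-public -o jsonpath='{.spec.ports[*].name}'

# drop the healthcheck port by index (index 0 assumed here; adjust as needed)
kubectl -n kubecf patch svc tcp-router-public --type=json \
  -p='[{"op": "remove", "path": "/spec/ports/0"}]'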

I've experienced this issue 4 or 5 times over the last couple of days, so maybe 30-40% of the time.

@jandubois jandubois changed the title Sometimes port 80goes to the tcp-router healthcheck Sometimes port 80 goes to the tcp-router healthcheck Aug 7, 2020
@jandubois jandubois added the Type: Bug Something isn't working label Aug 11, 2020
@fargozhu fargozhu added Priority: Medium Status: Accepted This issue will be implemented in a near future labels Sep 25, 2020
@fargozhu fargozhu added this to the 2.6.0 milestone Oct 11, 2020
@andreas-kupries andreas-kupries self-assigned this Oct 16, 2020
@andreas-kupries
Contributor

A first attempt at reproduction: minikube, diego, SA, using go-env for the app.

In this attempt access via http works:

work@tagetarl:~/SUSE/dev/kubecf-1/_work/go-env> curl -I http://genf.192.168.39.20.xip.io
HTTP/1.1 200 OK
Content-Length: 1635
Content-Type: text/plain; charset=utf-8
Date: Fri, 16 Oct 2020 21:16:32 GMT
X-Vcap-Request-Id: d8cae3ff-a925-460f-6fb7-637153ba259c

Given the non-deterministic nature of the issue, this does not yet reproduce the failure.

That said, I decided to look at the tcp-router-public service, and note that it still listens on port 80 even in this working setup. So I suspect that having the tcp-router-public service listening on 80 is not truly the issue, and more that the go-router dispatches wrongly in the bad case for some reason.

Pulling logs and grepping, I see:

work@tagetarl:~/SUSE/dev/kubecf-1> grep -rn 'genf.192.168.39.20.xip.io' ~/klog/kubecf/
/home/work/klog/kubecf/diego-cell-0/job/route-emitter-route-emitter/kube.log:262:{"timestamp":"2020-10-16T21:02:13.009379733Z","level":"info","source":"route-emitter","message":"route-emitter.watcher.handling-event.set-routes","data":{"after":{"domain":"cf-apps","instances":1,"process-guid":"4d0853db-cdfc-4955-abbc-4755ea5560ec-4136167d-ba2d-4e9a-8c1b-acb4ef95398f","routes":{"cf-router":[{"hostnames":["genf.192.168.39.20.xip.io"],"port":8080,"route_service_url":null,"isolation_segment":null}],"internal-router":[],"tcp-router":[]}},"before":{},"session":"8.177"}}

The app routes are announced via route-emitter, for the go-router to see.

First thought: is there no proper announcement of the route in the bad case?
But then why would removing port 80 from the tcp-router-public service fix things?
So maybe it is more that the go-router gets confused by the multiple destinations for port 80?
And removing the port from the service leaves it (the go-router) with only the app route, forcing the proper dispatch?


@jandubois When you say

I've experienced this issues 4 or 5 times over the last couple of days, so maybe 30-40% of the time.

Is that for different kubecf deployments, or for different app deployments on the same kubecf?
IOW, to repro, do I have to delete and re-deploy kubecf until it fails, or re-deploy apps on the same kubecf until one has the problem?

@fargozhu fargozhu added Status: Validation Need to brainstorm before starting and removed Status: Accepted This issue will be implemented in a near future labels Oct 19, 2020
@mook-as
Contributor

mook-as commented Oct 19, 2020

That said, I decided to look at the tcp-router-public service, and note that it still listens on port 80 even in this working setup. So I suspect that having the tcp-router-public service listening on 80 is not truly the issue, and more that the go-router dispatches wrongly in the bad case for some reason.

I believe the problem is not gorouter; it's that we expose the TCP router health check service at the Kubernetes level on port 80, so which service gets used when you contact host:80 is non-deterministic.

- name: healthcheck
protocol: TCP
port: 80
targetPort: 80

I believe you'll only hit this when not using ingress.
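
To check which kubecf services are competing for port 80, something along these lines should work (a sketch; assumes jq is available):

kubectl -n kubecf get svc -o json \
  | jq -r '.items[] | select(any(.spec.ports[]?; .port == 80)) | "\(.metadata.name)\t\(.spec.type)"'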

@andreas-kupries
Contributor

@mook-as Non-deterministic at what level? Just tried in my current setup (minikube, no ingress, go-env app) ...

A series of 40 curls to the app all returned 200 OK. Whatever non-determinism is involved seems to be something which happens early and then stays fixed for the app.
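
Something like this loop does the check (a sketch, using the app URL from this deployment; any 503 in the tally would indicate the bad binding):

for i in $(seq 1 40); do
  curl -s -o /dev/null -w '%{http_code}\n' http://genf.192.168.39.20.xip.io
done | sort | uniq -c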

Next, re-deployed the app 10 times, and checked if that was enough to see the issue ... It wasn't.

It seems whatever happens is locked in as part of kubecf deployment.

I.e. will now have to re-deploy kubecf several times as part of checking.

@mook-as
Contributor

mook-as commented Oct 19, 2020

@andreas-kupries I believe that once one service has bound, it'll stay bound (so you will consistently see one or the other, until you re-deploy KubeCF).

@andreas-kupries
Contributor

Hm. A race condition then: which service is up first, or at least, is seen and bound first by kube.

@andreas-kupries
Contributor

Definitely a race condition.

@mook-as proposed

I think you may be able to reproduce this more easily if you temporarily delete the router-public service (so that the tcp-router binds to port 80) and then recreate it? Not sure.

With the cluster I had up and working, i.e. http access OK, I then did the following (rough commands sketched after the list):

  • Saved the service definitions
    • k get svc :tcp-router-public -o yaml > TP
    • k get svc :^router-public -o yaml > RP
  • Deleted router-public service ➡️ http access to app becomes 503 Service Unavailable.
  • Recreated the router-public service. http access to app stays 503.
  • Deleted tcp-router-public service ➡️ http access to app becomes 200 OK.
  • Recreated the tcp-router-public service. http access to app stays 200.
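
In plain kubectl commands the sequence was roughly as follows (a sketch only; the exported YAML may need fields such as resourceVersion and status cleaned up before re-applying):

kubectl -n kubecf get svc router-public -o yaml > RP.yaml
kubectl -n kubecf get svc tcp-router-public -o yaml > TP.yaml

kubectl -n kubecf delete svc router-public       # http access to the app becomes 503
kubectl -n kubecf apply -f RP.yaml               # http access stays 503

kubectl -n kubecf delete svc tcp-router-public   # http access to the app becomes 200
kubectl -n kubecf apply -f TP.yaml               # http access stays 200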

So, whichever of the services happens to be bound by kube first on startup is where access to port 80 is dispatched.
Most of the time this is the desired service, router-public.
However it can happen that the tcp-router-public gets bound instead.

With the above we currently have two workarounds to fix the issue when it happens:

  • Edit tcp-router-public via kubectl edit svc -n kubecf tcp-router-public and remove the healthcheck section.
  • Save the tcp-router-public definition, then delete and recreate it. In the window of time where the service is gone kubecf switches the binding of port 80 over to the desired router-public.

A proper fix would be to not declare the healthcheck port from the beginning.

However, IIRC, we need the port declared for some platforms to work correctly, i.e. to not break the entire tcp routing.

IIRC it is AWS which needs the healthcheck port, so that the AWS load balancer can detect availability of the service and thus open it to the public. IOW, without it, AWS would not open/start the load balancer, leaving tcp routing offline.
I think that it is only AWS which needs this (memory is hazy).

Is there a way for a helm chart to detect the kind of kube platform it is deployed to?
If yes, then it might be sensible to detect AWS and expose the healthcheck port only for this platform.
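
If there is no reliable detection, a values flag is probably the pragmatic route. A minimal sketch, assuming a hypothetical .Values.features.tcp_router_healthcheck flag (name invented here) gating the port in the tcp-router-public service template:

ports:
{{- if .Values.features.tcp_router_healthcheck }}
- name: healthcheck
  protocol: TCP
  port: 80
  targetPort: 80
{{- end }}
# ... the regular tcp route ports follow unchanged ...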

ping @jandubois @mook-as

@mook-as
Contributor

mook-as commented Oct 20, 2020

I believe that we need a healthcheck when using a load balancer, at least on EKS / AKS / whatever? But not when we're using ClusterIP.

@andreas-kupries
Contributor

Hm.

kubecf  router             ClusterIP      None             <none>           80/TCP,443/TCP                                 
kubecf  router-0           ClusterIP      10.105.145.219   <none>           80/TCP,443/TCP                                 
kubecf  router-public      LoadBalancer   10.103.75.127    192.168.39.198   80:31314/TCP,443:32037/TCP                     
kubecf  tcp-router         ClusterIP      None             <none>           80/TCP, ...
kubecf  tcp-router-0       ClusterIP      10.99.189.224    <none>           80/TCP, ...
kubecf  tcp-router-public  LoadBalancer   10.107.193.90    192.168.39.198   80:30966/TCP, ...

The two exposed services are LoadBalancer, and per the values.yaml that is the default (see the services: hierarchy).
Note, this looks to be the case for all platforms.
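
For a dev/minikube setup where tcp routing does not have to be reachable from outside, overriding the service type might sidestep the collision entirely. A rough sketch, with field names guessed from the services: hierarchy (verify against the chart's values.yaml):

# dev-values.yaml (field names assumed; check against the chart's values.yaml)
services:
  tcp-router:
    type: ClusterIP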

Another thing to consider: we have seen this only for minikube deployments, right?
IOW, if this is not happening for the public platforms, only for dev, then this issue may not even be medium priority.

ping @jandubois @mook-as
