
Possible envoy regression causes HTTP 404 #2468

Closed
alex1989hu opened this issue Apr 25, 2020 · 20 comments
Comments

@alex1989hu
Contributor

alex1989hu commented Apr 25, 2020

What steps did you take and what happened:
I upgraded Contour from 1.3.0 to 1.4.0 and now see many 404 errors in the envoy pod log. The services behind HTTPProxy cannot be loaded; I get a 404 error in the browser, too. Repeatedly hitting the refresh button in my browser temporarily works around it: the page eventually loads. I first thought it was a network issue, but after downgrading to 1.3.0 the services behind HTTPProxy loaded instantly, with no error log in envoy. I reinstalled the whole cluster from scratch with 1.4.0 and the same symptom appeared. Downgrading back to 1.3.0 solved the issue again.

UPDATE: actually it is 404, not 401.

What did you expect to happen:
No 404 error.

Anything else you would like to add:

  • 100% reproducible
  • Using GitOps for all the Kubernetes-related configuration; no external change occurred

Environment:

  • Contour version:
1.4.0
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:50:46Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version:
v1.18.0
  • Cloud provider or hardware configuration:
external: cloud-provider-vsphere
  • OS (e.g. from /etc/os-release):
Talos (v0.4.1)
@youngnick
Member

Thanks for the report @alex1989hu. Can you give us a sample HTTPProxy so we can attempt to replicate, please?

@alex1989hu
Contributor Author

There is an A record and several CNAMEs; Contour does TLSCertificateDelegation.

I installed 1.4.0 again today to confirm that it is the problem. I found that the error code is 404, not 401. My mistake, sorry.

Let me share the simplest service, which just runs nginx; a more complex service has the same defect.

Extra:
After installing 1.4.0 I see drop alerts and missing TCPv4 SYN-ACKs in Grafana, inspected with Hubble. Using 1.3.0 does not cause any of them.

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: knowledgebase-proxy
  namespace: knowledgebase
  labels:
    app.kubernetes.io/name: knowledgebase
spec:
  virtualhost:
    fqdn: kb.foo.bar.acme.com
    tls:
      secretName: projectcontour/wildcard-foobar-cert
  routes:
    - conditions:
      - prefix: /
      services:
        - name: knowledgebase-service
          port: 8080
          responseHeadersPolicy:
            set:
              - name: Strict-Transport-Security
                value: "max-age=31536000; includeSubdomains"
              - name: X-Content-Type-Options
                value: nosniff
              - name: X-Frame-Options
                value: DENY
              - name: X-XSS-Protection
                value: "1; mode=block"

@alex1989hu alex1989hu changed the title from "Possible envoy regression causes HTTP 401" to "Possible envoy regression causes HTTP 404" on Apr 26, 2020
@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu This could be a result of the SNI binding change. Does the client support SNI, and is it using kb.foo.bar.acme.com as the SNI server name?

@alex1989hu
Contributor Author

@jpeach: do you mean a client like Chrome or Edge? Sure, I entered kb.foo.bar.acme.com in the browser's address bar.

@jpeach
Contributor

jpeach commented Apr 26, 2020

Is it an intermittent 404, or permanent?

@alex1989hu
Contributor Author

Intermittent. It can also go wrong after the browser was able to load the content; a simple page refresh can cause a 404.

@jpeach
Contributor

jpeach commented Apr 26, 2020

An intermittent problem suggests to me that there's an issue with only some of the envoy proxies. Do you have logs that let you correlate the 404s to specific envoy proxies?

@alex1989hu
Contributor Author

I am working on a cluster with 1.3.0 but can easily spin up a new one. Do you need any extra configuration, or anything like the steps listed at https://projectcontour.io/docs/v1.4.0/troubleshooting/ ?

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu At this point, I don't think we need any new config; we need to narrow the scope of the issue. I think the most likely cause is the SNI issue, but I don't know why your usage would not work with that change. The other thing we need to understand is why the 404 is intermittent.

Testing against a clean 1.4 cluster could be worthwhile. LMK what you find.

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu Can you share any of the 404 logs?

@alex1989hu
Contributor Author

@jpeach: search for kb.foo.bar.acme.com or gitops.foo.bar.acme.com in envoy-v1.4.0-http404. Both produced the symptom.

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu Are there separate HTTPProxy documents for gitops.foo.bar.acme.com and kb.foo.bar.acme.com?

In the log, I can see two adjacent entries:

[2020-04-26T22:28:27.120Z] "GET / HTTP/2" 404 NR 0 0 0 - "10.107.8.251" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36" "45fc064a-3124-47f2-b49e-cffb6de031e9" "gitops.foo.bar.acme.com" "-"
[2020-04-26T22:30:02.908Z] "GET / HTTP/2" 200 - 0 375 3 2 "10.107.8.251" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36" "77f6f10f-da03-4c1f-b369-e7e922571932" "gitops.foo.bar.acme.com" "10.244.0.234:8080"

This looks like the same request from the same client, but with different results. The only way I can explain this is if there are multiple envoys running with different configurations. Is that possible?

@alex1989hu
Contributor Author

@jpeach: The cluster was freshly installed with v1.4.0. As I mentioned in the initial post, the configuration has been the same since the very beginning of cluster bootstrapping (no downgrade or upgrade; the same configuration as for v1.3.0, etc.). It is a 5-worker-node cluster.

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu I've been assuming that you have separate HTTPProxy documents for gitops and kb, is that correct?

Can you post (private is OK) an Envoy config dump? See the troubleshooting guide, and curl the config_dump endpoint.

Can you also show me the pod status for contour and envoy?

  • kubectl get pods -n projectcontour
  • kubectl get svc envoy -n projectcontour -o yaml

@alex1989hu
Contributor Author

FYI: I contacted @jpeach directly via Slack. Running contour v1.4.0 with envoy v1.3.1 shows the same defect. The suspect is now "contour: HTTP/2 session re-use". We tried starting chrome.exe --disable-http2 as a workaround, which solves the issue, so HTTP/2 session re-use seems like the right track.

@lmickh

lmickh commented Apr 27, 2020

I've found similar behavior after upgrading as well. It appears to be related to HTTP/2 connection coalescing: the SNI (envoy authority) does not match the requested host name and Envoy 404s the connection, similar to how it behaves in #1493. So far I've only seen it impact users on Mozilla Firefox, which fits, since Firefox appears to have the most aggressive connection coalescing from what I've read. I'm pretty certain this is due to #2381, but given the Envoy CVE it probably shouldn't be reverted until Envoy comes up with a fix.

I was able to work around it by issuing separate certs for each virtualhost and updating the httpproxy to use the cert for that virtualhost instead of using a wildcard that covered them all.
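
Applied to the HTTPProxy example above, the workaround is roughly the following: reference a per-host secret in the same namespace instead of the delegated wildcard (the secret name below is illustrative; how you issue the certificate is up to you):

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: knowledgebase-proxy
  namespace: knowledgebase
spec:
  virtualhost:
    fqdn: kb.foo.bar.acme.com
    tls:
      # per-host certificate covering only kb.foo.bar.acme.com; secret name is illustrative
      secretName: kb-foo-bar-acme-com-tls
  routes:
    - conditions:
        - prefix: /
      services:
        - name: knowledgebase-service
          port: 8080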

@jpeach jpeach self-assigned this Apr 27, 2020
@jpeach
Contributor

jpeach commented Apr 27, 2020

Thanks for the CVE reference @lmickh

xref envoyproxy/envoy#6767

@jpeach
Contributor

jpeach commented Apr 28, 2020

I dug into this some more and read the various RFC specs and history. I'm pretty comfortable saying that this is a duplicate of #1493. If we serve a wildcard certificate, browsers will consider existing connections OK to reuse even if the SNI server names associated with the connections differ, because each origin name will match successfully against the wildcard certificate.

This previously worked in Contour because we did not enforce any binding between the SNI server name and the origin hostname. I'd argue that this was always a mis-feature since it breaks multi-tenancy (tenant services are visible to each other), but in 1.4 we had to fix it so that we could make guarantees about TLS client certificate authentication (we can't allow a non-authenticated session to make requests to an origin that requires authentication). Unfortunately, this means that more users are exposed to the underlying problem originally documented in #1493, since basically anyone using wildcard certificates will be affected.

AFAICT, in the Envoy configuration we generate, there are no security implications. The problem manifests as a 404 response; we never forward a request to an inappropriate origin.

The workaround, as noted above by @lmickh, is to avoid wildcard certificates and get a separate certificate for each hostname. I understand that's not going to be possible for everyone. I expect that the right fix is to convince Envoy to serve a 421 response when the SNI server name doesn't match the hosted origin.

Duplicate of #1493.

@jpeach jpeach closed this as completed Apr 28, 2020
@jpeach
Contributor

jpeach commented Apr 28, 2020

Duplicate of #1493

@jpeach jpeach marked this as a duplicate of #1493 Apr 28, 2020
@erwbgy
Contributor

erwbgy commented May 21, 2020

It is worth noting that this is not limited to wildcard certificates. It also occurs if the same Secret object is used for more than one FQDN in different Ingress objects, for example. But the message is the same - use different certificates for each FQDN.
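
For example, a pair of Ingress objects like the following can trigger it even without a wildcard, as long as the certificate in the shared Secret covers both host names (all names here are illustrative):

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: app-a
  namespace: team-a
spec:
  tls:
    - hosts:
        - a.acme.com
      secretName: shared-tls   # same Secret object...
  rules:
    - host: a.acme.com
      http:
        paths:
          - path: /
            backend:
              serviceName: app-a
              servicePort: 80
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: app-b
  namespace: team-a
spec:
  tls:
    - hosts:
        - b.acme.com
      secretName: shared-tls   # ...reused for a second FQDN
  rules:
    - host: b.acme.com
      http:
        paths:
          - path: /
            backend:
              serviceName: app-b
              servicePort: 80

Because both virtual hosts present the same certificate, a browser that already has an HTTP/2 connection open for a.acme.com may reuse it for b.acme.com, and the SNI binding then produces the 404 described above.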
