
Possible envoy regression causes HTTP 404 #2468

Closed
alex1989hu opened this issue Apr 25, 2020 · 20 comments
Comments

@alex1989hu
Contributor

alex1989hu commented Apr 25, 2020

What steps did you take and what happened:
I upgraded Contour from 1.3.0 to 1.4.0 and now see many 404 errors in the envoy pod log. The services behind HTTPProxy cannot be loaded; I get a 404 error in the browser, too. Repeatedly hitting the refresh button in my browser temporarily works around it: the page eventually loads. I first thought it was a network issue, but after downgrading to 1.3.0 the services behind HTTPProxy loaded instantly, with no error log in envoy. I reinstalled the whole cluster from scratch with 1.4.0 and the same symptom appeared. Downgrading back to 1.3.0 solved the issue again.

UPDATE: actually it is 404, not 401.

What did you expect to happen:
No 404 error.

Anything else you would like to add:

  • 100% reproducible
  • Using GitOps for all the Kubernetes-related configuration; no external change occurred

Environment:

  • Contour version:
1.4.0
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:50:46Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version:
v1.18.0
  • Cloud provider or hardware configuration:
external: cloud-provider-vsphere
  • OS (e.g. from /etc/os-release):
Talos (v0.4.1)
@youngnick
Member

Thanks for the report @alex1989hu. Can you give us a sample HTTPProxy so we can attempt to replicate, please?

@alex1989hu
Contributor Author

There is an A record and several CNAMEs; Contour does TLSCertificateDelegation.

I installed 1.4.0 again today to confirm that it is the problem. I found that the error code is 404, not 401. My mistake, sorry.

Let me share the simplest service, which just runs nginx; a more complex service has the same defect.

Extra:
After installing 1.4.0 I see drop alerts and missing TCPv4 SYN-ACKs in Grafana, inspected with Hubble. Using 1.3.0 does not cause any of them.

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: knowledgebase-proxy
  namespace: knowledgebase
  labels:
    app.kubernetes.io/name: knowledgebase
spec:
  virtualhost:
    fqdn: kb.foo.bar.acme.com
    tls:
      secretName: projectcontour/wildcard-foobar-cert
  routes:
    - conditions:
      - prefix: /
      services:
        - name: knowledgebase-service
          port: 8080
          responseHeadersPolicy:
            set:
              - name: Strict-Transport-Security
                value: "max-age=31536000; includeSubdomains"
              - name: X-Content-Type-Options
                value: nosniff
              - name: X-Frame-Options
                value: DENY
              - name: X-XSS-Protection
                value: "1; mode=block"

@alex1989hu alex1989hu changed the title from "Possible envoy regression causes HTTP 401" to "Possible envoy regression causes HTTP 404" on Apr 26, 2020
@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu This could be a result of the SNI binding change. Does the client support SNI, and is it using kb.foo.bar.acme.com as the SNI server name?

@alex1989hu
Contributor Author

@jpeach: do you mean a client like Chrome or Edge? Sure, I entered kb.foo.bar.acme.com in the browser's address bar.

@jpeach
Contributor

jpeach commented Apr 26, 2020

Is it an intermittent 404, or permanent?

@alex1989hu
Contributor Author

Intermittent. It can also go wrong after the browser was able to load the content; a simple page refresh can cause a 404.

@jpeach
Contributor

jpeach commented Apr 26, 2020

An intermittent problem suggests to me that there's an issue with only some of the envoy proxies. Do you have logs that let you correlate the 404s to specific envoy proxies?

@alex1989hu
Contributor Author

I am working on a cluster with 1.3.0 but can easily spin up a new one. Do you need any extra configuration, or anything like the steps listed at https://projectcontour.io/docs/v1.4.0/troubleshooting/ ?

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu At this point, I don't think we need any new config; we need to narrow the scope of the issue. I think the most likely cause is the SNI issue, but I don't know why your usage would not work with that change. The other thing we need to understand is why the 404 is intermittent.

Testing against a clean 1.4 cluster could be worthwhile. LMK what you find.

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu Can you share any of the 404 logs?

@alex1989hu
Contributor Author

@jpeach: search for kb.foo.bar.acme.com or gitops.foo.bar.acme.com in envoy-v1.4.0-http404. Both produced the symptom.

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu Are there separate HTTPProxy documents for gitops.foo.bar.acme.com and kb.foo.bar.acme.com?

In the log, I can see two adjacent entries:

[2020-04-26T22:28:27.120Z] "GET / HTTP/2" 404 NR 0 0 0 - "10.107.8.251" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36" "45fc064a-3124-47f2-b49e-cffb6de031e9" "gitops.foo.bar.acme.com" "-"
[2020-04-26T22:30:02.908Z] "GET / HTTP/2" 200 - 0 375 3 2 "10.107.8.251" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36" "77f6f10f-da03-4c1f-b369-e7e922571932" "gitops.foo.bar.acme.com" "10.244.0.234:8080"

This looks like the same request from the same client, but with different results. The only way I can explain this is if there are multiple envoys running with different configurations. Is that possible?

@alex1989hu
Contributor Author

@jpeach: The cluster was freshly installed with v1.4.0. As I mentioned in the initial post, the configuration has been the same since the very beginning of cluster bootstrapping (no downgrade or upgrade; the same configuration as for v1.3.0, etc.). It is a 5-worker-node cluster.

@jpeach
Contributor

jpeach commented Apr 26, 2020

@alex1989hu I've been assuming that you have separate HTTPProxy documents for gitops and kb, is that correct?

Can you post (private is OK) an Envoy config dump? See the troubleshooting guide, and curl the config_dump endpoint.

Can you also show me the pod status for contour and envoy?

  • kubectl get pods -n projectcontour
  • kubectl get svc envoy -n projectcontour -o yaml

@alex1989hu
Contributor Author

FYI: I contacted @jpeach directly via Slack. Running contour v1.4.0 with envoy v1.3.1 shows the same defect. The suspect is now "contour: HTTP/2 session re-use". We tried starting chrome.exe --disable-http2 as a workaround, which solves the issue, so HTTP/2 session re-use seems like the right track.

@lmickh

lmickh commented Apr 27, 2020

I've found similar behavior after upgrading as well. It appears to be related to HTTP/2 connection coalescing: the SNI (envoy authority) does not match the requested host name and Envoy 404s the connection, similar to how it behaves in #1493. So far I've only seen it impact users on Mozilla Firefox, which fits, since Firefox appears to have the most aggressive connection coalescing from what I've read. I'm pretty certain this is due to #2381, but given the Envoy CVE it probably shouldn't be reverted until Envoy comes up with a fix.

I was able to work around it by issuing separate certs for each virtualhost and updating the httpproxy to use the cert for that virtualhost instead of using a wildcard that covered them all.
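
Applied to the HTTPProxy example above, the workaround is roughly the following: reference a per-host secret in the same namespace instead of the delegated wildcard (the secret name below is illustrative; how you issue the certificate is up to you):

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: knowledgebase-proxy
  namespace: knowledgebase
spec:
  virtualhost:
    fqdn: kb.foo.bar.acme.com
    tls:
      # per-host certificate covering only kb.foo.bar.acme.com; secret name is illustrative
      secretName: kb-foo-bar-acme-com-tls
  routes:
    - conditions:
        - prefix: /
      services:
        - name: knowledgebase-service
          port: 8080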

@jpeach jpeach self-assigned this Apr 27, 2020
@jpeach
Contributor

jpeach commented Apr 27, 2020

Thanks for the CVE reference @lmickh

xref envoyproxy/envoy#6767

@jpeach
Contributor

jpeach commented Apr 28, 2020

I dug into this some more and read the various RFC specs and history. I'm pretty comfortable saying that this is a duplicate of #1493. If we serve a wildcard certificate, browsers will consider existing connections OK to reuse even if the SNI server names associated with the connections differ, because each origin name will match successfully against the wildcard certificate.

This previously worked in Contour because we did not enforce any binding between the SNI server name and the origin hostname. I'd argue that this was always a mis-feature since it breaks multi-tenancy (tenant services are visible to each other), but in 1.4 we had to fix it so that we could make guarantees about TLS client certificate authentication (we can't allow a non-authenticated session to make requests to an origin that requires authentication). Unfortunately, this means that more users are exposed to the underlying problem originally documented in #1493, since basically anyone using wildcard certificates will be affected.

AFAICT, in the Envoy configuration we generate, there are no security implications. The problem manifests as a 404 response; we never forward a request to an inappropriate origin.

The workaround, as noted above by @lmickh, is to avoid wildcard certificates and get a separate certificate for each hostname. I understand that's not going to be possible for everyone. I expect that the right fix is to convince Envoy to serve a 421 response when the SNI server name doesn't match the hosted origin.

Duplicate of #1493.

@jpeach jpeach closed this as completed Apr 28, 2020
@jpeach
Contributor

jpeach commented Apr 28, 2020

Duplicate of #1493

@jpeach jpeach marked this as a duplicate of #1493 Apr 28, 2020
@erwbgy
Contributor

erwbgy commented May 21, 2020

It is worth noting that this is not limited to wildcard certificates. It also occurs if the same Secret object is used for more than one FQDN in different Ingress objects, for example. But the message is the same - use different certificates for each FQDN.
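
For example, a pair of Ingress objects like the following can trigger it even without a wildcard, as long as the certificate in the shared Secret covers both host names (all names here are illustrative):

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: app-a
  namespace: team-a
spec:
  tls:
    - hosts:
        - a.acme.com
      secretName: shared-tls   # same Secret object...
  rules:
    - host: a.acme.com
      http:
        paths:
          - path: /
            backend:
              serviceName: app-a
              servicePort: 80
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: app-b
  namespace: team-a
spec:
  tls:
    - hosts:
        - b.acme.com
      secretName: shared-tls   # ...reused for a second FQDN
  rules:
    - host: b.acme.com
      http:
        paths:
          - path: /
            backend:
              serviceName: app-b
              servicePort: 80

Because both virtual hosts present the same certificate, a browser that already has an HTTP/2 connection open for a.acme.com may reuse it for b.acme.com, and the SNI binding then produces the 404 described above.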
