
Request for resource fails when gzip is enabled with Envoy as origin for Azure CDN #28828

Closed
mderazon opened this issue Aug 4, 2023 · 11 comments
Labels
area/compression area/http bug stale stalebot believes this issue/PR has not been touched recently

Comments

mderazon commented Aug 4, 2023

Note: Copied from emissary-ingress/emissary#4821. We are using Ambassador/Emissary Ingress, but the problem may be related to the underlying Envoy proxy.

Recently, Microsoft rolled out an update to some of their CDN endpoints that is intended to be more compliant with the RFC on range requests.

When gzip is enabled on Ambassador, requests from Azure CDN to the Ambassador origin fail. According to a Microsoft specialist, the problem is with Ambassador/Envoy.

As mentioned, the problem is due to a change on their end to make their edge endpoints more compliant with the RFC. They are rolling out this change gradually, and therefore the problem is not yet reproducible in all regions. One region we know is affected is India/Delhi (if you access the resource from there, the issue occurs).

Here's an example resource:
CDN URL (problem occurs): https://chatonce.azureedge.net/assets/images/zoom.svg
Ambassador origin URL (problem doesn't occur): https://ccgw.oncehub.com/oh-customer-front/assets/images/zoom.svg

The network path goes through Azure CDN/Front Door --> Emissary Ingress (Envoy) --> Next.js service

When the resource is accessed through the CDN with gzip enabled, in one of the affected regions, the request stays in a "pending" state and eventually fails.
When the resource is accessed directly from the Ambassador URL in a browser, it does not fail.

Here's Microsoft's response to this:

As we understand it, the issue is due to range requests not being correctly handled by the customer origin (Envoy).

When sending a request through Azure Front Door (targeting a proxy environment) with the header "Accept-Encoding: identity", the customer origin specifies the correct content length (Content-Length: 12870), which matches the length of the payload returned to curl.

When sending the same request with the header "Accept-Encoding: gzip", we can see that the request hangs and times out. This issue is due to the customer origin incorrectly handling range requests.

Recently, an Azure Health Advisory went out to all AFD customers. For reference, the health advisory reads as follows:

You're receiving this notification because you've been identified as a user for Azure Front Door classic/standard/premium SKUs.   

Background 

To continuously improve the quality and resiliency of the Azure Front Door platform, we periodically roll out improvements to multiple PoPs across the world. These improvements are rolled out as platform upgrades in a safe deployment manner.   

Change

 As a part of the platform improvement process, we're adding stricter protocol implementation policies for both HTTP and HTTPS between AFD/CDN PoPs and customer origins/backends. In this case, if customer origins/backends aren't following HTTP and TLS protocol as per RFCs (e.g., HTTP/1.1: RFC 7230, 7231, 7232, 7233, 7234; TLS 1.2: RFC 5246), then it may result in incorrect behaviors and degraded experience for end users.

E.g.: We recently pushed an upgrade causing more range requests to be sent to customer origins (to gain additional efficiency in distributing the caching of large customer resources in Azure Front Door). Unfortunately, certain customer origins didn't properly respond to range requests when the “Accept-Encoding: gzip” header was present and returned an invalid “Content-Range” header value (for example, Content-Range: bytes 0-923/924 when the actual response body isn't 924 bytes), resulting in failed client requests. If a customer server doesn't properly handle range requests, per RFC it's acceptable to ignore the Range header and return a non-range response.

Recommended action

We strongly recommend you verify HTTP protocol conformance on your backends/origins configured behind your Azure Front Door or Azure CDN profiles according to RFCs 7230, 7231, 7232, 7233, 7234, 5246. If you identify some anomalies, please fix those in a timely manner to avoid potential incidents in your front door or CDN profiles. 

One thing you should verify works properly on the origin/backend is the handling of HTTP range requests. If the origin doesn't have proper support for range requests, it should simply ignore the Range header and return a regular, non-range response (e.g., with status code 200). But if the origin returns a response with status code 206 it implies support for range requests. In that case, it must send a valid Content-Range header. If compression is used, the Content-Range header values must be based on the compressed (encoded) resource size.

For example, suppose an uncompressed resource on the origin is 100 KB: but with gzip compression, it's only 20 KB. Suppose the origin receives a range request for "bytes=0-1048575" (that is, the first 1 MB) and suppose that the header “Accept-Encoding: gzip” is also present. If the origin chooses to return a range response (status code 206) and it chooses to compress the resource (“Content-Encoding: gzip”), it should set the value of the Content-Range to “bytes 0-20479/20480”, indicating that the response contains 20 KB of data (not 100 KB of data).
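The arithmetic in the advisory's example can be sketched in a few lines. This is a Python illustration, not Envoy code, and the helper name is made up; it shows that both the byte slice and the Content-Range total must be computed against the compressed representation, not the original resource:

```python
import gzip

def gzip_range_response(body: bytes, first: int, last: int):
    """Hypothetical helper: build the payload and Content-Range value
    for a 206 response over the *compressed* representation, as the
    advisory requires."""
    compressed = gzip.compress(body)
    total = len(compressed)
    # Clamp the requested last byte to the compressed size (RFC 7233:
    # a range may not extend past the end of the representation).
    last = min(last, total - 1)
    chunk = compressed[first:last + 1]
    content_range = f"bytes {first}-{last}/{total}"
    return chunk, content_range

# 100 KB of uncompressed data, but the range is served from (and the
# Content-Range computed against) the much smaller gzip payload.
chunk, cr = gzip_range_response(b"a" * 102400, 0, 1048575)
```

The key point is the clamp: asking for the "first 1 MB" of a resource whose compressed size is tiny must yield a Content-Range whose total is the compressed size.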

Here are some additional CURL requests demonstrating the issue:

Good response (Envoy only):

curl -s -o /dev/null -v -H "Accept-Encoding: gzip" https://ccgw.oncehub.com/oh-customer-front/assets/images/zoom.svg
*   Trying 52.184.200.53:443...
* Connected to ccgw.oncehub.com (52.184.200.53) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [102 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [5656 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.oncehub.com
*  start date: Jan 18 00:00:00 2023 GMT
*  expire date: Feb 18 23:59:59 2024 GMT
*  subjectAltName: host "ccgw.oncehub.com" matched cert's "*.oncehub.com"
*  issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* Using Stream ID: 1 (easy handle 0x55ed78f05560)
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /oh-customer-front/assets/images/zoom.svg HTTP/2
> Host: ccgw.oncehub.com
> user-agent: curl/7.81.0
> accept: */*
> accept-encoding: gzip
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
< HTTP/2 200
< date: Fri, 04 Aug 2023 09:46:26 GMT
< content-type: image/svg+xml
< request-context: appId=cid-v1:
< accept-ranges: bytes
< cache-control: public, max-age=0
< last-modified: Fri, 28 Jul 2023 13:03:31 GMT
< etag: W/"2fb-1899c98dcb8"
< vary: Accept-Encoding
< x-envoy-upstream-service-time: 3
< strict-transport-security: max-age=15724800
< content-encoding: gzip
< server: envoy
<
{ [419 bytes data]
* Connection #0 to host ccgw.oncehub.com left intact

Bad response (Azure CDN + Envoy):

curl -s -o /dev/null -v -H "Accept-Encoding: gzip" https://chatonce.azureedge.net/assets/images/zoom.svg
*   Trying 13.107.246.72:443...
* Connected to chatonce.azureedge.net (13.107.246.72) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [106 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [4695 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=WA; L=Redmond; O=Microsoft Corporation; CN=*.azureedge.net
*  start date: Jul 18 23:00:30 2023 GMT
*  expire date: Jun 27 23:59:59 2024 GMT
*  subjectAltName: host "chatonce.azureedge.net" matched cert's "*.azureedge.net"
*  issuer: C=US; O=Microsoft Corporation; CN=Microsoft Azure TLS Issuing CA 02
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* Using Stream ID: 1 (easy handle 0x55e6fde60560)
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /assets/images/zoom.svg HTTP/2
> Host: chatonce.azureedge.net
> user-agent: curl/7.81.0
> accept: */*
> accept-encoding: gzip
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
< HTTP/2 200
< date: Fri, 04 Aug 2023 09:48:28 GMT
< content-type: image/svg+xml
< content-length: 763
< request-context: appId=cid-v1:
< cache-control: public, max-age=0
< last-modified: Fri, 28 Jul 2023 13:03:31 GMT
< etag: W/"2fb-1899c98dcb8"
< vary: Accept-Encoding
< x-envoy-upstream-service-time: 3
< strict-transport-security: max-age=15724800
< content-encoding: gzip
< x-azure-ref: 20230804T094726Z-tqbhhkr8y16eh4pxhaazm8tumw00000003kg00000002we2d
< x-cache: TCP_MISS
< accept-ranges: bytes
<
* HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
* stopped the pause stream!
* Connection #0 to host chatonce.azureedge.net left intact
@mderazon mderazon added bug triage Issue requires triage labels Aug 4, 2023
@htuch htuch added area/http area/compression and removed triage Issue requires triage labels Aug 4, 2023

htuch commented Aug 4, 2023

@mattklein123 @ggreenway @alyssawilk for thoughts.

@alyssawilk
Contributor

Looks like gzip is owned by Matt and @KBaichoo. Kevin, any chance you can take a look?


mderazon commented Aug 16, 2023

Thank you @alyssawilk, @KBaichoo

Here's some more information about the issue:
https://techcommunity.microsoft.com/t5/fasttrack-for-azure/how-to-test-azure-front-door-origins-for-valid-http-range/ba-p/3745208

And here's a reproduction, according to the article:

curl -v -o ./output.txt --http2 --range "0-1023" -H "Accept-Encoding: gzip" https://ccgw.oncehub.com/oh-customer-front/assets/images/webex-meetings.svg
  1. This URL points to our Envoy (Ambassador / Emissary Ingress) deployment
  2. I've chosen this resource (webex-meetings.svg) as it's more than 1024 bytes

curl response is:

< HTTP/2 206
< date: Wed, 16 Aug 2023 10:32:31 GMT
< content-type: image/svg+xml
< request-context: appId=cid-v1:
< accept-ranges: bytes
< cache-control: public, max-age=0
< last-modified: Wed, 09 Aug 2023 09:41:33 GMT
< etag: W/"e085-189d9ac44c8"
< content-range: bytes 0-1023/57477
< vary: Accept-Encoding
< x-envoy-upstream-service-time: 2
< strict-transport-security: max-age=15724800
< content-encoding: gzip
< server: envoy
<
{ [370 bytes data]

According to the article

If the data size returned is not exactly 1024 bytes, the response is invalid.

In this case it's 370 bytes so the response is invalid.

I've also run the testing tool provided by @DanielLarsenNZ, and it confirmed the same.
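The article's pass/fail rule can be expressed as a small check. This is a Python sketch with a hypothetical function name, fed with the status code, Content-Range header, and body size observed in the repro above:

```python
def range_response_is_valid(status: int, content_range, body_len: int) -> bool:
    """Apply the article's rule: a 200 (range ignored) is fine; a 206
    must deliver exactly the number of bytes its Content-Range claims."""
    if status == 200:
        return True          # origin ignored the Range header: allowed
    if status != 206:
        return False
    if not content_range:
        return False         # a 206 without Content-Range is invalid
    span, total = content_range.removeprefix("bytes ").split("/")
    first, last = (int(x) for x in span.split("-"))
    # The body must contain exactly last - first + 1 bytes, and the
    # range must lie inside the declared total size.
    return body_len == last - first + 1 and last < int(total)

# The 206 above claims bytes 0-1023 but delivers only 370 bytes:
print(range_response_is_valid(206, "bytes 0-1023/57477", 370))  # False
```

Applied to the curl output above (206, `content-range: bytes 0-1023/57477`, 370 bytes of data), the check fails, matching the article's conclusion.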


mderazon commented Aug 16, 2023

On second look, I am not sure whether this behavior comes from Envoy or from the underlying Next.js server that Envoy routes to.
I am trying to see how to disable Next.js's range-request handling to test the hypothesis.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 15, 2023
@alphamonkey79

My team hit this issue as well while migrating from Lumen CDN to Azure Front Door, with Apache web servers as the origin backend. Everything worked fine until we decided to enable compression. We are still working out a solution.

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Sep 21, 2023
@mderazon
Author

@alphamonkey79 this problem is mostly with MS, see DanielLarsenNZ/nodejs-express-range-headers#1

@alphamonkey79

Update:
We found a load-balancer profile configuration difference (BigIP F5) between our environments.

All but one of our environments do not compress.
Our problem environment has an F5 “HTTP Compression Profile” set to “http_compression-all-no-vary”.

This appears to be the root cause of our problem and is directly tied to the Microsoft article mentioned in this thread:
https://techcommunity.microsoft.com/t5/fasttrack-for-azure/how-to-test-azure-front-door-origins-for-valid-http-range/ba-p/3745208

The solution for us should be to remove this profile on the F5 for the problem environment.

Thank you for listening.

@mscbpi

mscbpi commented Oct 20, 2023

Envoy should ignore Range requests when compression is in place, since it cannot produce a compliant HTTP 206 answer with a correctly computed Content-Range header.

A fair, compliant workaround is, in that very case of compression, to ignore the client's Range request and answer with the whole compressed content in an HTTP 200 response.

  • Handling Range headers is optional, so answering with a 200 is OK.
  • Answering with an HTTP 206 that carries wrong Content-Range values is not OK.

Meanwhile, another workaround is to disable compression in Envoy and let the CDN handle it, but you are right to point out that Envoy should give a compliant HTTP answer in any case.
RFC 7233
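The compliant workaround described above can be sketched origin-side like this. This is a hypothetical Python illustration of the rule, not Envoy code; it only parses closed ranges like `bytes=0-1023`:

```python
import gzip

def respond(headers: dict, body: bytes):
    """Hypothetical origin handler implementing the workaround: if the
    client accepts gzip, ignore any Range header (RFC 7233 makes range
    support optional) and send the full compressed body with a 200."""
    if "gzip" in headers.get("Accept-Encoding", ""):
        payload = gzip.compress(body)
        # Always 200 here, never a 206 whose Content-Range we cannot
        # compute safely against the compressed representation.
        return 200, {"Content-Encoding": "gzip",
                     "Content-Length": str(len(payload))}, payload
    if "Range" in headers:
        # Uncompressed path: serving a byte range here is safe.
        first, last = (int(x) for x in
                       headers["Range"].removeprefix("bytes=").split("-"))
        last = min(last, len(body) - 1)
        chunk = body[first:last + 1]
        return 206, {"Content-Range": f"bytes {first}-{last}/{len(body)}",
                     "Content-Length": str(len(chunk))}, chunk
    return 200, {"Content-Length": str(len(body))}, body
```

The design choice is the one mscbpi describes: a 200 that ignores the Range header is always compliant, while a 206 with a Content-Range computed against the wrong representation is not.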


@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 19, 2023

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 26, 2023